Matches in SemOpenAlex for { <https://semopenalex.org/work/W2883175344> ?p ?o ?g. }
- W2883175344 abstract "Script consists of 2 parts: article parser and aligner.

Required software (install before using the script): yalign. Additional Ubuntu packages: mongodb, ipython, python-nose, python-werkzeug.

Wiki article parser

The article parser works in 2 steps:
  1. Extracts articles from wiki dumps.
  2. Saves extracted articles to a local DB (MongoDB).

Before using the parser, wiki dumps should be downloaded and extracted to some directory (the directory should contain *.xml and *.sql files). For each language, 2 dump files should be downloaded: the articles dump and the links dump. Here are examples:
  PL:
    http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2
    http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-langlinks.sql.gz
  EN:
    http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
    http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-langlinks.sql.gz

IMPORTANT NOTE: English dumps will require about 50 GB of free space after extraction. During parsing, the parser can require up to 8 GB of RAM.

The article parser has a language option that sets the main language: articles in other languages are extracted only if they also exist in the main language. E.g. if the main language is PL, the article extractor first extracts all articles for PL, then extracts articles for other languages only if those articles have a PL translation. This reduces space requirements.

For help use:
  $ python parse_wiki_dumps.py -h
Example command:
  $ python parse_wiki_dumps.py -d ~/temp/wikipedia_dump/ -l pl -v

Wikipedia aligner

The aligner can be used once articles have been extracted from the dumps. The aligner takes article pairs for the given language pair, aligns the text and saves the parallel corpora to 2 files. Option -s can be used to limit the number of symbols in a file (by default the size is 50000000 symbols, which is around 50-60 MB). By default the aligner tries to continue aligning where it was stopped; to force aligning from the beginning, use the --restart key.

For help use:
  $ python align.py -h
Example command:
  $ python align.py -o wikipedia -l en-pl -v

Euronews crawler

The crawler finds links to articles using the euronews archive http://euronews.com/2004/, and in parallel extracts and saves article texts to the DB.

For help use:
  $ python parse_euronews.py -h
Example command:
  $ python parse_euronews.py -l en,pl -v

Euronews aligner

Starting the aligner for euronews articles:
  $ python align.py -o euronews -l en-pl -v

Saving articles in plain text

This script can be used to save all articles in plain text format. It accepts the path for saving articles, the languages of the articles to be saved, and the source of the articles (euronews, wikipedia).

For help use:
  $ python save_plain_text.py -h
Example command:
  $ python save_plain_text.py -l en,pl -r [path] -o euronews

Yalign selection

This script tries random parameters for the yalign model in order to find the best parameters for aligning the provided text samples. Before using the yalign_selection script, article samples need to be prepared using the prepare_random_sampling.py script. The folder with article samples can be created with this command:
  $ python prepare_random_sampling.py -o wikipedia -c 10 -l ru-en -v
where:
  -o wikipedia : source of articles, can be wikipedia or euronews
  -c 10 : number of articles to extract
  -l ru-en : languages to extract

This script will create an article_samples folder with article files. You can then create manually aligned files (you need to align the articles of the second language). For this example you need to align file; files named _orig should be left unmodified.

When manual aligning is ready you can run the selection script. Here is an example:
  $ python yalign_selection.py --samples article_samples/ --lang1 ru --lang2 en --threshold 0.1536422609112349e-6 --threshold_step 0.0000001 --threshold_step_count 10 --penalty 0.014928930455303857 --penalty_step 0.0001 --penalty_step_count 1 -m ru-en

Here is what each parameter means:
  --samples article_samples/ : path to the article samples folder
  --lang1 ru --lang2 en : languages to align (articles of the second language should be aligned manually; the script will take the ??_orig files, align them automatically and compare the result with the manually aligned files)
  --threshold 0.1536422609112349e-6 : threshold value of the model; selection will be made around this value
  --threshold_step 0.0000001 : step by which the value is changed
  --threshold_step_count 10 : number of steps to check below and above the value, e.g. if the value is 10, the step is 1 and the count is 2, the script will check 8 9 10 11 12
  --penalty, --penalty_step, --penalty_step_count : same parameters for the penalty
  -m ru-en : path to the yalign model

You can also use --length and --similarity to tweak the comparison of text lines in files:
  --length : min difference in length in order to mark lines similar; 1 means same length, 0.5 means at least half of the length
  --similarity : similarity of text in lines; 1 means exactly the same, 0 means completely different. For the similarity check, sentences are compared as sequences of characters.

The script already has multiprocessing support. Use the -t option to set the number of threads; by default the number of threads equals the number of CPUs. For additional parameters you can use the '-h' key.

When the yalign_selection.py script finishes its work, it produces a csv file in which the first column is the threshold, the second column is the penalty, and the third column is the similarity for these parameters.

Align with hunalign method

In order to use hunalign you need to add the --hunalign option to the align.py script. Here is an example:
  $ python align.py -l li-hu -r align_result -o wikipedia --hunalign
In my empirical study it provides better results when articles are translations of each other or are similar in length and content.

Align from folder

For aligning already aligned texts using hunalign, an example command is:
  $ python align_aligned_using_hunalign.py source/ target/

Final info

Wolk, K., & Marasek, K. (2015, September). Tuned and GPU-accelerated parallel data mining from comparable corpora. In International Conference on Text, Speech, and Dialogue (pp. 32-40). Springer International Publishing. http://arxiv.org/pdf/1509.08639

For more detailed usage instructions see howto.pdf.

For any questions: Krzysztof Wolk, krzysztof@wolk.pl" @default.
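The parameter sweep and the line comparison described in the abstract above can be illustrated with a short Python sketch. It is not taken from the tool: the function names `sweep_values` and `lines_similar` are hypothetical, the default cut-offs are illustrative, and `difflib.SequenceMatcher` is only an assumed stand-in for the character-sequence comparison the abstract mentions.

```python
from difflib import SequenceMatcher


def sweep_values(center, step, count):
    """Return the values checked around `center`: `count` steps below and above it.

    Hypothetical helper mirroring the description of --threshold,
    --threshold_step and --threshold_step_count.
    """
    return [center + i * step for i in range(-count, count + 1)]


def lines_similar(a, b, min_length_ratio=0.5, min_similarity=0.8):
    """Decide whether two lines count as similar (assumed logic, not the tool's).

    min_length_ratio plays the role of --length (1 = same length,
    0.5 = the shorter line is at least half as long as the longer one).
    min_similarity plays the role of --similarity (1 = exactly the same,
    0 = completely different), computed on character sequences.
    """
    if not a or not b:
        return a == b
    length_ratio = min(len(a), len(b)) / max(len(a), len(b))
    if length_ratio < min_length_ratio:
        return False
    return SequenceMatcher(None, a, b).ratio() >= min_similarity


# The worked example from the abstract: value 10, step 1, count 2.
print(sweep_values(10, 1, 2))  # [8, 9, 10, 11, 12]
```

With a threshold step count of 10 and a penalty step count of 1, as in the example command, such a sweep would cover 21 threshold values and 3 penalty values, that is 63 combinations, assuming every pair is tried.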
- W2883175344 created "2018-08-03" @default.
- W2883175344 creator A5030149703 @default.
- W2883175344 date "2018-07-18" @default.
- W2883175344 modified "2023-09-23" @default.
- W2883175344 title "Parallel Corpora from Comparable Corpora tool" @default.
- W2883175344 hasPublicationYear "2018" @default.
- W2883175344 type Work @default.
- W2883175344 sameAs 2883175344 @default.
- W2883175344 citedByCount "0" @default.
- W2883175344 crossrefType "journal-article" @default.
- W2883175344 hasAuthorship W2883175344A5030149703 @default.
- W2883175344 hasConcept C111919701 @default.
- W2883175344 hasConcept C136764020 @default.
- W2883175344 hasConcept C186644900 @default.
- W2883175344 hasConcept C199360897 @default.
- W2883175344 hasConcept C2777683733 @default.
- W2883175344 hasConcept C41008148 @default.
- W2883175344 hasConcept C510870499 @default.
- W2883175344 hasConcept C519991488 @default.
- W2883175344 hasConcept C77088390 @default.
- W2883175344 hasConcept C8797682 @default.
- W2883175344 hasConceptScore W2883175344C111919701 @default.
- W2883175344 hasConceptScore W2883175344C136764020 @default.
- W2883175344 hasConceptScore W2883175344C186644900 @default.
- W2883175344 hasConceptScore W2883175344C199360897 @default.
- W2883175344 hasConceptScore W2883175344C2777683733 @default.
- W2883175344 hasConceptScore W2883175344C41008148 @default.
- W2883175344 hasConceptScore W2883175344C510870499 @default.
- W2883175344 hasConceptScore W2883175344C519991488 @default.
- W2883175344 hasConceptScore W2883175344C77088390 @default.
- W2883175344 hasConceptScore W2883175344C8797682 @default.
- W2883175344 hasLocation W28831753441 @default.
- W2883175344 hasOpenAccess W2883175344 @default.
- W2883175344 hasPrimaryLocation W28831753441 @default.
- W2883175344 hasRelatedWork W116427613 @default.
- W2883175344 hasRelatedWork W1506839621 @default.
- W2883175344 hasRelatedWork W1751084107 @default.
- W2883175344 hasRelatedWork W2186180326 @default.
- W2883175344 hasRelatedWork W2188471191 @default.
- W2883175344 hasRelatedWork W2221019170 @default.
- W2883175344 hasRelatedWork W2286895783 @default.
- W2883175344 hasRelatedWork W234833970 @default.
- W2883175344 hasRelatedWork W2483934409 @default.
- W2883175344 hasRelatedWork W2507267608 @default.
- W2883175344 hasRelatedWork W2755577984 @default.
- W2883175344 hasRelatedWork W2905203060 @default.
- W2883175344 hasRelatedWork W2912411735 @default.
- W2883175344 hasRelatedWork W3127699831 @default.
- W2883175344 hasRelatedWork W418364874 @default.
- W2883175344 hasRelatedWork W618288093 @default.
- W2883175344 hasRelatedWork W624632001 @default.
- W2883175344 hasRelatedWork W64150 @default.
- W2883175344 hasRelatedWork W2815752107 @default.
- W2883175344 hasRelatedWork W2976207458 @default.
- W2883175344 isParatext "false" @default.
- W2883175344 isRetracted "false" @default.
- W2883175344 magId "2883175344" @default.
- W2883175344 workType "article" @default.