Matches in SemOpenAlex for { <https://semopenalex.org/work/W3025490068> ?p ?o ?g. }
- W3025490068 abstract "Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT and use the Generative Pre-training (GPT) language model as a domain filter to balance data domains. We evaluate the proposed method on the WMT 2018 Parallel Corpus Filtering shared task, and on our own web-crawled Japanese-Chinese parallel corpus. Our method significantly outperforms baselines and achieves a new state-of-the-art. In an unsupervised setting, our method achieves comparable performance to the top-1 supervised method. We also evaluate on a web-crawled Japanese-Chinese parallel corpus that we make publicly available." @default.
- W3025490068 created "2020-05-21" @default.
- W3025490068 creator A5003777137 @default.
- W3025490068 creator A5041833974 @default.
- W3025490068 creator A5062124133 @default.
- W3025490068 date "2020-05-13" @default.
- W3025490068 modified "2023-09-27" @default.
- W3025490068 title "Parallel Corpus Filtering via Pre-trained Language Models" @default.
- W3025490068 cites W1834000468 @default.
- W3025490068 cites W1905522558 @default.
- W3025490068 cites W2117278770 @default.
- W3025490068 cites W2134800885 @default.
- W3025490068 cites W2147262247 @default.
- W3025490068 cites W2155607551 @default.
- W3025490068 cites W2211796614 @default.
- W3025490068 cites W2419539795 @default.
- W3025490068 cites W2496235729 @default.
- W3025490068 cites W2763856713 @default.
- W3025490068 cites W2773493195 @default.
- W3025490068 cites W2798389157 @default.
- W3025490068 cites W2902918014 @default.
- W3025490068 cites W2902949028 @default.
- W3025490068 cites W2903182367 @default.
- W3025490068 cites W2903297715 @default.
- W3025490068 cites W2960374072 @default.
- W3025490068 cites W2962735107 @default.
- W3025490068 cites W2963281280 @default.
- W3025490068 cites W2963341956 @default.
- W3025490068 cites W2963602293 @default.
- W3025490068 cites W2963626623 @default.
- W3025490068 cites W2963661177 @default.
- W3025490068 cites W2963919854 @default.
- W3025490068 cites W2970686691 @default.
- W3025490068 cites W3037465386 @default.
- W3025490068 cites W3203064768 @default.
- W3025490068 cites W630532510 @default.
- W3025490068 doi "https://doi.org/10.48550/arxiv.2005.06166" @default.
- W3025490068 hasPublicationYear "2020" @default.
- W3025490068 type Work @default.
- W3025490068 sameAs 3025490068 @default.
- W3025490068 citedByCount "4" @default.
- W3025490068 countsByYear W30254900682020 @default.
- W3025490068 countsByYear W30254900682021 @default.
- W3025490068 crossrefType "posted-content" @default.
- W3025490068 hasAuthorship W3025490068A5003777137 @default.
- W3025490068 hasAuthorship W3025490068A5041833974 @default.
- W3025490068 hasAuthorship W3025490068A5062124133 @default.
- W3025490068 hasBestOaLocation W30254900681 @default.
- W3025490068 hasConcept C104317684 @default.
- W3025490068 hasConcept C105580179 @default.
- W3025490068 hasConcept C106131492 @default.
- W3025490068 hasConcept C119857082 @default.
- W3025490068 hasConcept C137293760 @default.
- W3025490068 hasConcept C149364088 @default.
- W3025490068 hasConcept C154945302 @default.
- W3025490068 hasConcept C162324750 @default.
- W3025490068 hasConcept C185592680 @default.
- W3025490068 hasConcept C187736073 @default.
- W3025490068 hasConcept C203005215 @default.
- W3025490068 hasConcept C204321447 @default.
- W3025490068 hasConcept C2777530160 @default.
- W3025490068 hasConcept C2780451532 @default.
- W3025490068 hasConcept C2985367798 @default.
- W3025490068 hasConcept C31972630 @default.
- W3025490068 hasConcept C39890363 @default.
- W3025490068 hasConcept C41008148 @default.
- W3025490068 hasConcept C55493867 @default.
- W3025490068 hasConceptScore W3025490068C104317684 @default.
- W3025490068 hasConceptScore W3025490068C105580179 @default.
- W3025490068 hasConceptScore W3025490068C106131492 @default.
- W3025490068 hasConceptScore W3025490068C119857082 @default.
- W3025490068 hasConceptScore W3025490068C137293760 @default.
- W3025490068 hasConceptScore W3025490068C149364088 @default.
- W3025490068 hasConceptScore W3025490068C154945302 @default.
- W3025490068 hasConceptScore W3025490068C162324750 @default.
- W3025490068 hasConceptScore W3025490068C185592680 @default.
- W3025490068 hasConceptScore W3025490068C187736073 @default.
- W3025490068 hasConceptScore W3025490068C203005215 @default.
- W3025490068 hasConceptScore W3025490068C204321447 @default.
- W3025490068 hasConceptScore W3025490068C2777530160 @default.
- W3025490068 hasConceptScore W3025490068C2780451532 @default.
- W3025490068 hasConceptScore W3025490068C2985367798 @default.
- W3025490068 hasConceptScore W3025490068C31972630 @default.
- W3025490068 hasConceptScore W3025490068C39890363 @default.
- W3025490068 hasConceptScore W3025490068C41008148 @default.
- W3025490068 hasConceptScore W3025490068C55493867 @default.
- W3025490068 hasLocation W30254900681 @default.
- W3025490068 hasOpenAccess W3025490068 @default.
- W3025490068 hasPrimaryLocation W30254900681 @default.
- W3025490068 hasRelatedWork W1538473846 @default.
- W3025490068 hasRelatedWork W2759075081 @default.
- W3025490068 hasRelatedWork W2891836487 @default.
- W3025490068 hasRelatedWork W2946260846 @default.
- W3025490068 hasRelatedWork W2985215540 @default.
- W3025490068 hasRelatedWork W2990400634 @default.
- W3025490068 hasRelatedWork W3107474891 @default.
- W3025490068 hasRelatedWork W3164405410 @default.
- W3025490068 hasRelatedWork W61293283 @default.
- W3025490068 hasRelatedWork W970670907 @default.
- W3025490068 isParatext "false" @default.