Matches in SemOpenAlex for { <https://semopenalex.org/work/W4385990940> ?p ?o ?g. }
Showing items 1 to 97 of 97 with 100 items per page.
- W4385990940 endingPage "365" @default.
- W4385990940 startingPage "348" @default.
- W4385990940 abstract "In recent years, the field of document understanding has progressed a lot. A significant part of this progress has been possible thanks to the use of language models pretrained on large amounts of documents. However, pretraining corpora used in the domain of document understanding are single domain, monolingual, or nonpublic. Our goal in this paper is to propose an efficient pipeline for creating a big-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl, as PDF files are the most canonical types of documents as considered in document understanding. We analyzed extensively all of the steps of the pipeline and proposed a solution which is a trade-off between data quality and processing time. We also share a CCpdf corpus in a form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models." @default.
- W4385990940 created "2023-08-19" @default.
- W4385990940 creator A5008212803 @default.
- W4385990940 creator A5011962351 @default.
- W4385990940 creator A5054685387 @default.
- W4385990940 creator A5077188385 @default.
- W4385990940 creator A5079629977 @default.
- W4385990940 date "2023-01-01" @default.
- W4385990940 modified "2023-10-14" @default.
- W4385990940 title "CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data" @default.
- W4385990940 cites W1966382373 @default.
- W4385990940 cites W2801930304 @default.
- W4385990940 cites W2962772269 @default.
- W4385990940 cites W3003711898 @default.
- W4385990940 cites W3034238904 @default.
- W4385990940 cites W3104953317 @default.
- W4385990940 cites W3113497636 @default.
- W4385990940 cites W3113753692 @default.
- W4385990940 cites W3169483174 @default.
- W4385990940 cites W3174269049 @default.
- W4385990940 cites W3175301726 @default.
- W4385990940 cites W3176851559 @default.
- W4385990940 cites W3202839357 @default.
- W4385990940 cites W3204562006 @default.
- W4385990940 cites W3206996280 @default.
- W4385990940 cites W3213241618 @default.
- W4385990940 cites W4320482280 @default.
- W4385990940 doi "https://doi.org/10.1007/978-3-031-41682-8_22" @default.
- W4385990940 hasPublicationYear "2023" @default.
- W4385990940 type Work @default.
- W4385990940 citedByCount "0" @default.
- W4385990940 crossrefType "book-chapter" @default.
- W4385990940 hasAuthorship W4385990940A5008212803 @default.
- W4385990940 hasAuthorship W4385990940A5011962351 @default.
- W4385990940 hasAuthorship W4385990940A5054685387 @default.
- W4385990940 hasAuthorship W4385990940A5077188385 @default.
- W4385990940 hasAuthorship W4385990940A5079629977 @default.
- W4385990940 hasConcept C100368936 @default.
- W4385990940 hasConcept C105702510 @default.
- W4385990940 hasConcept C110875604 @default.
- W4385990940 hasConcept C111472728 @default.
- W4385990940 hasConcept C134306372 @default.
- W4385990940 hasConcept C136764020 @default.
- W4385990940 hasConcept C137293760 @default.
- W4385990940 hasConcept C138885662 @default.
- W4385990940 hasConcept C154945302 @default.
- W4385990940 hasConcept C199360897 @default.
- W4385990940 hasConcept C202444582 @default.
- W4385990940 hasConcept C204321447 @default.
- W4385990940 hasConcept C23123220 @default.
- W4385990940 hasConcept C2779530757 @default.
- W4385990940 hasConcept C33923547 @default.
- W4385990940 hasConcept C36503486 @default.
- W4385990940 hasConcept C41008148 @default.
- W4385990940 hasConcept C43521106 @default.
- W4385990940 hasConcept C71901391 @default.
- W4385990940 hasConcept C71924100 @default.
- W4385990940 hasConcept C9652623 @default.
- W4385990940 hasConceptScore W4385990940C100368936 @default.
- W4385990940 hasConceptScore W4385990940C105702510 @default.
- W4385990940 hasConceptScore W4385990940C110875604 @default.
- W4385990940 hasConceptScore W4385990940C111472728 @default.
- W4385990940 hasConceptScore W4385990940C134306372 @default.
- W4385990940 hasConceptScore W4385990940C136764020 @default.
- W4385990940 hasConceptScore W4385990940C137293760 @default.
- W4385990940 hasConceptScore W4385990940C138885662 @default.
- W4385990940 hasConceptScore W4385990940C154945302 @default.
- W4385990940 hasConceptScore W4385990940C199360897 @default.
- W4385990940 hasConceptScore W4385990940C202444582 @default.
- W4385990940 hasConceptScore W4385990940C204321447 @default.
- W4385990940 hasConceptScore W4385990940C23123220 @default.
- W4385990940 hasConceptScore W4385990940C2779530757 @default.
- W4385990940 hasConceptScore W4385990940C33923547 @default.
- W4385990940 hasConceptScore W4385990940C36503486 @default.
- W4385990940 hasConceptScore W4385990940C41008148 @default.
- W4385990940 hasConceptScore W4385990940C43521106 @default.
- W4385990940 hasConceptScore W4385990940C71901391 @default.
- W4385990940 hasConceptScore W4385990940C71924100 @default.
- W4385990940 hasConceptScore W4385990940C9652623 @default.
- W4385990940 hasLocation W43859909401 @default.
- W4385990940 hasOpenAccess W4385990940 @default.
- W4385990940 hasPrimaryLocation W43859909401 @default.
- W4385990940 hasRelatedWork W1542790140 @default.
- W4385990940 hasRelatedWork W2051833850 @default.
- W4385990940 hasRelatedWork W2144007828 @default.
- W4385990940 hasRelatedWork W2171573941 @default.
- W4385990940 hasRelatedWork W2385015894 @default.
- W4385990940 hasRelatedWork W3156164993 @default.
- W4385990940 hasRelatedWork W4243313575 @default.
- W4385990940 hasRelatedWork W4287845917 @default.
- W4385990940 hasRelatedWork W4321258516 @default.
- W4385990940 hasRelatedWork W4360873893 @default.
- W4385990940 isParatext "false" @default.
- W4385990940 isRetracted "false" @default.
- W4385990940 workType "book-chapter" @default.
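
Each item above follows the shape "- subject predicate object @graph.", so the listing can be read back into structured form. The following is a minimal sketch, assuming that line shape holds for every item; the names `LINE_RE`, `parse_line`, and `group_by_predicate` are hypothetical helpers introduced here for illustration and are not part of SemOpenAlex.

```python
import re
from collections import defaultdict

# Assumed shape of one listing line: "- SUBJECT PREDICATE OBJECT @GRAPH."
# (hypothetical pattern derived from the items shown above).
LINE_RE = re.compile(r'^-\s+(\S+)\s+(\S+)\s+(.*)\s+@(\S+?)\.$')


def parse_line(line):
    """Return (subject, predicate, object, graph), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj, graph = match.groups()
    # Literal values are quoted in the listing; strip one surrounding pair of quotes.
    if len(obj) >= 2 and obj[0] == '"' and obj[-1] == '"':
        obj = obj[1:-1]
    return subject, predicate, obj, graph


def group_by_predicate(lines):
    """Collect object values per predicate, preserving listing order."""
    grouped = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            _, predicate, obj, _ = parsed
            grouped[predicate].append(obj)
    return dict(grouped)


sample = [
    '- W4385990940 startingPage "348" @default.',
    '- W4385990940 endingPage "365" @default.',
    '- W4385990940 cites W1966382373 @default.',
    '- W4385990940 cites W2801930304 @default.',
]
props = group_by_predicate(sample)
print(props["cites"])
```

Grouping by predicate keeps repeated properties such as cites or hasConcept together as lists, while single-valued properties such as startingPage end up as one-element lists.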