Matches in SemOpenAlex for { <https://semopenalex.org/work/W3156216837> ?p ?o ?g. }
- W3156216837 abstract "As language models are trained on ever more text, researchers are turning to some of the largest corpora available. Unlike most other types of datasets in NLP, large unlabeled text corpora are often presented with minimal documentation, and best practices for documenting them have not been established. In this work we provide the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin with a high-level summary of the data, including distributions of where the text came from and when it was written. We then give more detailed analysis on salient parts of this data, including the most frequent sources of text (e.g., this http URL, which contains a significant percentage of machine translated and/or OCR'd text), the effect that the filters had on the data (they disproportionately remove text in AAE), and evidence that some other benchmark NLP dataset examples are contained in the text. We release a web interface to an interactive, indexed copy of this dataset, encouraging the community to continuously explore and report additional findings." @default.
- W3156216837 created "2021-04-26" @default.
- W3156216837 creator A5008013895 @default.
- W3156216837 creator A5015128745 @default.
- W3156216837 creator A5035088083 @default.
- W3156216837 creator A5052307927 @default.
- W3156216837 creator A5059265033 @default.
- W3156216837 creator A5068360032 @default.
- W3156216837 creator A5087098432 @default.
- W3156216837 date "2021-04-18" @default.
- W3156216837 modified "2023-09-27" @default.
- W3156216837 title "Documenting the English Colossal Clean Crawled Corpus." @default.
- W3156216837 cites W131533222 @default.
- W3156216837 cites W1566289585 @default.
- W3156216837 cites W1599016936 @default.
- W3156216837 cites W1962695696 @default.
- W3156216837 cites W2092987299 @default.
- W3156216837 cites W2097590161 @default.
- W3156216837 cites W2099813784 @default.
- W3156216837 cites W2130158090 @default.
- W3156216837 cites W2251939518 @default.
- W3156216837 cites W2574745946 @default.
- W3156216837 cites W2734619116 @default.
- W3156216837 cites W2795038878 @default.
- W3156216837 cites W2799054028 @default.
- W3156216837 cites W2805206884 @default.
- W3156216837 cites W2888482885 @default.
- W3156216837 cites W2911227954 @default.
- W3156216837 cites W2960374072 @default.
- W3156216837 cites W2962788840 @default.
- W3156216837 cites W2963091658 @default.
- W3156216837 cites W2963341956 @default.
- W3156216837 cites W2963748441 @default.
- W3156216837 cites W2963846996 @default.
- W3156216837 cites W2964235839 @default.
- W3156216837 cites W2965373594 @default.
- W3156216837 cites W2969117066 @default.
- W3156216837 cites W2969958763 @default.
- W3156216837 cites W2970395295 @default.
- W3156216837 cites W2970476646 @default.
- W3156216837 cites W3001279689 @default.
- W3156216837 cites W3032816972 @default.
- W3156216837 cites W3034238904 @default.
- W3156216837 cites W3042518613 @default.
- W3156216837 cites W3095645723 @default.
- W3156216837 cites W3100355250 @default.
- W3156216837 cites W3104739822 @default.
- W3156216837 cites W3105425516 @default.
- W3156216837 cites W3112501082 @default.
- W3156216837 cites W3118781290 @default.
- W3156216837 cites W3119866685 @default.
- W3156216837 cites W3135371071 @default.
- W3156216837 cites W3173777717 @default.
- W3156216837 cites W3174269049 @default.
- W3156216837 cites W3190860428 @default.
- W3156216837 cites W2525127255 @default.
- W3156216837 hasPublicationYear "2021" @default.
- W3156216837 type Work @default.
- W3156216837 sameAs 3156216837 @default.
- W3156216837 citedByCount "13" @default.
- W3156216837 countsByYear W31562168372021 @default.
- W3156216837 countsByYear W31562168372022 @default.
- W3156216837 crossrefType "posted-content" @default.
- W3156216837 hasAuthorship W3156216837A5008013895 @default.
- W3156216837 hasAuthorship W3156216837A5015128745 @default.
- W3156216837 hasAuthorship W3156216837A5035088083 @default.
- W3156216837 hasAuthorship W3156216837A5052307927 @default.
- W3156216837 hasAuthorship W3156216837A5059265033 @default.
- W3156216837 hasAuthorship W3156216837A5068360032 @default.
- W3156216837 hasAuthorship W3156216837A5087098432 @default.
- W3156216837 hasConcept C13280743 @default.
- W3156216837 hasConcept C154945302 @default.
- W3156216837 hasConcept C177264268 @default.
- W3156216837 hasConcept C185798385 @default.
- W3156216837 hasConcept C199360897 @default.
- W3156216837 hasConcept C204321447 @default.
- W3156216837 hasConcept C205649164 @default.
- W3156216837 hasConcept C23123220 @default.
- W3156216837 hasConcept C2780719617 @default.
- W3156216837 hasConcept C41008148 @default.
- W3156216837 hasConcept C55282118 @default.
- W3156216837 hasConcept C56666940 @default.
- W3156216837 hasConcept C77088390 @default.
- W3156216837 hasConceptScore W3156216837C13280743 @default.
- W3156216837 hasConceptScore W3156216837C154945302 @default.
- W3156216837 hasConceptScore W3156216837C177264268 @default.
- W3156216837 hasConceptScore W3156216837C185798385 @default.
- W3156216837 hasConceptScore W3156216837C199360897 @default.
- W3156216837 hasConceptScore W3156216837C204321447 @default.
- W3156216837 hasConceptScore W3156216837C205649164 @default.
- W3156216837 hasConceptScore W3156216837C23123220 @default.
- W3156216837 hasConceptScore W3156216837C2780719617 @default.
- W3156216837 hasConceptScore W3156216837C41008148 @default.
- W3156216837 hasConceptScore W3156216837C55282118 @default.
- W3156216837 hasConceptScore W3156216837C56666940 @default.
- W3156216837 hasConceptScore W3156216837C77088390 @default.
- W3156216837 hasLocation W31562168371 @default.
- W3156216837 hasOpenAccess W3156216837 @default.
- W3156216837 hasPrimaryLocation W31562168371 @default.
- W3156216837 hasRelatedWork W1981379484 @default.
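
Each row above has the same shape: `- subject predicate object @graph.`, where the object is either a quoted literal or a bare identifier. A minimal sketch for reading such rows into tuples follows. The row shape is inferred from this listing only, not from any SemOpenAlex format definition, and the names `Quad`, `ROW`, and `parse_row` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    """One listing row (hypothetical container, not a SemOpenAlex type)."""
    subject: str
    predicate: str
    obj: str
    graph: str


# Assumed row shape: "- <subject> <predicate> <object> @<graph>."
# The object is either a double-quoted literal or a bare token.
ROW = re.compile(r'^-\s+(\S+)\s+(\S+)\s+(".*"|\S+)\s+@(\S+?)\.$')


def parse_row(line: str) -> Optional[Quad]:
    """Return a Quad for a listing row, or None for non-row lines."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj, graph = match.groups()
    if obj.startswith('"'):
        obj = obj[1:-1]  # drop the surrounding quotes of a literal
    return Quad(subject, predicate, obj, graph)
```

Lines that do not fit the assumed shape, such as the header line, yield `None`, so a caller can filter them out before grouping rows by predicate.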