Matches in SemOpenAlex for { <https://semopenalex.org/work/W4386875581> ?p ?o ?g. }
Showing items 1 to 59 of 59 with 100 items per page.
- W4386875581 abstract "The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have been frequently made accessible to the public to foster deeper investigation and applications. However, when it comes to training datasets for these LLMs, especially the recent state-of-the-art models, they are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable dataset to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public in HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX." @default.
- W4386875581 created "2023-09-20" @default.
- W4386875581 creator A5009957887 @default.
- W4386875581 creator A5025429868 @default.
- W4386875581 creator A5028863551 @default.
- W4386875581 creator A5033851224 @default.
- W4386875581 creator A5048500413 @default.
- W4386875581 creator A5061796051 @default.
- W4386875581 creator A5070047759 @default.
- W4386875581 creator A5087736585 @default.
- W4386875581 date "2023-09-17" @default.
- W4386875581 modified "2023-09-27" @default.
- W4386875581 title "CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages" @default.
- W4386875581 doi "https://doi.org/10.48550/arxiv.2309.09400" @default.
- W4386875581 hasPublicationYear "2023" @default.
- W4386875581 type Work @default.
- W4386875581 citedByCount "0" @default.
- W4386875581 crossrefType "posted-content" @default.
- W4386875581 hasAuthorship W4386875581A5009957887 @default.
- W4386875581 hasAuthorship W4386875581A5025429868 @default.
- W4386875581 hasAuthorship W4386875581A5028863551 @default.
- W4386875581 hasAuthorship W4386875581A5033851224 @default.
- W4386875581 hasAuthorship W4386875581A5048500413 @default.
- W4386875581 hasAuthorship W4386875581A5061796051 @default.
- W4386875581 hasAuthorship W4386875581A5070047759 @default.
- W4386875581 hasAuthorship W4386875581A5087736585 @default.
- W4386875581 hasBestOaLocation W43868755811 @default.
- W4386875581 hasConcept C136764020 @default.
- W4386875581 hasConcept C154945302 @default.
- W4386875581 hasConcept C2522767166 @default.
- W4386875581 hasConcept C2780233690 @default.
- W4386875581 hasConcept C32587265 @default.
- W4386875581 hasConcept C38652104 @default.
- W4386875581 hasConcept C41008148 @default.
- W4386875581 hasConcept C77088390 @default.
- W4386875581 hasConceptScore W4386875581C136764020 @default.
- W4386875581 hasConceptScore W4386875581C154945302 @default.
- W4386875581 hasConceptScore W4386875581C2522767166 @default.
- W4386875581 hasConceptScore W4386875581C2780233690 @default.
- W4386875581 hasConceptScore W4386875581C32587265 @default.
- W4386875581 hasConceptScore W4386875581C38652104 @default.
- W4386875581 hasConceptScore W4386875581C41008148 @default.
- W4386875581 hasConceptScore W4386875581C77088390 @default.
- W4386875581 hasLocation W43868755811 @default.
- W4386875581 hasOpenAccess W4386875581 @default.
- W4386875581 hasPrimaryLocation W43868755811 @default.
- W4386875581 hasRelatedWork W2313358680 @default.
- W4386875581 hasRelatedWork W2320323244 @default.
- W4386875581 hasRelatedWork W2515214618 @default.
- W4386875581 hasRelatedWork W2547510008 @default.
- W4386875581 hasRelatedWork W2736204053 @default.
- W4386875581 hasRelatedWork W2748952813 @default.
- W4386875581 hasRelatedWork W2766145069 @default.
- W4386875581 hasRelatedWork W2802582576 @default.
- W4386875581 hasRelatedWork W3003602898 @default.
- W4386875581 hasRelatedWork W3003962028 @default.
- W4386875581 isParatext "false" @default.
- W4386875581 isRetracted "false" @default.
- W4386875581 workType "article" @default.