Matches in SemOpenAlex for { <https://semopenalex.org/work/W4379538464> ?p ?o ?g. }
Showing items 1 to 51 of 51 with 100 items per page.
- W4379538464 abstract "Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently cited in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and ability to deal with unknown words. As their relative importance is not entirely clear yet, we propose a tokenization approach that enables us to separate frequency (the first advantage) from compositionality. The approach uses Huffman coding to tokenize words, by order of frequency, using a fixed amount of symbols. Experiments with CS-DE, EN-FR and EN-DE NMT show that frequency alone accounts for 90%-95% of the scores reached by BPE, hence compositionality has less importance than previously thought." @default.
- W4379538464 created "2023-06-07" @default.
- W4379538464 creator A5035599291 @default.
- W4379538464 creator A5044462851 @default.
- W4379538464 creator A5063528547 @default.
- W4379538464 creator A5092102976 @default.
- W4379538464 creator A5092102977 @default.
- W4379538464 date "2023-06-02" @default.
- W4379538464 modified "2023-09-25" @default.
- W4379538464 title "Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT" @default.
- W4379538464 doi "https://doi.org/10.48550/arxiv.2306.01393" @default.
- W4379538464 hasPublicationYear "2023" @default.
- W4379538464 type Work @default.
- W4379538464 citedByCount "0" @default.
- W4379538464 crossrefType "posted-content" @default.
- W4379538464 hasAuthorship W4379538464A5035599291 @default.
- W4379538464 hasAuthorship W4379538464A5044462851 @default.
- W4379538464 hasAuthorship W4379538464A5063528547 @default.
- W4379538464 hasAuthorship W4379538464A5092102976 @default.
- W4379538464 hasAuthorship W4379538464A5092102977 @default.
- W4379538464 hasBestOaLocation W43795384641 @default.
- W4379538464 hasConcept C121375916 @default.
- W4379538464 hasConcept C154945302 @default.
- W4379538464 hasConcept C176982825 @default.
- W4379538464 hasConcept C204321447 @default.
- W4379538464 hasConcept C28490314 @default.
- W4379538464 hasConcept C41008148 @default.
- W4379538464 hasConcept C520968082 @default.
- W4379538464 hasConceptScore W4379538464C121375916 @default.
- W4379538464 hasConceptScore W4379538464C154945302 @default.
- W4379538464 hasConceptScore W4379538464C176982825 @default.
- W4379538464 hasConceptScore W4379538464C204321447 @default.
- W4379538464 hasConceptScore W4379538464C28490314 @default.
- W4379538464 hasConceptScore W4379538464C41008148 @default.
- W4379538464 hasConceptScore W4379538464C520968082 @default.
- W4379538464 hasLocation W43795384641 @default.
- W4379538464 hasOpenAccess W4379538464 @default.
- W4379538464 hasPrimaryLocation W43795384641 @default.
- W4379538464 hasRelatedWork W2018369711 @default.
- W4379538464 hasRelatedWork W2042474027 @default.
- W4379538464 hasRelatedWork W2098603082 @default.
- W4379538464 hasRelatedWork W2251027649 @default.
- W4379538464 hasRelatedWork W2505414515 @default.
- W4379538464 hasRelatedWork W2518324938 @default.
- W4379538464 hasRelatedWork W2757988102 @default.
- W4379538464 hasRelatedWork W2889764925 @default.
- W4379538464 hasRelatedWork W3020843266 @default.
- W4379538464 hasRelatedWork W3135646670 @default.
- W4379538464 isParatext "false" @default.
- W4379538464 isRetracted "false" @default.
- W4379538464 workType "article" @default.
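
The abstract above describes tokenizing words with Huffman coding, ordered by frequency, over a fixed number of symbols. The following is only a minimal sketch of generic n-ary Huffman coding over word counts, not the authors' implementation: the function name `huffman_codes`, the toy corpus, and the three-symbol alphabet `"abc"` are assumptions made for illustration, and details of the paper's actual scheme may differ.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs, symbols):
    """Map each word to a tuple of symbols via n-ary Huffman coding.

    freqs:   dict of word -> frequency (hypothetical input format)
    symbols: sequence of at least two distinct code symbols
    """
    n = len(symbols)
    if n < 2:
        raise ValueError("need at least two symbols")
    if not freqs:
        return {}
    tie = count()  # tie-breaker so the heap never compares tree nodes
    # Heap entries are (weight, tie, node); a node is a word (leaf),
    # None (zero-weight padding), or a list of child nodes.
    heap = [(w, next(tie), word) for word, w in freqs.items()]
    # Pad so that every merge step can take exactly n nodes.
    while len(heap) > 1 and (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(n)]
        total = sum(w for w, _, _ in group)
        children = [node for _, _, node in group]
        heapq.heappush(heap, (total, next(tie), children))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):
            for sym, child in zip(symbols, node):
                walk(child, prefix + (sym,))
        elif node is not None:
            # A lone word would get an empty code; give it one symbol.
            codes[node] = prefix or (symbols[0],)

    walk(heap[0][2], ())
    return codes


# Toy usage: frequent words receive codes no longer than rarer ones.
freqs = Counter("the cat sat on the mat the end".split())
codes = huffman_codes(freqs, "abc")
tokens = [sym for word in "the cat sat".split() for sym in codes[word]]
```

Because the symbols assigned this way carry no morphological meaning, a scheme of this kind keeps the frequency effect (shorter codes for frequent words) while discarding compositionality, which is the separation the abstract describes.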