Matches in SemOpenAlex for { <https://semopenalex.org/work/W4382600162> ?p ?o ?g. }
Showing items 1 to 60 of 60 with 100 items per page.
- W4382600162 abstract "Subword Tokenization of Noisy Housing Defect Complaints for Named Entity Recognition Kahyun Jeon, Ghang Lee Pages 418-425 (2023 Proceedings of the 40th ISARC, Chennai, India, ISBN 978-0-6458322-0-4, ISSN 2413-5844) Abstract: In domain-specific named entity recognition (NER), the out-of-vocabulary (OOV) problem arises due to linguistic features and rare vocabulary. The OOV problem is particularly challenging in agglutinative languages such as Korean. The irregular decomposition of morphemes makes it difficult to represent all of them in language model dictionaries, resulting in poor NER performance. Subword tokenization, which segments a word into atomic tokens that cannot be divided further, is one possible solution. In the construction industry, existing NER methods are not effective on housing defect complaints, which contain many rare words, including jargon, slang, and typos. To address this challenge, we propose subword tokenization algorithms that can mitigate OOV problems by considering linguistic features and pre-trained language models (PLMs). The primary objective of this study is to identify the optimal NER performance by comparing different subword tokenization methods depending on the language models used. For domain-specific NER, we defined and used 23 defect-specific named entity tags for dataset labelling. We then experimented with a total of three state-of-the-art language models: one SentencePiece-based and two WordPiece-based subword tokenization models. The results demonstrate that the SentencePiece-based Korean Bidirectional Encoder Representations from Transformers (KoBERT) outperformed the two WordPiece-based language models (multilingual-BERT and Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately (KoELECTRA)) with an F1 score of 84.7%.
The proposed method is expected to improve not only NER but also other downstream tasks that involve using Korean documents in the construction industry. Keywords: Out of vocabulary (OOV), Subword tokenization, WordPiece, SentencePiece, Named entity recognition (NER), Construction defect management DOI: https://doi.org/10.22260/ISARC2023/0057 Presentation Video: https://youtu.be/5n4syU3bMSI" @default.
- W4382600162 created "2023-06-30" @default.
- W4382600162 creator A5054970477 @default.
- W4382600162 creator A5075755166 @default.
- W4382600162 date "2023-07-07" @default.
- W4382600162 modified "2023-09-25" @default.
- W4382600162 title "Subword Tokenization of Noisy Housing Defect Complaints for Named Entity Recognition" @default.
- W4382600162 doi "https://doi.org/10.22260/isarc2023/0057" @default.
- W4382600162 hasPublicationYear "2023" @default.
- W4382600162 type Work @default.
- W4382600162 citedByCount "0" @default.
- W4382600162 crossrefType "proceedings-article" @default.
- W4382600162 hasAuthorship W4382600162A5054970477 @default.
- W4382600162 hasAuthorship W4382600162A5075755166 @default.
- W4382600162 hasConcept C137293760 @default.
- W4382600162 hasConcept C138885662 @default.
- W4382600162 hasConcept C154945302 @default.
- W4382600162 hasConcept C162324750 @default.
- W4382600162 hasConcept C165297611 @default.
- W4382600162 hasConcept C176982825 @default.
- W4382600162 hasConcept C187736073 @default.
- W4382600162 hasConcept C204321447 @default.
- W4382600162 hasConcept C2777601683 @default.
- W4382600162 hasConcept C2779135771 @default.
- W4382600162 hasConcept C2780451532 @default.
- W4382600162 hasConcept C28490314 @default.
- W4382600162 hasConcept C41008148 @default.
- W4382600162 hasConcept C41895202 @default.
- W4382600162 hasConcept C80875076 @default.
- W4382600162 hasConceptScore W4382600162C137293760 @default.
- W4382600162 hasConceptScore W4382600162C138885662 @default.
- W4382600162 hasConceptScore W4382600162C154945302 @default.
- W4382600162 hasConceptScore W4382600162C162324750 @default.
- W4382600162 hasConceptScore W4382600162C165297611 @default.
- W4382600162 hasConceptScore W4382600162C176982825 @default.
- W4382600162 hasConceptScore W4382600162C187736073 @default.
- W4382600162 hasConceptScore W4382600162C204321447 @default.
- W4382600162 hasConceptScore W4382600162C2777601683 @default.
- W4382600162 hasConceptScore W4382600162C2779135771 @default.
- W4382600162 hasConceptScore W4382600162C2780451532 @default.
- W4382600162 hasConceptScore W4382600162C28490314 @default.
- W4382600162 hasConceptScore W4382600162C41008148 @default.
- W4382600162 hasConceptScore W4382600162C41895202 @default.
- W4382600162 hasConceptScore W4382600162C80875076 @default.
- W4382600162 hasLocation W43826001621 @default.
- W4382600162 hasOpenAccess W4382600162 @default.
- W4382600162 hasPrimaryLocation W43826001621 @default.
- W4382600162 hasRelatedWork W2111082276 @default.
- W4382600162 hasRelatedWork W2114358883 @default.
- W4382600162 hasRelatedWork W2359001871 @default.
- W4382600162 hasRelatedWork W2759980945 @default.
- W4382600162 hasRelatedWork W2773616286 @default.
- W4382600162 hasRelatedWork W2947417049 @default.
- W4382600162 hasRelatedWork W3136915866 @default.
- W4382600162 hasRelatedWork W4285014497 @default.
- W4382600162 hasRelatedWork W4323240841 @default.
- W4382600162 hasRelatedWork W4382600162 @default.
- W4382600162 isParatext "false" @default.
- W4382600162 isRetracted "false" @default.
- W4382600162 workType "article" @default.