Matches in SemOpenAlex for { <https://semopenalex.org/work/W4225545910> ?p ?o ?g. }
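The property/value pairs listed below match the graph pattern above. As a minimal sketch (not part of the record itself), the same listing could be reproduced programmatically against SemOpenAlex's public SPARQL endpoint; the endpoint URL and the simplified `?p ?o` pattern are assumptions here, not taken from this page.

```python
# Minimal sketch: fetch all predicate/object pairs for this work from SemOpenAlex.
# Assumption: the public SPARQL endpoint lives at https://semopenalex.org/sparql.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://semopenalex.org/sparql"  # assumed endpoint URL

query = """
SELECT ?p ?o
WHERE {
  <https://semopenalex.org/work/W4225545910> ?p ?o .
}
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    # Each row corresponds to one bullet in the listing below.
    print(binding["p"]["value"], binding["o"]["value"])
```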
- W4225545910 abstract "Abstract Background Natural language processing (NLP) tasks in the health domain often deal with limited amount of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to a task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have been proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at such a large scale that is unlikely to obtain in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP. Method We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the amount of training examples per disease from small to large, and measured the classification performance in macro-averaged $$F_{1}$$ <mml:math xmlns:mml=http://www.w3.org/1998/Math/MathML><mml:msub><mml:mi>F</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math> score. Results On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features. Conclusion As long as the amount of training documents is not extremely few, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bag-of-words features shall be considered." @default.
- W4225545910 created "2022-05-05" @default.
- W4225545910 creator A5003788131 @default.
- W4225545910 creator A5027770821 @default.
- W4225545910 creator A5039684052 @default.
- W4225545910 creator A5048955398 @default.
- W4225545910 creator A5053215726 @default.
- W4225545910 date "2021-11-01" @default.
- W4225545910 modified "2023-09-30" @default.
- W4225545910 title "When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification" @default.
- W4225545910 cites W1560851690 @default.
- W4225545910 cites W1994929347 @default.
- W4225545910 cites W2025428542 @default.
- W4225545910 cites W2093157872 @default.
- W4225545910 cites W2100676408 @default.
- W4225545910 cites W2109206523 @default.
- W4225545910 cites W2112950297 @default.
- W4225545910 cites W2132724073 @default.
- W4225545910 cites W2153064780 @default.
- W4225545910 cites W2157549840 @default.
- W4225545910 cites W2157807817 @default.
- W4225545910 cites W2190333735 @default.
- W4225545910 cites W2467995757 @default.
- W4225545910 cites W2557074642 @default.
- W4225545910 cites W2620787630 @default.
- W4225545910 cites W2743028754 @default.
- W4225545910 cites W2902516827 @default.
- W4225545910 cites W2911489562 @default.
- W4225545910 cites W2923014074 @default.
- W4225545910 cites W2927032858 @default.
- W4225545910 cites W2953356739 @default.
- W4225545910 cites W2956394034 @default.
- W4225545910 cites W2963716420 @default.
- W4225545910 cites W2971258845 @default.
- W4225545910 cites W2976476443 @default.
- W4225545910 cites W2979920993 @default.
- W4225545910 cites W2991755410 @default.
- W4225545910 cites W2993873509 @default.
- W4225545910 cites W3011762034 @default.
- W4225545910 cites W3013605954 @default.
- W4225545910 cites W3023618320 @default.
- W4225545910 cites W3025400983 @default.
- W4225545910 cites W3034238904 @default.
- W4225545910 cites W3047797631 @default.
- W4225545910 cites W3105705953 @default.
- W4225545910 cites W96276655 @default.
- W4225545910 doi "https://doi.org/10.1186/s12911-022-01829-2" @default.
- W4225545910 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/35382811" @default.
- W4225545910 hasPublicationYear "2021" @default.
- W4225545910 type Work @default.
- W4225545910 citedByCount "4" @default.
- W4225545910 countsByYear W42255459102022 @default.
- W4225545910 countsByYear W42255459102023 @default.
- W4225545910 crossrefType "journal-article" @default.
- W4225545910 hasAuthorship W4225545910A5003788131 @default.
- W4225545910 hasAuthorship W4225545910A5027770821 @default.
- W4225545910 hasAuthorship W4225545910A5039684052 @default.
- W4225545910 hasAuthorship W4225545910A5048955398 @default.
- W4225545910 hasAuthorship W4225545910A5053215726 @default.
- W4225545910 hasBestOaLocation W42255459101 @default.
- W4225545910 hasConcept C111919701 @default.
- W4225545910 hasConcept C118505674 @default.
- W4225545910 hasConcept C119857082 @default.
- W4225545910 hasConcept C121332964 @default.
- W4225545910 hasConcept C137293760 @default.
- W4225545910 hasConcept C150899416 @default.
- W4225545910 hasConcept C151730666 @default.
- W4225545910 hasConcept C153083717 @default.
- W4225545910 hasConcept C154945302 @default.
- W4225545910 hasConcept C165801399 @default.
- W4225545910 hasConcept C204321447 @default.
- W4225545910 hasConcept C207685749 @default.
- W4225545910 hasConcept C2779343474 @default.
- W4225545910 hasConcept C41008148 @default.
- W4225545910 hasConcept C62520636 @default.
- W4225545910 hasConcept C66322947 @default.
- W4225545910 hasConcept C86803240 @default.
- W4225545910 hasConceptScore W4225545910C111919701 @default.
- W4225545910 hasConceptScore W4225545910C118505674 @default.
- W4225545910 hasConceptScore W4225545910C119857082 @default.
- W4225545910 hasConceptScore W4225545910C121332964 @default.
- W4225545910 hasConceptScore W4225545910C137293760 @default.
- W4225545910 hasConceptScore W4225545910C150899416 @default.
- W4225545910 hasConceptScore W4225545910C151730666 @default.
- W4225545910 hasConceptScore W4225545910C153083717 @default.
- W4225545910 hasConceptScore W4225545910C154945302 @default.
- W4225545910 hasConceptScore W4225545910C165801399 @default.
- W4225545910 hasConceptScore W4225545910C204321447 @default.
- W4225545910 hasConceptScore W4225545910C207685749 @default.
- W4225545910 hasConceptScore W4225545910C2779343474 @default.
- W4225545910 hasConceptScore W4225545910C41008148 @default.
- W4225545910 hasConceptScore W4225545910C62520636 @default.
- W4225545910 hasConceptScore W4225545910C66322947 @default.
- W4225545910 hasConceptScore W4225545910C86803240 @default.
- W4225545910 hasFunder F4320322725 @default.
- W4225545910 hasFunder F4320337372 @default.
- W4225545910 hasFunder F4320337389 @default.
- W4225545910 hasIssue "S9" @default.
- W4225545910 hasLocation W42255459101 @default.
- W4225545910 hasLocation W42255459102 @default.
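The abstract above reports classification performance as the macro-averaged F1 score. As an illustrative sketch only (toy labels, not the paper's data or code), macro-F1 averages the per-class F1 values with equal weight, which is why rare diseases count as much as common ones:

```python
# Illustrative macro-averaged F1 on toy labels (not the paper's data).
def macro_f1(y_true, y_pred):
    """Average per-class F1 with equal weight per class."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Toy example: "flu" is common, "rare_disease" has a single true instance.
y_true = ["flu", "flu", "flu", "rare_disease"]
y_pred = ["flu", "flu", "rare_disease", "rare_disease"]
print(round(macro_f1(y_true, y_pred), 3))  # each class contributes equally
```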