Matches in SemOpenAlex for { <https://semopenalex.org/work/W145425800> ?p ?o ?g. }
Showing items 1 to 99 of 99, with 100 items per page.
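The listing below was produced by the quad pattern shown in the header. As a rough sketch only, the same results can be fetched programmatically via SemOpenAlex's SPARQL endpoint; the endpoint URL (https://semopenalex.org/sparql), the simplified `?p ?o` pattern, and the result limit are assumptions made for illustration, not part of the listing itself.

```python
# Minimal sketch: fetch all predicate/object pairs for work W145425800 from
# SemOpenAlex over SPARQL. The endpoint URL below is an assumption; verify it
# before relying on this code.
import requests

ENDPOINT = "https://semopenalex.org/sparql"  # assumed public SPARQL endpoint

QUERY = """
SELECT ?p ?o
WHERE {
  <https://semopenalex.org/work/W145425800> ?p ?o .
}
LIMIT 100
"""

def fetch_triples():
    # Request SPARQL JSON results, a media type most triple stores support.
    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    for binding in resp.json()["results"]["bindings"]:
        yield binding["p"]["value"], binding["o"]["value"]

if __name__ == "__main__":
    for predicate, obj in fetch_triples():
        print(predicate, obj)
```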
- W145425800 abstract "An Empirical Evaluation of Models of Text Document Similarity Michael D. Lee (michael.lee@adelaide.edu.au) Department of Psychology, University of Adelaide South Australia, 5005, AUSTRALIA Brandon Pincombe (brandon.pincombe@dsto.defence.gov.au) Intelligence Surveillance and Reconnaissance Division, Defence Science and Technology Organisation PO Box 1500, Edinburgh SA 5111 AUSTRALIA Matthew Welsh (matthew.welsh@adelaide.edu.au) Australian School of Petroleum Engineering, University of Adelaide South Australia, 5005, AUSTRALIA Abstract Modeling the semantic similarity between text docu- ments presents a significant theoretical challenge for cognitive science, with ready-made applications in in- formation handling and decision support systems deal- ing with text. While a number of candidate models exist, they have generally not been assessed in terms of their ability to emulate human judgments of simi- larity. To address this problem, we conducted an ex- periment that collected repeated similarity measures for each pair of documents in a small corpus of short news documents. An analysis of human performance showed inter-rater correlations of about 0.6. We then considered the ability of existing models—using word- based, n-gram and Latent Semantic Analysis (LSA) approaches—to model these human judgments. The best performed LSA model produced correlations of about 0.6, consistent with human performance, while the best performed word-based and n-gram models achieved correlations closer to 0.5. Many of the re- maining models showed almost no correlation with hu- man performance. Based on our results, we provide some discussion of the key strengths and weaknesses of the models we examined. Introduction Modeling the semantic similarity between text docu- ments is an interesting problem for cognitive science, for both theoretical and practical reasons. Theoret- ically, it involves the study of a basic cognitive pro- cess with richly structured natural stimuli. Practically, search engines, text corpus visualizations, and a vari- ety of other applications for filtering, sorting, retriev- ing, and generally handling text rely fundamentally on similarity measures. For this reason, the ability to as- sess semantic similarity in an accurate, automated, and scalable way is a key determinant of the effectiveness of most information handling and decision support soft- ware that deals with text. A variety of different approaches have been devel- oped for modeling text document similarity. These in- clude simple word-based, keyword-based and n-gram measures (e.g., Salton, 1989; Damashek, 1995), and more complicated approaches such as Latent Seman- tic Analysis (LSA: Deerwester et al., 1990; Landauer and Dumais, 1997). While all of these approaches have achieved some level of practical success, they have gen- erally not been assessed in terms of their ability to model human judgments of text document similarity. The most likely reason for this failure is that no suit- able empirical data exist, and considerable effort is in- volved in collecting pairwise ratings of text document similarity for even a moderate number of documents. This paper reports the collection of data that give ten independent ratings of the similarity of every pair of 50 short text documents, and so represents an attempt to establish a ‘psychological ground truth’ for evaluating models. Using the new data, we report a first eval- uation of the ability of word-based, n-gram and LSA approaches to model human judgments. 
Experiment Materials The text corpus evaluated by human judges contained 50 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories. The documents varied in length from 51 to 126 words, and covered a number of broad topics. A further 314 documents from the same source were collected to act as a larger ‘backgrounding’ corpus for LSA. Both document sets were assessed against a standard corpus of five English texts using four models of language. These were the log-normal, generalized inverse Gauss-Poisson (with γ = −0.5), Yule-Simon and Zipfian models (Baayen, 2001). Both document sets were within the normal range of English text for word frequency spectrum and vocabulary growth and were therefore regarded as representative of normal English texts. Subjects The subjects were 83 University of Adelaide students (29 males and 54 females), with a mean age of 19.7 years. They were each paid with a ten (Australian) dollar gift voucher for every 100 document pair ratings made." @default.
- W145425800 created "2016-06-24" @default.
- W145425800 creator A5059458561 @default.
- W145425800 creator A5069667659 @default.
- W145425800 creator A5081061405 @default.
- W145425800 date "2005-01-01" @default.
- W145425800 modified "2023-10-02" @default.
- W145425800 title "An Empirical Evaluation of Models of Text Document Similarity" @default.
- W145425800 cites W1833785989 @default.
- W145425800 cites W1983578042 @default.
- W145425800 cites W2001082470 @default.
- W145425800 cites W2001141328 @default.
- W145425800 cites W2027796863 @default.
- W145425800 cites W2032776989 @default.
- W145425800 cites W2056989578 @default.
- W145425800 cites W2059975159 @default.
- W145425800 cites W2078396547 @default.
- W145425800 cites W2108885144 @default.
- W145425800 cites W2109166628 @default.
- W145425800 cites W2115054880 @default.
- W145425800 cites W2133319859 @default.
- W145425800 cites W2147152072 @default.
- W145425800 hasPublicationYear "2005" @default.
- W145425800 type Work @default.
- W145425800 sameAs 145425800 @default.
- W145425800 citedByCount "86" @default.
- W145425800 countsByYear W1454258002012 @default.
- W145425800 countsByYear W1454258002013 @default.
- W145425800 countsByYear W1454258002014 @default.
- W145425800 countsByYear W1454258002015 @default.
- W145425800 countsByYear W1454258002016 @default.
- W145425800 countsByYear W1454258002017 @default.
- W145425800 countsByYear W1454258002018 @default.
- W145425800 countsByYear W1454258002019 @default.
- W145425800 countsByYear W1454258002020 @default.
- W145425800 countsByYear W1454258002021 @default.
- W145425800 crossrefType "journal-article" @default.
- W145425800 hasAuthorship W145425800A5059458561 @default.
- W145425800 hasAuthorship W145425800A5069667659 @default.
- W145425800 hasAuthorship W145425800A5081061405 @default.
- W145425800 hasConcept C103278499 @default.
- W145425800 hasConcept C105795698 @default.
- W145425800 hasConcept C115961682 @default.
- W145425800 hasConcept C120936955 @default.
- W145425800 hasConcept C138885662 @default.
- W145425800 hasConcept C154945302 @default.
- W145425800 hasConcept C15744967 @default.
- W145425800 hasConcept C170133592 @default.
- W145425800 hasConcept C204321447 @default.
- W145425800 hasConcept C23123220 @default.
- W145425800 hasConcept C2522767166 @default.
- W145425800 hasConcept C2780769345 @default.
- W145425800 hasConcept C33923547 @default.
- W145425800 hasConcept C41008148 @default.
- W145425800 hasConcept C41895202 @default.
- W145425800 hasConceptScore W145425800C103278499 @default.
- W145425800 hasConceptScore W145425800C105795698 @default.
- W145425800 hasConceptScore W145425800C115961682 @default.
- W145425800 hasConceptScore W145425800C120936955 @default.
- W145425800 hasConceptScore W145425800C138885662 @default.
- W145425800 hasConceptScore W145425800C154945302 @default.
- W145425800 hasConceptScore W145425800C15744967 @default.
- W145425800 hasConceptScore W145425800C170133592 @default.
- W145425800 hasConceptScore W145425800C204321447 @default.
- W145425800 hasConceptScore W145425800C23123220 @default.
- W145425800 hasConceptScore W145425800C2522767166 @default.
- W145425800 hasConceptScore W145425800C2780769345 @default.
- W145425800 hasConceptScore W145425800C33923547 @default.
- W145425800 hasConceptScore W145425800C41008148 @default.
- W145425800 hasConceptScore W145425800C41895202 @default.
- W145425800 hasIssue "27" @default.
- W145425800 hasLocation W1454258001 @default.
- W145425800 hasOpenAccess W145425800 @default.
- W145425800 hasPrimaryLocation W1454258001 @default.
- W145425800 hasRelatedWork W1566018662 @default.
- W145425800 hasRelatedWork W158057341 @default.
- W145425800 hasRelatedWork W1647729745 @default.
- W145425800 hasRelatedWork W1662133657 @default.
- W145425800 hasRelatedWork W1880262756 @default.
- W145425800 hasRelatedWork W1960027552 @default.
- W145425800 hasRelatedWork W1983578042 @default.
- W145425800 hasRelatedWork W1992914835 @default.
- W145425800 hasRelatedWork W2038721957 @default.
- W145425800 hasRelatedWork W2045929671 @default.
- W145425800 hasRelatedWork W2059975159 @default.
- W145425800 hasRelatedWork W2080100102 @default.
- W145425800 hasRelatedWork W2103318667 @default.
- W145425800 hasRelatedWork W2120779048 @default.
- W145425800 hasRelatedWork W2121184547 @default.
- W145425800 hasRelatedWork W2136930489 @default.
- W145425800 hasRelatedWork W2147152072 @default.
- W145425800 hasRelatedWork W2158997610 @default.
- W145425800 hasRelatedWork W2165612380 @default.
- W145425800 hasRelatedWork W2170682101 @default.
- W145425800 hasVolume "27" @default.
- W145425800 isParatext "false" @default.
- W145425800 isRetracted "false" @default.
- W145425800 magId "145425800" @default.
- W145425800 workType "article" @default.
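The abstract stored in this record compares word-based, n-gram and LSA measures of document similarity against human ratings. For orientation only, the sketch below shows one generic LSA-style pipeline (TF-IDF weighting, truncated SVD, cosine similarity) with scikit-learn; the sample documents, dimensionality, and preprocessing are illustrative assumptions and do not reproduce the authors' corpus, models, or parameters.

```python
# Illustrative only: a generic LSA-style document-similarity pipeline
# (TF-IDF -> truncated SVD -> cosine similarity). Not the paper's exact
# method, corpus, or parameter choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for short news documents like those in the study.
documents = [
    "cyclone damages homes along the northern coast",
    "storm destroys houses near the coast",
    "reserve bank leaves interest rates unchanged",
]

# Bag-of-words weighting, then projection into a low-dimensional latent space.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Pairwise cosine similarities between documents in the latent space.
similarities = cosine_similarity(lsa)
print(similarities)
```

In the study these model-derived similarities would then be correlated with the averaged human pairwise ratings; that evaluation step is described in the abstract but not reproduced here.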