Matches in SemOpenAlex for { <https://semopenalex.org/work/W1601676518> ?p ?o ?g. }
Showing items 1 to 96 of 96 with 100 items per page.
- W1601676518 endingPage "179" @default.
- W1601676518 startingPage "173" @default.
- W1601676518 abstract "Text clustering divides a set of texts into clusters (parts), so that texts within each cluster are similar in content. It may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on familiar ones. The main contributions of this thesis are an investigation of text representation for Swedish and some extensions of the work on how to use text clustering as an exploration tool. We have also done some work on synonyms and evaluation of clustering results. Text clustering, at least such as it is treated here, is performed using the vector space model, which is commonly used in information retrieval. This model represents texts by the words that appear in them and considers texts similar in content if they share many words. Languages differ in what is considered a word. We have investigated the impact of some of the characteristics of Swedish on text clustering. Swedish has more morphological variation than for instance English. We show that it is beneficial to use the lemma form of words rather than the word forms. Swedish has a rich production of solid compounds. Most of the constituents of these are used on their own as words and in several different compounds. In fact, Swedish solid compounds often correspond to phrases or open compounds in other languages. Our experiments show that it is beneficial to split solid compounds into their parts when building the representation. The vector space model does not regard word order. We have tried to extend it with nominal phrases in different ways. We have also tried to differentiate between homographs, words that look alike but mean different things, by augmenting all words with a tag indicating their part of speech. None of our experiments using phrases or part of speech information have shown any improvement over using the ordinary model. Evaluation of text clustering results is very hard. 
What is a good partition of a text set is inherently subjective. External quality measures compare a clustering with a (manual) categorization of the same text set. The theoretical best possible value for a measure is known, but it is not obvious what a good value is – text sets differ in difficulty to cluster and categorizations are more or less adapted to a particular text set. We describe how evaluation can be improved for cases where a text set has more than one categorization. In such cases the result of a clustering can be compared with the result for one of the categorizations, which we assume is a good partition. In some related work we have built a dictionary of synonyms. We use it to compare two different principles for automatic word relation extraction through clustering of words. Text clustering can be used to explore the contents of a text set. We have developed a visualization method that aids such exploration, and implemented it in a tool, called Infomat. It presents the representation matrix directly in two dimensions. When the order of texts and words is changed, by for instance clustering, distributional patterns that indicate similarities between texts and words appear. We have used Infomat to explore a set of free text answers about occupation from a questionnaire given to over 40 000 Swedish twins. The questionnaire also contained a closed answer regarding smoking. We compared several clusterings of the text answers to the closed answer, regarded as a categorization, by means of clustering evaluation. A recurring text cluster of high quality led us to formulate the hypothesis that “farmers smoke less than the average”, which we later could verify by reading previous studies. This hypothesis generation method could be used on any set of texts that is coupled with data that is restricted to a limited number of possible values." @default.
- W1601676518 created "2016-06-24" @default.
- W1601676518 creator A5070579609 @default.
- W1601676518 creator A5087365007 @default.
- W1601676518 date "2005-01-01" @default.
- W1601676518 modified "2023-10-01" @default.
- W1601676518 title "The impact of phrases in document clustering for Swedish" @default.
- W1601676518 cites W1522930108 @default.
- W1601676518 cites W1569899956 @default.
- W1601676518 cites W1651093245 @default.
- W1601676518 cites W173956503 @default.
- W1601676518 cites W2100958137 @default.
- W1601676518 cites W2126802698 @default.
- W1601676518 cites W2137763598 @default.
- W1601676518 hasPublicationYear "2005" @default.
- W1601676518 type Work @default.
- W1601676518 sameAs 1601676518 @default.
- W1601676518 citedByCount "8" @default.
- W1601676518 countsByYear W16016765182013 @default.
- W1601676518 countsByYear W16016765182014 @default.
- W1601676518 crossrefType "journal-article" @default.
- W1601676518 hasAuthorship W1601676518A5070579609 @default.
- W1601676518 hasAuthorship W1601676518A5087365007 @default.
- W1601676518 hasConcept C111919701 @default.
- W1601676518 hasConcept C154945302 @default.
- W1601676518 hasConcept C177264268 @default.
- W1601676518 hasConcept C17744445 @default.
- W1601676518 hasConcept C177937566 @default.
- W1601676518 hasConcept C18903297 @default.
- W1601676518 hasConcept C199360897 @default.
- W1601676518 hasConcept C199539241 @default.
- W1601676518 hasConcept C204321447 @default.
- W1601676518 hasConcept C23123220 @default.
- W1601676518 hasConcept C2524010 @default.
- W1601676518 hasConcept C2776359362 @default.
- W1601676518 hasConcept C2777759810 @default.
- W1601676518 hasConcept C2778572836 @default.
- W1601676518 hasConcept C33923547 @default.
- W1601676518 hasConcept C41008148 @default.
- W1601676518 hasConcept C46757340 @default.
- W1601676518 hasConcept C73555534 @default.
- W1601676518 hasConcept C86803240 @default.
- W1601676518 hasConcept C89686163 @default.
- W1601676518 hasConcept C90805587 @default.
- W1601676518 hasConcept C94625758 @default.
- W1601676518 hasConceptScore W1601676518C111919701 @default.
- W1601676518 hasConceptScore W1601676518C154945302 @default.
- W1601676518 hasConceptScore W1601676518C177264268 @default.
- W1601676518 hasConceptScore W1601676518C17744445 @default.
- W1601676518 hasConceptScore W1601676518C177937566 @default.
- W1601676518 hasConceptScore W1601676518C18903297 @default.
- W1601676518 hasConceptScore W1601676518C199360897 @default.
- W1601676518 hasConceptScore W1601676518C199539241 @default.
- W1601676518 hasConceptScore W1601676518C204321447 @default.
- W1601676518 hasConceptScore W1601676518C23123220 @default.
- W1601676518 hasConceptScore W1601676518C2524010 @default.
- W1601676518 hasConceptScore W1601676518C2776359362 @default.
- W1601676518 hasConceptScore W1601676518C2777759810 @default.
- W1601676518 hasConceptScore W1601676518C2778572836 @default.
- W1601676518 hasConceptScore W1601676518C33923547 @default.
- W1601676518 hasConceptScore W1601676518C41008148 @default.
- W1601676518 hasConceptScore W1601676518C46757340 @default.
- W1601676518 hasConceptScore W1601676518C73555534 @default.
- W1601676518 hasConceptScore W1601676518C86803240 @default.
- W1601676518 hasConceptScore W1601676518C89686163 @default.
- W1601676518 hasConceptScore W1601676518C90805587 @default.
- W1601676518 hasConceptScore W1601676518C94625758 @default.
- W1601676518 hasLocation W16016765181 @default.
- W1601676518 hasOpenAccess W1601676518 @default.
- W1601676518 hasPrimaryLocation W16016765181 @default.
- W1601676518 hasRelatedWork W1489033817 @default.
- W1601676518 hasRelatedWork W1496173351 @default.
- W1601676518 hasRelatedWork W1569899956 @default.
- W1601676518 hasRelatedWork W186475296 @default.
- W1601676518 hasRelatedWork W1993539890 @default.
- W1601676518 hasRelatedWork W2036614928 @default.
- W1601676518 hasRelatedWork W2130508301 @default.
- W1601676518 hasRelatedWork W2156635423 @default.
- W1601676518 hasRelatedWork W2164107060 @default.
- W1601676518 hasRelatedWork W2303942292 @default.
- W1601676518 hasRelatedWork W2468893743 @default.
- W1601676518 hasRelatedWork W260234060 @default.
- W1601676518 hasRelatedWork W2621016166 @default.
- W1601676518 hasRelatedWork W2785674851 @default.
- W1601676518 hasRelatedWork W2809008043 @default.
- W1601676518 hasRelatedWork W2982591137 @default.
- W1601676518 hasRelatedWork W2998766172 @default.
- W1601676518 hasRelatedWork W3132159544 @default.
- W1601676518 hasRelatedWork W2187453790 @default.
- W1601676518 hasRelatedWork W2856918126 @default.
- W1601676518 isParatext "false" @default.
- W1601676518 isRetracted "false" @default.
- W1601676518 magId "1601676518" @default.
- W1601676518 workType "article" @default.