Matches in SemOpenAlex for { <https://semopenalex.org/work/W607530189> ?p ?o ?g. }
- W607530189 abstract "Text categorization is the task of discovering the category or class that text documents belong to, or in other words, identifying the correct topic for text documents. While many machine learning schemes for building automatic classifiers exist today, these are typically resource demanding and do not always achieve the best results when given the whole contents of the documents. A popular solution to these problems is called feature selection. The features (e.g. terms) in a document collection are given weights based on a simple scheme, and then ranked by these weights. Next, each document is represented using only the top-ranked features, typically only a few percent of the features. The classifier is then built in considerably less time, and its accuracy might even improve. In situations where the documents can belong to one of a series of categories, one can either build a multi-class classifier and use one feature set for all categories, or one can split the problem into a series of binary categorization tasks (deciding if documents belong to a category or not) and create one ranked feature subset for each category/classifier. Many feature selection metrics have been suggested over the last decades, including supervised methods that make use of a manually pre-categorized set of training documents, and unsupervised methods that need only training documents of the same type or collection that is to be categorized. While many of these look promising, there has been a lack of large-scale comparison experiments. Also, several methods have been proposed in the last two years. Moreover, most evaluations are conducted on a set of binary tasks instead of a multi-class task as this often gives better results, although multi-class categorization with a joint feature set is often used in operational environments. 
In this report, we present results from the comparison of 16 feature selection methods (in addition to random selection) using various feature set sizes. Of these, 5 were unsupervised and 11 were supervised. All methods were tested on both a Naive Bayes (NB) classifier and a Support Vector Machine (SVM) classifier. We conducted multi-class experiments using a collection with 20 non-overlapping categories, and each feature selection method produced feature sets common to all the categories. We also combined feature selection methods and evaluated their joint performance. We found that the classical supervised methods had the best performance, including Chi Square, Information Gain and Mutual Information. The Chi Square variant GSS coefficient was also among the top performers. Odds Ratio showed excellent performance for NB, but not for SVM. The three unsupervised methods Collection Frequency, Collection Frequency Inverse Document Frequency and Term Frequency Document Frequency all showed performances close to the best group. The Bi-Normal Separation (BNS) metric produced excellent results for the smallest feature subsets. The weirdness factor performed several times better than random selection, but was not among the top-performing group. Some combination experiments achieved better results than each method alone, but the majority did not. The top performers Chi Square and GSS coefficient classified more documents when used together than alone. Four of the five combinations that showed an increase in performance included the BNS metric." @default.
- W607530189 created "2016-06-24" @default.
- W607530189 creator A5067138475 @default.
- W607530189 date "2009-01-01" @default.
- W607530189 modified "2023-09-24" @default.
- W607530189 title "Feature Selection for Text Categorisation" @default.
- W607530189 cites W140777655 @default.
- W607530189 cites W1492700449 @default.
- W607530189 cites W1509392121 @default.
- W607530189 cites W1520963883 @default.
- W607530189 cites W1523389133 @default.
- W607530189 cites W1532325895 @default.
- W607530189 cites W1576676390 @default.
- W607530189 cites W1592212241 @default.
- W607530189 cites W1595744771 @default.
- W607530189 cites W1620204465 @default.
- W607530189 cites W1660390307 @default.
- W607530189 cites W1853239596 @default.
- W607530189 cites W1972640883 @default.
- W607530189 cites W1982291220 @default.
- W607530189 cites W1983078185 @default.
- W607530189 cites W2008941900 @default.
- W607530189 cites W2024228866 @default.
- W607530189 cites W2071664212 @default.
- W607530189 cites W2096152098 @default.
- W607530189 cites W2103333826 @default.
- W607530189 cites W2149684865 @default.
- W607530189 cites W2149706766 @default.
- W607530189 cites W2150102617 @default.
- W607530189 cites W2165849038 @default.
- W607530189 cites W2166183437 @default.
- W607530189 cites W2185854690 @default.
- W607530189 cites W2391197895 @default.
- W607530189 cites W2435251607 @default.
- W607530189 cites W2478273975 @default.
- W607530189 cites W2504757255 @default.
- W607530189 hasPublicationYear "2009" @default.
- W607530189 type Work @default.
- W607530189 sameAs 607530189 @default.
- W607530189 citedByCount "4" @default.
- W607530189 countsByYear W6075301892014 @default.
- W607530189 countsByYear W6075301892015 @default.
- W607530189 countsByYear W6075301892017 @default.
- W607530189 crossrefType "dissertation" @default.
- W607530189 hasAuthorship W607530189A5067138475 @default.
- W607530189 hasConcept C119857082 @default.
- W607530189 hasConcept C12267149 @default.
- W607530189 hasConcept C148483581 @default.
- W607530189 hasConcept C154945302 @default.
- W607530189 hasConcept C177264268 @default.
- W607530189 hasConcept C199360897 @default.
- W607530189 hasConcept C204321447 @default.
- W607530189 hasConcept C23123220 @default.
- W607530189 hasConcept C2777212361 @default.
- W607530189 hasConcept C2780479914 @default.
- W607530189 hasConcept C2986744138 @default.
- W607530189 hasConcept C41008148 @default.
- W607530189 hasConcept C66905080 @default.
- W607530189 hasConcept C94124525 @default.
- W607530189 hasConcept C95623464 @default.
- W607530189 hasConceptScore W607530189C119857082 @default.
- W607530189 hasConceptScore W607530189C12267149 @default.
- W607530189 hasConceptScore W607530189C148483581 @default.
- W607530189 hasConceptScore W607530189C154945302 @default.
- W607530189 hasConceptScore W607530189C177264268 @default.
- W607530189 hasConceptScore W607530189C199360897 @default.
- W607530189 hasConceptScore W607530189C204321447 @default.
- W607530189 hasConceptScore W607530189C23123220 @default.
- W607530189 hasConceptScore W607530189C2777212361 @default.
- W607530189 hasConceptScore W607530189C2780479914 @default.
- W607530189 hasConceptScore W607530189C2986744138 @default.
- W607530189 hasConceptScore W607530189C41008148 @default.
- W607530189 hasConceptScore W607530189C66905080 @default.
- W607530189 hasConceptScore W607530189C94124525 @default.
- W607530189 hasConceptScore W607530189C95623464 @default.
- W607530189 hasLocation W6075301891 @default.
- W607530189 hasOpenAccess W607530189 @default.
- W607530189 hasPrimaryLocation W6075301891 @default.
- W607530189 hasRelatedWork W1514800236 @default.
- W607530189 hasRelatedWork W151898501 @default.
- W607530189 hasRelatedWork W1524564918 @default.
- W607530189 hasRelatedWork W1902658720 @default.
- W607530189 hasRelatedWork W2022689990 @default.
- W607530189 hasRelatedWork W2023384245 @default.
- W607530189 hasRelatedWork W2045424838 @default.
- W607530189 hasRelatedWork W2070382004 @default.
- W607530189 hasRelatedWork W2071888997 @default.
- W607530189 hasRelatedWork W2090702536 @default.
- W607530189 hasRelatedWork W2153272216 @default.
- W607530189 hasRelatedWork W2279354358 @default.
- W607530189 hasRelatedWork W2366199453 @default.
- W607530189 hasRelatedWork W2518330421 @default.
- W607530189 hasRelatedWork W2772554035 @default.
- W607530189 hasRelatedWork W2901030322 @default.
- W607530189 hasRelatedWork W2981340714 @default.
- W607530189 hasRelatedWork W3047189009 @default.
- W607530189 hasRelatedWork W3138444395 @default.
- W607530189 hasRelatedWork W1764667536 @default.
- W607530189 isParatext "false" @default.
- W607530189 isRetracted "false" @default.