Matches in SemOpenAlex for { <https://semopenalex.org/work/W71104953> ?p ?o ?g. }
Showing items 1 to 75 of 75 with 100 items per page.
- W71104953 abstract "A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying irrelevant pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites. In addition, we propose a novel technique for learning a conforming pair of classifiers using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make similar predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in absence of a validation set." @default.
- W71104953 created "2016-06-24" @default.
- W71104953 creator A5001294898 @default.
- W71104953 creator A5009542542 @default.
- W71104953 creator A5076157833 @default.
- W71104953 creator A5089085275 @default.
- W71104953 date "2013-05-13" @default.
- W71104953 modified "2023-09-27" @default.
- W71104953 title "Researcher homepage classification using unlabeled data" @default.
- W71104953 cites W1501667924 @default.
- W71104953 cites W1983531058 @default.
- W71104953 cites W2003471189 @default.
- W71104953 cites W2022322548 @default.
- W71104953 cites W2037603696 @default.
- W71104953 cites W2048679005 @default.
- W71104953 cites W2059586463 @default.
- W71104953 cites W2081580037 @default.
- W71104953 cites W2097083044 @default.
- W71104953 cites W2097089247 @default.
- W71104953 cites W2101210369 @default.
- W71104953 cites W2104660959 @default.
- W71104953 cites W2111700528 @default.
- W71104953 cites W2121702856 @default.
- W71104953 cites W2125327503 @default.
- W71104953 cites W2133990480 @default.
- W71104953 cites W2134491992 @default.
- W71104953 cites W2141416357 @default.
- W71104953 cites W2153635508 @default.
- W71104953 cites W2156772624 @default.
- W71104953 cites W2161920802 @default.
- W71104953 cites W2169899598 @default.
- W71104953 cites W2171629518 @default.
- W71104953 cites W4235505822 @default.
- W71104953 cites W87822204 @default.
- W71104953 doi "https://doi.org/10.1145/2488388.2488430" @default.
- W71104953 hasPublicationYear "2013" @default.
- W71104953 type Work @default.
- W71104953 sameAs 71104953 @default.
- W71104953 citedByCount "14" @default.
- W71104953 countsByYear W711049532014 @default.
- W71104953 countsByYear W711049532015 @default.
- W71104953 countsByYear W711049532016 @default.
- W71104953 countsByYear W711049532017 @default.
- W71104953 countsByYear W711049532019 @default.
- W71104953 countsByYear W711049532021 @default.
- W71104953 crossrefType "proceedings-article" @default.
- W71104953 hasAuthorship W71104953A5001294898 @default.
- W71104953 hasAuthorship W71104953A5009542542 @default.
- W71104953 hasAuthorship W71104953A5076157833 @default.
- W71104953 hasAuthorship W71104953A5089085275 @default.
- W71104953 hasConcept C154945302 @default.
- W71104953 hasConcept C204321447 @default.
- W71104953 hasConcept C23123220 @default.
- W71104953 hasConcept C41008148 @default.
- W71104953 hasConceptScore W71104953C154945302 @default.
- W71104953 hasConceptScore W71104953C204321447 @default.
- W71104953 hasConceptScore W71104953C23123220 @default.
- W71104953 hasConceptScore W71104953C41008148 @default.
- W71104953 hasLocation W711049531 @default.
- W71104953 hasOpenAccess W71104953 @default.
- W71104953 hasPrimaryLocation W711049531 @default.
- W71104953 hasRelatedWork W1552159754 @default.
- W71104953 hasRelatedWork W2115485936 @default.
- W71104953 hasRelatedWork W2144190808 @default.
- W71104953 hasRelatedWork W2357241418 @default.
- W71104953 hasRelatedWork W2366644548 @default.
- W71104953 hasRelatedWork W2368651715 @default.
- W71104953 hasRelatedWork W2376314740 @default.
- W71104953 hasRelatedWork W2384888906 @default.
- W71104953 hasRelatedWork W2611614995 @default.
- W71104953 hasRelatedWork W2789919619 @default.
- W71104953 isParatext "false" @default.
- W71104953 isRetracted "false" @default.
- W71104953 magId "71104953" @default.
- W71104953 workType "article" @default.