Matches in SemOpenAlex for { <https://semopenalex.org/work/W104909239> ?p ?o ?g. }
- W104909239 abstract "Proceedings of the 2010 SIAM International Conference on Data Mining (SDM). A Generalized Tree Matching Algorithm Considering Nested Lists for Web Data Extraction. Nitin Jindal and Bing Liu. pp. 930-941. Chapter DOI: https://doi.org/10.1137/1.9781611972801.81. Abstract: This paper studies structured data extraction from Web pages. One of the effective methods is tree matching, which can detect template patterns from web pages used for extraction. However, one major limitation of existing tree matching algorithms is their inability to deal with embedded lists with repeated patterns. In the Web context, lists are everywhere, e.g., lists of products, jobs and publications. Due to the fact that lists in trees may have different lengths, the match score of the trees can be very low although they follow exactly the same template pattern. To make the matter worse, a list can have nested lists in it at any level. To solve this problem, existing research uses various heuristics to detect candidate lists first and then applies tree matching to generate data extraction patterns. This paper proposes a generalized tree matching algorithm by extending an existing tree matching algorithm with the ability to handle nested lists through a novel grammar generation algorithm. To the best of our knowledge, this is the first tree matching algorithm that is able to consider lists. In addition, it is well-known that there are two problem formulations for Web data extraction: (1) pattern generation based on multiple pages following the same template, and (2) pattern generation based on a single page containing lists of data instances following the same templates (each list may use a different template). These two problems are currently solved using different algorithms.
The proposed (single) algorithm is able to solve both problems effectively. Extensive experiments show that the new algorithm outperforms the state-of-the-art existing systems for both problems considerably. Published: 2010. ISBN: 978-0-89871-703-7. eISBN: 978-1-61197-280-1. https://doi.org/10.1137/1.9781611972801. Book Series Name: Proceedings. Book Code: PR136. Book Pages: 1-953. Key words: Web data extraction, Web mining" @default.
- W104909239 created "2016-06-24" @default.
- W104909239 creator A5009988923 @default.
- W104909239 creator A5070882730 @default.
- W104909239 date "2010-04-29" @default.
- W104909239 modified "2023-09-24" @default.
- W104909239 title "A Generalized Tree Matching Algorithm Considering Nested Lists for Web Data Extraction" @default.
- W104909239 cites W1515087027 @default.
- W104909239 cites W1519606823 @default.
- W104909239 cites W1553019137 @default.
- W104909239 cites W1969005071 @default.
- W104909239 cites W1975009259 @default.
- W104909239 cites W2013725398 @default.
- W104909239 cites W2015551056 @default.
- W104909239 cites W2034797903 @default.
- W104909239 cites W2049461910 @default.
- W104909239 cites W2054658115 @default.
- W104909239 cites W2065568440 @default.
- W104909239 cites W2069388662 @default.
- W104909239 cites W2072489225 @default.
- W104909239 cites W2088600132 @default.
- W104909239 cites W2093559286 @default.
- W104909239 cites W2096496923 @default.
- W104909239 cites W2101909884 @default.
- W104909239 cites W2102694093 @default.
- W104909239 cites W2104086170 @default.
- W104909239 cites W2106235201 @default.
- W104909239 cites W2116493296 @default.
- W104909239 cites W2127855221 @default.
- W104909239 cites W2128341918 @default.
- W104909239 cites W2129595335 @default.
- W104909239 cites W2133669904 @default.
- W104909239 cites W2135479443 @default.
- W104909239 cites W2136753314 @default.
- W104909239 cites W2143309843 @default.
- W104909239 cites W2150721933 @default.
- W104909239 cites W2153072229 @default.
- W104909239 cites W2154444297 @default.
- W104909239 cites W2157160236 @default.
- W104909239 cites W2160196229 @default.
- W104909239 cites W2364964713 @default.
- W104909239 cites W1592635378 @default.
- W104909239 doi "https://doi.org/10.1137/1.9781611972801.81" @default.
- W104909239 hasPublicationYear "2010" @default.
- W104909239 type Work @default.
- W104909239 sameAs 104909239 @default.
- W104909239 citedByCount "24" @default.
- W104909239 countsByYear W1049092392012 @default.
- W104909239 countsByYear W1049092392013 @default.
- W104909239 countsByYear W1049092392014 @default.
- W104909239 countsByYear W1049092392015 @default.
- W104909239 countsByYear W1049092392017 @default.
- W104909239 countsByYear W1049092392019 @default.
- W104909239 countsByYear W1049092392020 @default.
- W104909239 crossrefType "proceedings-article" @default.
- W104909239 hasAuthorship W104909239A5009988923 @default.
- W104909239 hasAuthorship W104909239A5070882730 @default.
- W104909239 hasConcept C103000020 @default.
- W104909239 hasConcept C105795698 @default.
- W104909239 hasConcept C111919701 @default.
- W104909239 hasConcept C113174947 @default.
- W104909239 hasConcept C11413529 @default.
- W104909239 hasConcept C124101348 @default.
- W104909239 hasConcept C127705205 @default.
- W104909239 hasConcept C134306372 @default.
- W104909239 hasConcept C136764020 @default.
- W104909239 hasConcept C151730666 @default.
- W104909239 hasConcept C165064840 @default.
- W104909239 hasConcept C21959979 @default.
- W104909239 hasConcept C23123220 @default.
- W104909239 hasConcept C2779343474 @default.
- W104909239 hasConcept C33923547 @default.
- W104909239 hasConcept C41008148 @default.
- W104909239 hasConcept C5655090 @default.
- W104909239 hasConcept C80444323 @default.
- W104909239 hasConcept C86803240 @default.
- W104909239 hasConceptScore W104909239C103000020 @default.
- W104909239 hasConceptScore W104909239C105795698 @default.
- W104909239 hasConceptScore W104909239C111919701 @default.
- W104909239 hasConceptScore W104909239C113174947 @default.
- W104909239 hasConceptScore W104909239C11413529 @default.
- W104909239 hasConceptScore W104909239C124101348 @default.
- W104909239 hasConceptScore W104909239C127705205 @default.
- W104909239 hasConceptScore W104909239C134306372 @default.
- W104909239 hasConceptScore W104909239C136764020 @default.
- W104909239 hasConceptScore W104909239C151730666 @default.
- W104909239 hasConceptScore W104909239C165064840 @default.
- W104909239 hasConceptScore W104909239C21959979 @default.
- W104909239 hasConceptScore W104909239C23123220 @default.
- W104909239 hasConceptScore W104909239C2779343474 @default.
- W104909239 hasConceptScore W104909239C33923547 @default.
- W104909239 hasConceptScore W104909239C41008148 @default.
- W104909239 hasConceptScore W104909239C5655090 @default.
- W104909239 hasConceptScore W104909239C80444323 @default.
- W104909239 hasConceptScore W104909239C86803240 @default.
- W104909239 hasLocation W1049092391 @default.
- W104909239 hasOpenAccess W104909239 @default.
- W104909239 hasPrimaryLocation W1049092391 @default.
- W104909239 hasRelatedWork W1494002846 @default.
- W104909239 hasRelatedWork W1548492051 @default.