Matches in SemOpenAlex for { <https://semopenalex.org/work/W2276279539> ?p ?o ?g. }
- W2276279539 abstract "The World Wide Web contains vast amounts of data, but only a small portion of it is accessible to machines in an operational form. The rest sits behind a presentation layer that renders web pages in a human-friendly form but hampers machine processing of the data. Converting web data into operational form is the task of data extraction. Current approaches to data extraction from the web either require human effort to guide supervised learning algorithms or are customized to extract a narrow range of data types in specific domains. We focus on the broader problem of discovering the underlying structure of any database-generated web site. Our approach automatically discovers relational data that is hidden behind these web sites by combining experts that identify the relationship between surface structure and the underlying structure. We use a set of software experts that analyze a web site's pages. Each of these experts is specialized to recognize a particular type of structure. These experts discover similarities between data items within the context of the particular types of structure they analyze and output their discoveries as hypotheses in a common hypothesis language. We find the most likely clustering of data using a probabilistic framework in which the hypotheses provide the evidence. From the clusters, the relational form of the data is derived. We develop two frameworks following the principles of our approach. The first framework introduces a common hypothesis language in which heterogeneous experts express their discoveries. The second framework extends the common language to allow experts to assign confidence scores to their hypotheses. We experiment in the web domain by comparing the output of our approach to the data extracted by a supervised wrapper-induction system and validated manually.
Our results show that our approach performs well in the data extraction task on a variety of web sites. Our approach is applicable to other structure discovery problems as well. We demonstrate this by successfully applying our approach in the record deduplication domain." @default.
- W2276279539 created "2016-06-24" @default.
- W2276279539 creator A5019089688 @default.
- W2276279539 creator A5062362044 @default.
- W2276279539 date "2008-01-01" @default.
- W2276279539 modified "2023-09-27" @default.
- W2276279539 title "Discovering web structure with multiple experts in a clustering framework" @default.
- W2276279539 cites W106033028 @default.
- W2276279539 cites W136646293 @default.
- W2276279539 cites W1485997076 @default.
- W2276279539 cites W1487588218 @default.
- W2276279539 cites W1564249232 @default.
- W2276279539 cites W1569752386 @default.
- W2276279539 cites W1577626262 @default.
- W2276279539 cites W1596382552 @default.
- W2276279539 cites W1612003148 @default.
- W2276279539 cites W1646278814 @default.
- W2276279539 cites W1761401273 @default.
- W2276279539 cites W1800493452 @default.
- W2276279539 cites W1829475407 @default.
- W2276279539 cites W1880262756 @default.
- W2276279539 cites W1956559956 @default.
- W2276279539 cites W1979629649 @default.
- W2276279539 cites W1984942099 @default.
- W2276279539 cites W1992419399 @default.
- W2276279539 cites W1994584977 @default.
- W2276279539 cites W2007172042 @default.
- W2276279539 cites W2013725398 @default.
- W2276279539 cites W2013970953 @default.
- W2276279539 cites W2019838344 @default.
- W2276279539 cites W203603362 @default.
- W2276279539 cites W2036987424 @default.
- W2276279539 cites W2042448356 @default.
- W2276279539 cites W2046934276 @default.
- W2276279539 cites W204750116 @default.
- W2276279539 cites W2048314862 @default.
- W2276279539 cites W2051768896 @default.
- W2276279539 cites W2071744657 @default.
- W2276279539 cites W2073471108 @default.
- W2276279539 cites W2077990749 @default.
- W2276279539 cites W2078206655 @default.
- W2276279539 cites W2078762661 @default.
- W2276279539 cites W2080676333 @default.
- W2276279539 cites W2092772700 @default.
- W2276279539 cites W2093559286 @default.
- W2276279539 cites W2096496923 @default.
- W2276279539 cites W2097730395 @default.
- W2276279539 cites W2099283723 @default.
- W2276279539 cites W2102189859 @default.
- W2276279539 cites W2104086170 @default.
- W2276279539 cites W2115304356 @default.
- W2276279539 cites W2115657355 @default.
- W2276279539 cites W2115770258 @default.
- W2276279539 cites W2125570474 @default.
- W2276279539 cites W2129113961 @default.
- W2276279539 cites W2134089414 @default.
- W2276279539 cites W2134150392 @default.
- W2276279539 cites W2135479443 @default.
- W2276279539 cites W2139578439 @default.
- W2276279539 cites W2139956879 @default.
- W2276279539 cites W2141797076 @default.
- W2276279539 cites W2142399819 @default.
- W2276279539 cites W2143349571 @default.
- W2276279539 cites W2150721933 @default.
- W2276279539 cites W2151910393 @default.
- W2276279539 cites W2154785834 @default.
- W2276279539 cites W2157369095 @default.
- W2276279539 cites W2159080219 @default.
- W2276279539 cites W2160619799 @default.
- W2276279539 cites W2164456230 @default.
- W2276279539 cites W2166686713 @default.
- W2276279539 cites W2166817967 @default.
- W2276279539 cites W2177750903 @default.
- W2276279539 cites W2271347174 @default.
- W2276279539 cites W2402933174 @default.
- W2276279539 cites W2434205482 @default.
- W2276279539 cites W2440833291 @default.
- W2276279539 cites W2610496362 @default.
- W2276279539 cites W2799061466 @default.
- W2276279539 cites W2988119170 @default.
- W2276279539 cites W3036215894 @default.
- W2276279539 cites W46452414 @default.
- W2276279539 cites W66767626 @default.
- W2276279539 cites W82050445 @default.
- W2276279539 hasPublicationYear "2008" @default.
- W2276279539 type Work @default.
- W2276279539 sameAs 2276279539 @default.
- W2276279539 citedByCount "0" @default.
- W2276279539 crossrefType "journal-article" @default.
- W2276279539 hasAuthorship W2276279539A5019089688 @default.
- W2276279539 hasAuthorship W2276279539A5062362044 @default.
- W2276279539 hasConcept C130436687 @default.
- W2276279539 hasConcept C136764020 @default.
- W2276279539 hasConcept C151730666 @default.
- W2276279539 hasConcept C154945302 @default.
- W2276279539 hasConcept C162005631 @default.
- W2276279539 hasConcept C21959979 @default.
- W2276279539 hasConcept C23123220 @default.
- W2276279539 hasConcept C2522767166 @default.
- W2276279539 hasConcept C2779343474 @default.