Matches in SemOpenAlex for { <https://semopenalex.org/work/W3044367042> ?p ?o ?g. }
- W3044367042 endingPage "e17853" @default.
- W3044367042 startingPage "e17853" @default.
- W3044367042 abstract "Background The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information as given in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). Objective This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which includes all health-related web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW’s graph structure covering its size, most important content providers and a ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler. Methods A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non–health-related web pages. The classifier was evaluated using accuracy, recall and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated by using harvest rate and its recall was estimated using a seed-target approach.
Results In total, n=22,405 seed URLs with country-code top level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), .ch: 7.81% (1749/22,405), were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954) and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were web sites published by public institutions, 25% (19/75) were published by nonprofit organizations, and 35% (26/75) by private organizations or individuals. Conclusions The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allows for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used to assess important topics and trends and to build health-specific search engines." @default.
- W3044367042 created "2020-07-29" @default.
- W3044367042 creator A5051306698 @default.
- W3044367042 creator A5052606056 @default.
- W3044367042 creator A5067786272 @default.
- W3044367042 date "2020-07-24" @default.
- W3044367042 modified "2023-10-16" @default.
- W3044367042 title "Crawling the German Health Web: Exploratory Study and Graph Analysis" @default.
- W3044367042 cites W1475732121 @default.
- W3044367042 cites W1489992655 @default.
- W3044367042 cites W156343458 @default.
- W3044367042 cites W1646491708 @default.
- W3044367042 cites W1832137115 @default.
- W3044367042 cites W1838907055 @default.
- W3044367042 cites W1964421503 @default.
- W3044367042 cites W1975879668 @default.
- W3044367042 cites W1976334448 @default.
- W3044367042 cites W1978394996 @default.
- W3044367042 cites W1980580891 @default.
- W3044367042 cites W1989338554 @default.
- W3044367042 cites W2013469079 @default.
- W3044367042 cites W2014134732 @default.
- W3044367042 cites W2017224880 @default.
- W3044367042 cites W2017726337 @default.
- W3044367042 cites W2029341294 @default.
- W3044367042 cites W2034835104 @default.
- W3044367042 cites W2036120890 @default.
- W3044367042 cites W2037284289 @default.
- W3044367042 cites W2039161780 @default.
- W3044367042 cites W2040869915 @default.
- W3044367042 cites W2045257928 @default.
- W3044367042 cites W2045833544 @default.
- W3044367042 cites W2045998703 @default.
- W3044367042 cites W2047607378 @default.
- W3044367042 cites W2079175517 @default.
- W3044367042 cites W2083089853 @default.
- W3044367042 cites W2101612333 @default.
- W3044367042 cites W2116424877 @default.
- W3044367042 cites W2120101509 @default.
- W3044367042 cites W2121354841 @default.
- W3044367042 cites W2124496187 @default.
- W3044367042 cites W2124637492 @default.
- W3044367042 cites W2131681506 @default.
- W3044367042 cites W2133433261 @default.
- W3044367042 cites W2149684865 @default.
- W3044367042 cites W2158420658 @default.
- W3044367042 cites W2160375265 @default.
- W3044367042 cites W2164542999 @default.
- W3044367042 cites W2164777277 @default.
- W3044367042 cites W2168571645 @default.
- W3044367042 cites W2175110005 @default.
- W3044367042 cites W2261117743 @default.
- W3044367042 cites W2270949707 @default.
- W3044367042 cites W2274126126 @default.
- W3044367042 cites W2397847938 @default.
- W3044367042 cites W2425612555 @default.
- W3044367042 cites W2544700275 @default.
- W3044367042 cites W2595038737 @default.
- W3044367042 cites W2613488435 @default.
- W3044367042 cites W2620616291 @default.
- W3044367042 cites W2743928556 @default.
- W3044367042 cites W2912429677 @default.
- W3044367042 cites W2941486823 @default.
- W3044367042 cites W2963932066 @default.
- W3044367042 cites W3099539743 @default.
- W3044367042 cites W4231080135 @default.
- W3044367042 cites W4247642676 @default.
- W3044367042 cites W4292361644 @default.
- W3044367042 doi "https://doi.org/10.2196/17853" @default.
- W3044367042 hasPubMedCentralId "https://www.ncbi.nlm.nih.gov/pmc/articles/7414401" @default.
- W3044367042 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/32706701" @default.
- W3044367042 hasPublicationYear "2020" @default.
- W3044367042 type Work @default.
- W3044367042 sameAs 3044367042 @default.
- W3044367042 citedByCount "7" @default.
- W3044367042 countsByYear W30443670422021 @default.
- W3044367042 countsByYear W30443670422022 @default.
- W3044367042 countsByYear W30443670422023 @default.
- W3044367042 crossrefType "journal-article" @default.
- W3044367042 hasAuthorship W3044367042A5051306698 @default.
- W3044367042 hasAuthorship W3044367042A5052606056 @default.
- W3044367042 hasAuthorship W3044367042A5067786272 @default.
- W3044367042 hasBestOaLocation W30443670421 @default.
- W3044367042 hasConcept C100368936 @default.
- W3044367042 hasConcept C105702510 @default.
- W3044367042 hasConcept C106476913 @default.
- W3044367042 hasConcept C110875604 @default.
- W3044367042 hasConcept C136764020 @default.
- W3044367042 hasConcept C13743948 @default.
- W3044367042 hasConcept C138816342 @default.
- W3044367042 hasConcept C147268084 @default.
- W3044367042 hasConcept C154945302 @default.
- W3044367042 hasConcept C159110408 @default.
- W3044367042 hasConcept C185618831 @default.
- W3044367042 hasConcept C23123220 @default.
- W3044367042 hasConcept C2522767166 @default.
- W3044367042 hasConcept C41008148 @default.
- W3044367042 hasConcept C71924100 @default.