Matches in SemOpenAlex for { <https://semopenalex.org/work/W1592247788> ?p ?o ?g. }
- W1592247788 endingPage "226" @default.
- W1592247788 startingPage "216" @default.
- W1592247788 abstract "The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes. ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine-grained clustering has not been previously demonstrated. Previous approaches clustered a sample that limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than the existing algorithms. Fine-grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size. These fine-grained clusters show improved cluster quality when assessed with two novel evaluations using ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing the quality of clusters where categorical labeling is unavailable and unfeasible." @default.
- W1592247788 created "2016-06-24" @default.
- W1592247788 creator A5010759879 @default.
- W1592247788 creator A5017132748 @default.
- W1592247788 creator A5045420626 @default.
- W1592247788 creator A5054450193 @default.
- W1592247788 date "2015-05-18" @default.
- W1592247788 modified "2023-09-23" @default.
- W1592247788 title "Parallel Streaming Signature EM-tree: A Clustering Algorithm for Web Scale Applications" @default.
- W1592247788 cites W1532325895 @default.
- W1592247788 cites W153411086 @default.
- W1592247788 cites W1554394657 @default.
- W1592247788 cites W1634005169 @default.
- W1592247788 cites W1685426458 @default.
- W1592247788 cites W1870428314 @default.
- W1592247788 cites W188912188 @default.
- W1592247788 cites W1966979133 @default.
- W1592247788 cites W2012833704 @default.
- W1592247788 cites W2032475142 @default.
- W1592247788 cites W2036295879 @default.
- W1592247788 cites W2036477303 @default.
- W1592247788 cites W2041565863 @default.
- W1592247788 cites W2062723323 @default.
- W1592247788 cites W2065347041 @default.
- W1592247788 cites W2073459066 @default.
- W1592247788 cites W2076784111 @default.
- W1592247788 cites W2089497633 @default.
- W1592247788 cites W2095897464 @default.
- W1592247788 cites W2105406322 @default.
- W1592247788 cites W2108399535 @default.
- W1592247788 cites W2120480077 @default.
- W1592247788 cites W2150102617 @default.
- W1592247788 cites W2165932491 @default.
- W1592247788 cites W2167853719 @default.
- W1592247788 cites W2184557417 @default.
- W1592247788 cites W2190924945 @default.
- W1592247788 cites W2294145134 @default.
- W1592247788 cites W2750287184 @default.
- W1592247788 cites W2882319491 @default.
- W1592247788 cites W2949863272 @default.
- W1592247788 cites W2950789693 @default.
- W1592247788 cites W2962734018 @default.
- W1592247788 cites W2979473749 @default.
- W1592247788 cites W299870200 @default.
- W1592247788 cites W3013019162 @default.
- W1592247788 hasPublicationYear "2015" @default.
- W1592247788 type Work @default.
- W1592247788 sameAs 1592247788 @default.
- W1592247788 citedByCount "2" @default.
- W1592247788 countsByYear W15922477882018 @default.
- W1592247788 countsByYear W15922477882019 @default.
- W1592247788 crossrefType "proceedings-article" @default.
- W1592247788 hasAuthorship W1592247788A5010759879 @default.
- W1592247788 hasAuthorship W1592247788A5017132748 @default.
- W1592247788 hasAuthorship W1592247788A5045420626 @default.
- W1592247788 hasAuthorship W1592247788A5054450193 @default.
- W1592247788 hasBestOaLocation W15922477881 @default.
- W1592247788 hasConcept C111919701 @default.
- W1592247788 hasConcept C113174947 @default.
- W1592247788 hasConcept C11413529 @default.
- W1592247788 hasConcept C124101348 @default.
- W1592247788 hasConcept C134306372 @default.
- W1592247788 hasConcept C136764020 @default.
- W1592247788 hasConcept C154945302 @default.
- W1592247788 hasConcept C177937566 @default.
- W1592247788 hasConcept C199683683 @default.
- W1592247788 hasConcept C21959979 @default.
- W1592247788 hasConcept C23123220 @default.
- W1592247788 hasConcept C33704608 @default.
- W1592247788 hasConcept C33923547 @default.
- W1592247788 hasConcept C41008148 @default.
- W1592247788 hasConcept C48044578 @default.
- W1592247788 hasConcept C73555534 @default.
- W1592247788 hasConcept C77088390 @default.
- W1592247788 hasConcept C94641424 @default.
- W1592247788 hasConceptScore W1592247788C111919701 @default.
- W1592247788 hasConceptScore W1592247788C113174947 @default.
- W1592247788 hasConceptScore W1592247788C11413529 @default.
- W1592247788 hasConceptScore W1592247788C124101348 @default.
- W1592247788 hasConceptScore W1592247788C134306372 @default.
- W1592247788 hasConceptScore W1592247788C136764020 @default.
- W1592247788 hasConceptScore W1592247788C154945302 @default.
- W1592247788 hasConceptScore W1592247788C177937566 @default.
- W1592247788 hasConceptScore W1592247788C199683683 @default.
- W1592247788 hasConceptScore W1592247788C21959979 @default.
- W1592247788 hasConceptScore W1592247788C23123220 @default.
- W1592247788 hasConceptScore W1592247788C33704608 @default.
- W1592247788 hasConceptScore W1592247788C33923547 @default.
- W1592247788 hasConceptScore W1592247788C41008148 @default.
- W1592247788 hasConceptScore W1592247788C48044578 @default.
- W1592247788 hasConceptScore W1592247788C73555534 @default.
- W1592247788 hasConceptScore W1592247788C77088390 @default.
- W1592247788 hasConceptScore W1592247788C94641424 @default.
- W1592247788 hasLocation W15922477881 @default.
- W1592247788 hasLocation W15922477882 @default.
- W1592247788 hasOpenAccess W1592247788 @default.
- W1592247788 hasPrimaryLocation W15922477881 @default.
- W1592247788 hasRelatedWork W1976273795 @default.