Matches in SemOpenAlex for { <https://semopenalex.org/work/W2963535486> ?p ?o ?g. }
- W2963535486 abstract "Set similarity join, as well as the corresponding indexing problem set similarity search, are fundamental primitives for managing noisy or uncertain data. For example, these primitives can be used in data cleaning to identify different representations of the same object. In many cases one can represent an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries in such a vector. A set similarity join can then be used to identify those pairs that have an exceptionally large dot product (or intersection, when viewed as sets). We choose to focus on identifying vectors with large Pearson correlation, but results extend to other similarity measures. In particular, we consider the indexing problem of identifying correlated vectors in a set S of vectors sampled from {0,1}^d. Given a query vector y and a parameter alpha in (0,1), we need to search for an alpha-correlated vector x in a data structure representing the vectors of S. This kind of similarity search has been intensely studied in worst-case (non-random data) settings. Existing theoretically well-founded methods for set similarity search are often inferior to heuristics that take advantage of skew in the data distribution, i.e., widely differing frequencies of 1s across the d dimensions. The main contribution of this paper is to analyze the set similarity problem under a random data model that reflects the kind of skewed data distributions seen in practice, allowing theoretical results much stronger than what is possible in worst-case settings. Our indexing data structure is a recursive, data-dependent partitioning of vectors inspired by recent advances in set similarity search. Previous data-dependent methods do not seem to allow us to exploit skew in item frequencies, so we believe that our work sheds further light on the power of data dependence." @default.
- W2963535486 created "2019-07-30" @default.
- W2963535486 creator A5006653893 @default.
- W2963535486 creator A5014293815 @default.
- W2963535486 creator A5027425067 @default.
- W2963535486 date "2018-05-27" @default.
- W2963535486 modified "2023-10-16" @default.
- W2963535486 title "Set Similarity Search for Skewed Data" @default.
- W2963535486 cites W1430582609 @default.
- W2963535486 cites W1455310343 @default.
- W2963535486 cites W1647207848 @default.
- W2963535486 cites W1973001156 @default.
- W2963535486 cites W1977046819 @default.
- W2963535486 cites W1989664320 @default.
- W2963535486 cites W1995725694 @default.
- W2963535486 cites W2010416066 @default.
- W2963535486 cites W2012833704 @default.
- W2963535486 cites W2017851434 @default.
- W2963535486 cites W2080844740 @default.
- W2963535486 cites W2096598900 @default.
- W2963535486 cites W2097184821 @default.
- W2963535486 cites W2097776316 @default.
- W2963535486 cites W2098900423 @default.
- W2963535486 cites W2115854352 @default.
- W2963535486 cites W2121269638 @default.
- W2963535486 cites W2127859122 @default.
- W2963535486 cites W2152565070 @default.
- W2963535486 cites W2263882035 @default.
- W2963535486 cites W2268968630 @default.
- W2963535486 cites W2295744963 @default.
- W2963535486 cites W2396588571 @default.
- W2963535486 cites W2568140450 @default.
- W2963535486 cites W2574633002 @default.
- W2963535486 cites W2585199909 @default.
- W2963535486 cites W2604829000 @default.
- W2963535486 cites W2612210001 @default.
- W2963535486 cites W2735058673 @default.
- W2963535486 cites W2963046172 @default.
- W2963535486 cites W2963886823 @default.
- W2963535486 cites W2964013013 @default.
- W2963535486 cites W2964142086 @default.
- W2963535486 cites W3098556943 @default.
- W2963535486 cites W3105727767 @default.
- W2963535486 cites W566315627 @default.
- W2963535486 doi "https://doi.org/10.1145/3196959.3196985" @default.
- W2963535486 hasPublicationYear "2018" @default.
- W2963535486 type Work @default.
- W2963535486 sameAs 2963535486 @default.
- W2963535486 citedByCount "7" @default.
- W2963535486 countsByYear W29635354862019 @default.
- W2963535486 countsByYear W29635354862020 @default.
- W2963535486 countsByYear W29635354862023 @default.
- W2963535486 crossrefType "proceedings-article" @default.
- W2963535486 hasAuthorship W2963535486A5006653893 @default.
- W2963535486 hasAuthorship W2963535486A5014293815 @default.
- W2963535486 hasAuthorship W2963535486A5027425067 @default.
- W2963535486 hasBestOaLocation W29635354862 @default.
- W2963535486 hasConcept C103278499 @default.
- W2963535486 hasConcept C111919701 @default.
- W2963535486 hasConcept C115961682 @default.
- W2963535486 hasConcept C116738811 @default.
- W2963535486 hasConcept C124101348 @default.
- W2963535486 hasConcept C127705205 @default.
- W2963535486 hasConcept C154945302 @default.
- W2963535486 hasConcept C177264268 @default.
- W2963535486 hasConcept C199360897 @default.
- W2963535486 hasConcept C33923547 @default.
- W2963535486 hasConcept C41008148 @default.
- W2963535486 hasConcept C58489278 @default.
- W2963535486 hasConcept C75165309 @default.
- W2963535486 hasConcept C80444323 @default.
- W2963535486 hasConceptScore W2963535486C103278499 @default.
- W2963535486 hasConceptScore W2963535486C111919701 @default.
- W2963535486 hasConceptScore W2963535486C115961682 @default.
- W2963535486 hasConceptScore W2963535486C116738811 @default.
- W2963535486 hasConceptScore W2963535486C124101348 @default.
- W2963535486 hasConceptScore W2963535486C127705205 @default.
- W2963535486 hasConceptScore W2963535486C154945302 @default.
- W2963535486 hasConceptScore W2963535486C177264268 @default.
- W2963535486 hasConceptScore W2963535486C199360897 @default.
- W2963535486 hasConceptScore W2963535486C33923547 @default.
- W2963535486 hasConceptScore W2963535486C41008148 @default.
- W2963535486 hasConceptScore W2963535486C58489278 @default.
- W2963535486 hasConceptScore W2963535486C75165309 @default.
- W2963535486 hasConceptScore W2963535486C80444323 @default.
- W2963535486 hasFunder F4320338335 @default.
- W2963535486 hasLocation W29635354861 @default.
- W2963535486 hasLocation W29635354862 @default.
- W2963535486 hasOpenAccess W2963535486 @default.
- W2963535486 hasPrimaryLocation W29635354861 @default.
- W2963535486 hasRelatedWork W1480566255 @default.
- W2963535486 hasRelatedWork W1610355325 @default.
- W2963535486 hasRelatedWork W1800174300 @default.
- W2963535486 hasRelatedWork W2013685631 @default.
- W2963535486 hasRelatedWork W2023309578 @default.
- W2963535486 hasRelatedWork W2350256438 @default.
- W2963535486 hasRelatedWork W2950930770 @default.
- W2963535486 hasRelatedWork W2980679393 @default.
- W2963535486 hasRelatedWork W4297668619 @default.
- W2963535486 hasRelatedWork W4308950455 @default.
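
The abstract above treats each object as a sparse 0-1 vector, or equivalently as the set of its nonzero entries, and asks for a vector in S that is alpha-correlated with a query y. As a minimal sketch of that problem statement only, the Python below computes the empirical Pearson correlation of two 0-1 vectors from their set representations and answers a query by brute-force linear scan. The names `pearson_01` and `linear_scan` are hypothetical, the empirical correlation is an assumption about what "alpha-correlated" means here, and the scan is a naive baseline, not the recursive data-dependent index the paper describes.

```python
from math import sqrt


def pearson_01(x: set[int], y: set[int], d: int) -> float:
    """Empirical Pearson correlation of two 0-1 vectors of dimension d,
    each given as the set of its nonzero coordinates."""
    nx, ny = len(x), len(y)
    inter = len(x & y)  # dot product of the two 0-1 vectors
    # Standard Pearson formula specialised to 0-1 entries (sum of squares = sum).
    num = d * inter - nx * ny
    den = sqrt(nx * (d - nx) * ny * (d - ny))
    if den == 0:
        raise ValueError("correlation is undefined for all-zero or all-one vectors")
    return num / den


def linear_scan(S: list[set[int]], y: set[int], d: int, alpha: float):
    """Naive baseline: return the first x in S whose correlation with y
    is at least alpha, or None if there is no such vector."""
    for x in S:
        if pearson_01(x, y, d) >= alpha:
            return x
    return None
```

For example, with d = 4, `linear_scan([{2, 3}, {0, 1, 2}], {0, 1}, 4, 0.5)` skips the first set, which shares no coordinate with the query, and returns the second.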