Matches in SemOpenAlex for { <https://semopenalex.org/work/W2964648167> ?p ?o ?g. }
- W2964648167 abstract "Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. 
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources. URL: https://bitbucket.org/biodbqual/benchmarks." @default.
- W2964648167 created "2019-08-13" @default.
- W2964648167 creator A5041495909 @default.
- W2964648167 creator A5042874172 @default.
- W2964648167 creator A5067214173 @default.
- W2964648167 date "2017-01-08" @default.
- W2964648167 modified "2023-09-26" @default.
- W2964648167 title "Benchmarks for measurement of duplicate detection methods in nucleotide databases" @default.
- W2964648167 cites W1597164057 @default.
- W2964648167 cites W1854015338 @default.
- W2964648167 cites W1975914194 @default.
- W2964648167 cites W2002180966 @default.
- W2964648167 cites W2019096410 @default.
- W2964648167 cites W2021952182 @default.
- W2964648167 cites W2023448865 @default.
- W2964648167 cites W2031250218 @default.
- W2964648167 cites W2055214702 @default.
- W2964648167 cites W2062296203 @default.
- W2964648167 cites W2065301447 @default.
- W2964648167 cites W2088027966 @default.
- W2964648167 cites W2102461176 @default.
- W2964648167 cites W2116699248 @default.
- W2964648167 cites W2124410686 @default.
- W2964648167 cites W2125353047 @default.
- W2964648167 cites W2129800387 @default.
- W2964648167 cites W2130253098 @default.
- W2964648167 cites W2138088864 @default.
- W2964648167 cites W2142678478 @default.
- W2964648167 cites W2146980885 @default.
- W2964648167 cites W2148448264 @default.
- W2964648167 cites W2152571127 @default.
- W2964648167 cites W2156125289 @default.
- W2964648167 cites W2156357245 @default.
- W2964648167 cites W2164260925 @default.
- W2964648167 cites W2170747616 @default.
- W2964648167 cites W2191378904 @default.
- W2964648167 cites W2191986163 @default.
- W2964648167 cites W2212657187 @default.
- W2964648167 cites W2493860774 @default.
- W2964648167 cites W2739999456 @default.
- W2964648167 cites W3146259567 @default.
- W2964648167 cites W4230612844 @default.
- W2964648167 cites W4235121031 @default.
- W2964648167 cites W4247554111 @default.
- W2964648167 doi "https://doi.org/10.1093/database/baw164" @default.
- W2964648167 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/28334741" @default.
- W2964648167 hasPublicationYear "2017" @default.
- W2964648167 type Work @default.
- W2964648167 sameAs 2964648167 @default.
- W2964648167 citedByCount "10" @default.
- W2964648167 countsByYear W29646481672017 @default.
- W2964648167 countsByYear W29646481672018 @default.
- W2964648167 countsByYear W29646481672019 @default.
- W2964648167 countsByYear W29646481672020 @default.
- W2964648167 countsByYear W29646481672021 @default.
- W2964648167 countsByYear W29646481672022 @default.
- W2964648167 crossrefType "journal-article" @default.
- W2964648167 hasAuthorship W2964648167A5041495909 @default.
- W2964648167 hasAuthorship W2964648167A5042874172 @default.
- W2964648167 hasAuthorship W2964648167A5067214173 @default.
- W2964648167 hasBestOaLocation W29646481671 @default.
- W2964648167 hasConcept C104317684 @default.
- W2964648167 hasConcept C111919701 @default.
- W2964648167 hasConcept C124101348 @default.
- W2964648167 hasConcept C13280743 @default.
- W2964648167 hasConcept C152124472 @default.
- W2964648167 hasConcept C185798385 @default.
- W2964648167 hasConcept C202264299 @default.
- W2964648167 hasConcept C205649164 @default.
- W2964648167 hasConcept C20901353 @default.
- W2964648167 hasConcept C23123220 @default.
- W2964648167 hasConcept C41008148 @default.
- W2964648167 hasConcept C54355233 @default.
- W2964648167 hasConcept C60644358 @default.
- W2964648167 hasConcept C77088390 @default.
- W2964648167 hasConcept C86803240 @default.
- W2964648167 hasConceptScore W2964648167C104317684 @default.
- W2964648167 hasConceptScore W2964648167C111919701 @default.
- W2964648167 hasConceptScore W2964648167C124101348 @default.
- W2964648167 hasConceptScore W2964648167C13280743 @default.
- W2964648167 hasConceptScore W2964648167C152124472 @default.
- W2964648167 hasConceptScore W2964648167C185798385 @default.
- W2964648167 hasConceptScore W2964648167C202264299 @default.
- W2964648167 hasConceptScore W2964648167C205649164 @default.
- W2964648167 hasConceptScore W2964648167C20901353 @default.
- W2964648167 hasConceptScore W2964648167C23123220 @default.
- W2964648167 hasConceptScore W2964648167C41008148 @default.
- W2964648167 hasConceptScore W2964648167C54355233 @default.
- W2964648167 hasConceptScore W2964648167C60644358 @default.
- W2964648167 hasConceptScore W2964648167C77088390 @default.
- W2964648167 hasConceptScore W2964648167C86803240 @default.
- W2964648167 hasFunder F4320334704 @default.
- W2964648167 hasLocation W29646481671 @default.
- W2964648167 hasLocation W29646481672 @default.
- W2964648167 hasLocation W29646481673 @default.
- W2964648167 hasOpenAccess W2964648167 @default.
- W2964648167 hasPrimaryLocation W29646481671 @default.
- W2964648167 hasRelatedWork W1988556220 @default.
- W2964648167 hasRelatedWork W2099230441 @default.
- W2964648167 hasRelatedWork W2100965251 @default.