Matches in SemOpenAlex for { <https://semopenalex.org/work/W4383989078> ?p ?o ?g. }
- W4383989078 abstract "Materials datasets usually contain many redundant (highly similar) materials due to the tinkering material design practice over the history of materials research. For example, the Materials Project database has many perovskite cubic structure materials similar to SrTiO$_3$. This sample redundancy within the dataset causes random splitting in machine learning model evaluation to fail, so that ML models tend to achieve overestimated predictive performance, which is misleading for the materials science community. This issue is well known in the field of bioinformatics for protein function prediction, in which a redundancy reduction procedure (CD-Hit) is always applied to reduce the sample redundancy by ensuring no pair of samples has a sequence similarity greater than a given threshold. This paper surveys the overestimated ML performance in the literature for both composition-based and structure-based material property prediction. We then propose a material dataset redundancy reduction algorithm called MD-HIT and evaluate it with several composition- and structure-based distance thresholds for reducing dataset sample redundancy. We show that with this control, the prediction performance tends to better reflect the models' true prediction capability. Our MD-HIT code can be freely accessed at https://github.com/usccolumbia/MD-HIT" @default.
- W4383989078 created "2023-07-12" @default.
- W4383989078 creator A5029265145 @default.
- W4383989078 creator A5048652419 @default.
- W4383989078 creator A5060537711 @default.
- W4383989078 creator A5088863045 @default.
- W4383989078 date "2023-07-10" @default.
- W4383989078 modified "2023-09-23" @default.
- W4383989078 title "MD-HIT: Machine learning for materials property prediction with dataset redundancy control" @default.
- W4383989078 doi "https://doi.org/10.48550/arxiv.2307.04351" @default.
- W4383989078 hasPublicationYear "2023" @default.
- W4383989078 type Work @default.
- W4383989078 citedByCount "0" @default.
- W4383989078 crossrefType "posted-content" @default.
- W4383989078 hasAuthorship W4383989078A5029265145 @default.
- W4383989078 hasAuthorship W4383989078A5048652419 @default.
- W4383989078 hasAuthorship W4383989078A5060537711 @default.
- W4383989078 hasAuthorship W4383989078A5088863045 @default.
- W4383989078 hasBestOaLocation W43839890781 @default.
- W4383989078 hasConcept C111919701 @default.
- W4383989078 hasConcept C11413529 @default.
- W4383989078 hasConcept C119857082 @default.
- W4383989078 hasConcept C124101348 @default.
- W4383989078 hasConcept C152124472 @default.
- W4383989078 hasConcept C154945302 @default.
- W4383989078 hasConcept C41008148 @default.
- W4383989078 hasConceptScore W4383989078C111919701 @default.
- W4383989078 hasConceptScore W4383989078C11413529 @default.
- W4383989078 hasConceptScore W4383989078C119857082 @default.
- W4383989078 hasConceptScore W4383989078C124101348 @default.
- W4383989078 hasConceptScore W4383989078C152124472 @default.
- W4383989078 hasConceptScore W4383989078C154945302 @default.
- W4383989078 hasConceptScore W4383989078C41008148 @default.
- W4383989078 hasLocation W43839890781 @default.
- W4383989078 hasOpenAccess W4383989078 @default.
- W4383989078 hasPrimaryLocation W43839890781 @default.
- W4383989078 hasRelatedWork W1538624230 @default.
- W4383989078 hasRelatedWork W2961085424 @default.
- W4383989078 hasRelatedWork W3046775127 @default.
- W4383989078 hasRelatedWork W4225307033 @default.
- W4383989078 hasRelatedWork W4285260836 @default.
- W4383989078 hasRelatedWork W4286629047 @default.
- W4383989078 hasRelatedWork W4306321456 @default.
- W4383989078 hasRelatedWork W4306674287 @default.
- W4383989078 hasRelatedWork W2795025438 @default.
- W4383989078 hasRelatedWork W4224009465 @default.
- W4383989078 isParatext "false" @default.
- W4383989078 isRetracted "false" @default.
- W4383989078 workType "article" @default.
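
The abstract describes redundancy reduction as ensuring that no pair of retained samples is more similar than a given threshold. The following is a minimal sketch of that general idea, written as a greedy distance-based filter. It is not the MD-HIT implementation: the function names `euclidean` and `greedy_redundancy_filter`, the use of Euclidean distance on plain feature vectors, and the example data are all assumptions made for illustration.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_redundancy_filter(features, threshold):
    """Return indices of samples kept by a greedy redundancy filter.

    A sample is kept only if its distance to every previously kept
    sample is at least `threshold`, so no two kept samples are closer
    than the threshold. (Hypothetical helper, not the MD-HIT API.)
    """
    kept = []
    for idx, vec in enumerate(features):
        if all(euclidean(vec, features[k]) >= threshold for k in kept):
            kept.append(idx)
    return kept


# Illustrative data: two tight clusters plus one isolated point.
samples = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (3.0, 3.0)]
print(greedy_redundancy_filter(samples, threshold=0.5))
```

A greedy pass like this depends on sample order: the first member of each cluster is the one retained. A larger threshold removes more samples and yields a less redundant, but smaller, dataset.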