Matches in SemOpenAlex for { <https://semopenalex.org/work/W4360990650> ?p ?o ?g. }
- W4360990650 abstract "A central assumption of all machine learning is that the training data are an informative subset of the true distribution we want to learn. Yet, this assumption may be violated in practice. Recently, learning from the molecular structures of small molecules has moved into the focus of the machine learning community. Usually, those small molecules are of biological interest, such as metabolites or drugs. Applications include prediction of toxicity, ligand binding or retention time. We investigate how well certain large-scale datasets cover the space of all known biomolecular structures. Investigation of coverage requires a sensible distance measure between molecular structures. We use a well-known distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which agrees well with the chemical and biochemical intuition of similarity between compounds. Unfortunately, this computational problem is NP-hard, severely restricting the use of the corresponding distance measure in large-scale studies. We introduce an exact approach that combines Integer Linear Programming and intricate heuristic bounds to ensure efficient computations and dependable results. We find that several large-scale datasets frequently used in this domain of machine learning are far from a uniform coverage of known biomolecular structures. This severely confines the predictive power of models trained on this data. Next, we propose two further approaches to check if a training dataset differs substantially from the distribution of known biomolecular structures. On the positive side, our methods may allow creators of large-scale datasets to identify regions in molecular structure space where it is advisable to provide additional training data." @default.
- W4360990650 created "2023-03-30" @default.
- W4360990650 creator A5033025664 @default.
- W4360990650 creator A5070388747 @default.
- W4360990650 creator A5078497495 @default.
- W4360990650 creator A5086323965 @default.
- W4360990650 creator A5091319914 @default.
- W4360990650 date "2023-03-27" @default.
- W4360990650 modified "2023-09-30" @default.
- W4360990650 title "Small molecule machine learning: All models are wrong, some may not even be useful" @default.
- W4360990650 cites W1601495365 @default.
- W4360990650 cites W192266888 @default.
- W4360990650 cites W1963920197 @default.
- W4360990650 cites W1965555277 @default.
- W4360990650 cites W1982870104 @default.
- W4360990650 cites W1998056319 @default.
- W4360990650 cites W2000869402 @default.
- W4360990650 cites W2001431291 @default.
- W4360990650 cites W2006575776 @default.
- W4360990650 cites W2011451331 @default.
- W4360990650 cites W2012459404 @default.
- W4360990650 cites W2014436490 @default.
- W4360990650 cites W2023624916 @default.
- W4360990650 cites W2027266940 @default.
- W4360990650 cites W2044834685 @default.
- W4360990650 cites W2051658231 @default.
- W4360990650 cites W2059327215 @default.
- W4360990650 cites W2062556965 @default.
- W4360990650 cites W2064963922 @default.
- W4360990650 cites W2076498053 @default.
- W4360990650 cites W2080635178 @default.
- W4360990650 cites W2096725584 @default.
- W4360990650 cites W2103626206 @default.
- W4360990650 cites W2110256992 @default.
- W4360990650 cites W2137676811 @default.
- W4360990650 cites W2140611297 @default.
- W4360990650 cites W2145578524 @default.
- W4360990650 cites W2149342630 @default.
- W4360990650 cites W2150031663 @default.
- W4360990650 cites W2156077095 @default.
- W4360990650 cites W2160114756 @default.
- W4360990650 cites W2172024214 @default.
- W4360990650 cites W2175779775 @default.
- W4360990650 cites W2177317049 @default.
- W4360990650 cites W2179948434 @default.
- W4360990650 cites W2200017991 @default.
- W4360990650 cites W2276859037 @default.
- W4360990650 cites W2401610261 @default.
- W4360990650 cites W2406943157 @default.
- W4360990650 cites W2412446857 @default.
- W4360990650 cites W2461470610 @default.
- W4360990650 cites W2473190403 @default.
- W4360990650 cites W2504691963 @default.
- W4360990650 cites W2548357532 @default.
- W4360990650 cites W2558428302 @default.
- W4360990650 cites W2565684601 @default.
- W4360990650 cites W2594183968 @default.
- W4360990650 cites W2767683865 @default.
- W4360990650 cites W2900090807 @default.
- W4360990650 cites W2922522932 @default.
- W4360990650 cites W2944975820 @default.
- W4360990650 cites W2966357564 @default.
- W4360990650 cites W2996714860 @default.
- W4360990650 cites W2998720855 @default.
- W4360990650 cites W3012519883 @default.
- W4360990650 cites W3097280976 @default.
- W4360990650 cites W3108604517 @default.
- W4360990650 cites W3113296199 @default.
- W4360990650 cites W3118695441 @default.
- W4360990650 cites W3133965623 @default.
- W4360990650 cites W3135127269 @default.
- W4360990650 cites W3165300194 @default.
- W4360990650 cites W3185391990 @default.
- W4360990650 cites W3186118520 @default.
- W4360990650 cites W3193966860 @default.
- W4360990650 cites W3200707343 @default.
- W4360990650 cites W3206878019 @default.
- W4360990650 cites W3209805961 @default.
- W4360990650 cites W4210707438 @default.
- W4360990650 cites W4214868967 @default.
- W4360990650 cites W4230770774 @default.
- W4360990650 cites W4238781491 @default.
- W4360990650 cites W4246354968 @default.
- W4360990650 cites W4293068700 @default.
- W4360990650 cites W4306986298 @default.
- W4360990650 doi "https://doi.org/10.1101/2023.03.27.534311" @default.
- W4360990650 hasPublicationYear "2023" @default.
- W4360990650 type Work @default.
- W4360990650 citedByCount "0" @default.
- W4360990650 crossrefType "posted-content" @default.
- W4360990650 hasAuthorship W4360990650A5033025664 @default.
- W4360990650 hasAuthorship W4360990650A5070388747 @default.
- W4360990650 hasAuthorship W4360990650A5078497495 @default.
- W4360990650 hasAuthorship W4360990650A5086323965 @default.
- W4360990650 hasAuthorship W4360990650A5091319914 @default.
- W4360990650 hasBestOaLocation W43609906501 @default.
- W4360990650 hasConcept C111472728 @default.
- W4360990650 hasConcept C11413529 @default.
- W4360990650 hasConcept C119857082 @default.
- W4360990650 hasConcept C121332964 @default.