Matches in SemOpenAlex for { <https://semopenalex.org/work/W3033682799> ?p ?o ?g. }
- W3033682799 endingPage "e18910" @default.
- W3033682799 startingPage "e18910" @default.
- W3033682799 abstract "Background The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. Methods A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. Results A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. Conclusions The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making." @default.
- W3033682799 created "2020-06-12" @default.
- W3033682799 creator A5034994292 @default.
- W3033682799 creator A5038610271 @default.
- W3033682799 creator A5058929336 @default.
- W3033682799 creator A5063401749 @default.
- W3033682799 creator A5067279312 @default.
- W3033682799 creator A5077309331 @default.
- W3033682799 date "2020-07-20" @default.
- W3033682799 modified "2023-10-11" @default.
- W3033682799 title "Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing" @default.
- W3033682799 cites W1508572302 @default.
- W3033682799 cites W1525535529 @default.
- W3033682799 cites W2003559619 @default.
- W3033682799 cites W2003816899 @default.
- W3033682799 cites W2032105080 @default.
- W3033682799 cites W2033609349 @default.
- W3033682799 cites W2050032749 @default.
- W3033682799 cites W2109651326 @default.
- W3033682799 cites W2111596427 @default.
- W3033682799 cites W2133707138 @default.
- W3033682799 cites W2134282933 @default.
- W3033682799 cites W2544063074 @default.
- W3033682799 cites W2622579191 @default.
- W3033682799 cites W2742634874 @default.
- W3033682799 cites W2751687090 @default.
- W3033682799 cites W2759848381 @default.
- W3033682799 cites W2803236744 @default.
- W3033682799 cites W2803437104 @default.
- W3033682799 cites W2809070429 @default.
- W3033682799 cites W2885057386 @default.
- W3033682799 cites W2889854355 @default.
- W3033682799 cites W2925023100 @default.
- W3033682799 cites W2943425564 @default.
- W3033682799 cites W2953532875 @default.
- W3033682799 cites W2964400072 @default.
- W3033682799 cites W298769045 @default.
- W3033682799 cites W2991368564 @default.
- W3033682799 cites W2996889063 @default.
- W3033682799 cites W3023773730 @default.
- W3033682799 cites W3123449715 @default.
- W3033682799 cites W3126097589 @default.
- W3033682799 cites W4255574744 @default.
- W3033682799 cites W576348434 @default.
- W3033682799 doi "https://doi.org/10.2196/18910" @default.
- W3033682799 hasPubMedCentralId "https://www.ncbi.nlm.nih.gov/pmc/articles/7400044" @default.
- W3033682799 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/32501278" @default.
- W3033682799 hasPublicationYear "2020" @default.
- W3033682799 type Work @default.
- W3033682799 sameAs 3033682799 @default.
- W3033682799 citedByCount "59" @default.
- W3033682799 countsByYear W30336827992020 @default.
- W3033682799 countsByYear W30336827992021 @default.
- W3033682799 countsByYear W30336827992022 @default.
- W3033682799 countsByYear W30336827992023 @default.
- W3033682799 crossrefType "journal-article" @default.
- W3033682799 hasAuthorship W3033682799A5034994292 @default.
- W3033682799 hasAuthorship W3033682799A5038610271 @default.
- W3033682799 hasAuthorship W3033682799A5058929336 @default.
- W3033682799 hasAuthorship W3033682799A5063401749 @default.
- W3033682799 hasAuthorship W3033682799A5067279312 @default.
- W3033682799 hasAuthorship W3033682799A5077309331 @default.
- W3033682799 hasBestOaLocation W30336827991 @default.
- W3033682799 hasConcept C105795698 @default.
- W3033682799 hasConcept C119857082 @default.
- W3033682799 hasConcept C12267149 @default.
- W3033682799 hasConcept C124101348 @default.
- W3033682799 hasConcept C154945302 @default.
- W3033682799 hasConcept C160920958 @default.
- W3033682799 hasConcept C169258074 @default.
- W3033682799 hasConcept C27158222 @default.
- W3033682799 hasConcept C33923547 @default.
- W3033682799 hasConcept C41008148 @default.
- W3033682799 hasConcept C84525736 @default.
- W3033682799 hasConceptScore W3033682799C105795698 @default.
- W3033682799 hasConceptScore W3033682799C119857082 @default.
- W3033682799 hasConceptScore W3033682799C12267149 @default.
- W3033682799 hasConceptScore W3033682799C124101348 @default.
- W3033682799 hasConceptScore W3033682799C154945302 @default.
- W3033682799 hasConceptScore W3033682799C160920958 @default.
- W3033682799 hasConceptScore W3033682799C169258074 @default.
- W3033682799 hasConceptScore W3033682799C27158222 @default.
- W3033682799 hasConceptScore W3033682799C33923547 @default.
- W3033682799 hasConceptScore W3033682799C41008148 @default.
- W3033682799 hasConceptScore W3033682799C84525736 @default.
- W3033682799 hasIssue "7" @default.
- W3033682799 hasLocation W30336827991 @default.
- W3033682799 hasLocation W30336827992 @default.
- W3033682799 hasLocation W30336827993 @default.
- W3033682799 hasLocation W30336827994 @default.
- W3033682799 hasLocation W30336827995 @default.
- W3033682799 hasOpenAccess W3033682799 @default.
- W3033682799 hasPrimaryLocation W30336827991 @default.
- W3033682799 hasRelatedWork W2004826645 @default.
- W3033682799 hasRelatedWork W2955796858 @default.
- W3033682799 hasRelatedWork W3135818052 @default.
- W3033682799 hasRelatedWork W4200112873 @default.
- W3033682799 hasRelatedWork W4224922629 @default.
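
The abstract above quotes several winning-classifier rates both as a percentage and as a fraction of the 19 datasets. As a minimal sketch for checking that the two forms agree, the Python below recomputes each percentage from its fraction. The function name `percent` and the dictionary labels are illustrative assumptions and do not appear in the record; only the fractions and percentages come from the abstract.

```python
def percent(numerator: int, denominator: int) -> int:
    """Return numerator/denominator as a percentage rounded to the nearest integer."""
    return round(100 * numerator / denominator)


# (fraction, percentage) pairs as quoted in the abstract; the keys are
# hypothetical labels chosen for this sketch only.
reported = {
    "same winner, CART and parametric": ((5, 19), 26),
    "same winner, Bayesian network": ((4, 19), 21),
    "tree-based wins on real data": ((18, 19), 95),
    "no tree models, CART": ((14, 19), 74),
    "no tree models, parametric": ((10, 19), 53),
    "no tree models, Bayesian network": ((13, 19), 68),
}

# Collect any label whose recomputed percentage differs from the quoted one.
mismatches = [
    label
    for label, ((num, den), quoted) in reported.items()
    if percent(num, den) != quoted
]

print("mismatches:", mismatches)
```

An empty list means every quoted percentage matches its fraction after rounding to the nearest whole number.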