Matches in SemOpenAlex for { <https://semopenalex.org/work/W4362697212> ?p ?o ?g. }
Showing items 1 to 62 of 62 with 100 items per page.
- W4362697212 endingPage "1106" @default.
- W4362697212 startingPage "1102" @default.
- W4362697212 abstract "Dr. Anderson-Cook and Lu's paper on the synergies between big data (BD) and designed data collection (DDC) is an outstanding synthesis of how multiple data collection strategies can benefit from one another throughout the data lifecycle.1 The paper's 11 opportunities show how all phases of data collection (planning, during collection, and after collection) can benefit from thinking about the design of the data and what questions it will be used to answer. A fundamental truth that appears throughout the paper, but deserves calling out, is that the reason we collect data fundamentally is to inform decision-making, and that decision starts with a problem that needs to be solved. I especially appreciate how the paper is structured to strategies that can be applied before BD, during actual BD, and after the collection of BD. This construct highlights the importance of a data lifecycle. It also makes the point that there is never a wrong time in the data lifecycle to re-evaluate if you have the right data to address the problem that needs to be solved. Another point that resonates with me in this article, is that data science in the era of BD is really a team problem. Anderson-Cook and Lu emphasize expert knowledge of the process under study (what problem we are solving), statistical expertise in DDC, and engineering problem-solving skills are all needed. Based on my experience, the team is even bigger, expertise is needed in computer science, databases, computer and network architectures, data pipelines, data security, networks security, and modelers of all types (statistical, machine learning, deep learning, computational, etc.), and data visualization are all important members of the team. The exact makeup of the team ties back to the problem that needs to be solved and who it needs to be communicated to. Finally, I would like to propose that the case for DDC is even stronger in the era of BD, and it goes beyond just synergies between DDC and BD. BD has significant cost, security, and other negative aspects associated with it to include the challenge of finding the signal in the noise of messy data, the cost of storing and disseminating BD, and answering the question of when can data be deleted. I am sure many other statisticians and data scientists will resonate with a frequent problem that I encounter in my work, where data has been collected or recovered, often at a significant cost, and the conjecture is made, “we have all of this data, there must be some key insights in there,” or, “we have all of this data collected to answer problem Y, surely you can repurpose it to solve problem Z.” I fondly refer to this scenario as the BD rabbit hole, and I have been down the BD rabbit hole numerous times in my career. So far without fail, the end result is typically an exhausted analyst and a disappointed decision-maker, because the data was never designed to answer the problem they are trying to address. I concur with Anderson-Cook and Lu that neither BD nor DDC can replace the other, but I propose two additional lenses to view the DDC and BD discussion through that I believe focus the conversation to not the synergies between DDC and BD, but rather, how DDC should always be used in support of any BD collection effort. These two additional concepts that bear discussion include information quality and the cost of BD. 
Two separate but related lines of work have emphasized the concept that not all data is equal in its ability to answer questions and that more attention should be placed on the quality of data collected and retained. Wilkinson et al. defined the four guiding principles for scientific data: findable, accessible, interoperable, reusable (FAIR), within the context of enabling reuse of scientific data in scholarly studies.2 Others have expanded on the principles to include concepts such as security, trustworthiness, and data pedigree. The Department of Defense (DoD), where I focus many of my research efforts, has adopted a Data Strategy that provides a set of data quality goals, requiring all DoD data to be visible, accessible, understandable, linked, trustworthy, interoperable, and secure (VAULTIS).3 All of these guiding principles are useful in thinking about the synergies between BD and DDC, in that DDC can be structured to increase the likelihood that the principles are achievable. Another line of research has developed the concept of information quality. This line of research arguably dates back to around 1990, with a heavy emphasis emerging in the late 1990s.4 Over time, with the increased collection of BD, a shift has occurred from the concept of data quality to information quality.5 Kenett and Shmueli define information quality as “the potential of a data set to achieve a specific goal by using a given empirical analysis method.”6 Two different views on information quality that correlate strongly with FAIR and VAULTIS are shown in Table 1. There are also numerous frameworks for assessing information quality. Lee et al. developed a validated scale approach including 65 information quality assessment items.8 Kenett and Shmueli take a different approach and expand on the framework for information quality by mathematically linking the analysis goal to the data via a utility function (see the rendering following this abstract).6 A key takeaway from all of these dimensions of information quality is that the most important quality dimensions come back to the problem that the data is being collected to solve. The concept of information quality is important to consider when thinking about the benefits of DDC in the BD era, in that DDC efforts have the ability to purposely improve information quality and to focus on the quality aspects that matter most to the analysis goal. For example, in the network traffic streaming problem presented by Anderson-Cook and Lu, if the data is being collected to identify potentially malicious traffic in the network, then prioritizing information extracted from aggregated packet data such as NetFlow will provide more timely data than full packet capture (e.g., pcap) because of the decreased size of the data (this tradeoff is sketched in code following this abstract). However, timeliness might come at the cost of losing important information about the network traffic contained in full packets that might be useful in identifying malicious behavior. In a DDC these tradeoffs can be assessed and purposefully integrated into the process of data collection. Moreover, DDC provides the opportunity to leverage statistical approaches for increasing information quality. Kenett and Shmueli highlight randomization, blocking, replication, blinding, placebo treatments, and linking of the collection protocol to the dataset as strategies for increasing the information quality of DDCs.6 In addition to increasing the quality of data collected, designed data strategies provide an opportunity to ensure that adequate data, in terms of both quantity and quality, are maintained over time.
Data collection strategies should always tie back to the problem, and we should seek to collect and store the minimally sufficient data to answer the problems at hand. Notably, BD has significant costs associated with the collection, curation/processing, dissemination, use/analysis, storage, and disposition of the data, in both hardware and non-hardware domains. Local to me, one only needs to drive out the Dulles Toll Road in Virginia to see the number of data storage facilities and the power, space, and cost associated with storing data. A local article captured the data center footprint at 26 million square feet in May of 2021, with the potential to more than double in the future.9 Tallon addresses the trade space between value, risk, and cost of BD from a corporate perspective and emphasizes that data governance practices are key to striking the right balance.10 Figure 1 below shows the conceptual value of data over time as it pertains to informing decision-making. As more government organizations promote data strategies and leverage data in government decision-making, we need strategies to reduce data holdings when their value has dwindled. This is especially important in the government space, as there are competing policies and guidance that require data and record retention, for example, federal records management policies. Policies need to be developed on culling data to the minimal set required to answer questions over time. Importantly, statistical methods such as the efficient sampling and statistical process control that Anderson-Cook and Lu highlight provide useful tools for reducing the data stored over time as its value decreases. Additionally, techniques such as change point analysis, dimensionality reduction, and even the concept of a sufficient statistic provide context for what data is important to retain near-term and long-term (a minimal sufficient-statistic sketch follows this abstract). I expect these examples are just a small sample of an emerging body of work that will highlight the importance of designed data collection in the ML/AI space. I will conclude by thanking and congratulating Anderson-Cook and Lu for providing an insightful article that will continue to advance the discussion on the value of data, BD, and the importance of DDC. Dr. Laura Freeman is a Research Associate Professor of Statistics, dual hatted as the Deputy Director of the Virginia Tech National Security Institute and Assistant Dean for Research for the College of Science. Her research leverages experimental methods to bring together cyber-physical systems, data science, artificial intelligence (AI), and machine learning (ML) to address critical challenges in national security. She develops new methods for test and evaluation focusing on emerging system technology. Dr. Freeman has a BS in Aerospace Engineering, an MS in Statistics, and a PhD in Statistics, all from Virginia Tech. Her PhD research was on design and analysis of experiments for reliability data." @default.
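The utility framing referenced in the abstract can be written out compactly. The rendering below follows the InfoQ notation Kenett and Shmueli use in their published work (method f, data X, goal g, utility U); it is an editorial sketch of that formulation, not an equation quoted from the article under review.

```latex
\documentclass{article}
\begin{document}
% Sketch of the InfoQ formulation attributed to Kenett and Shmueli.
Let $g$ denote the analysis goal, $X$ the available data, $f$ the
empirical analysis method, and $U$ a utility function. Information
quality is then
\[
  \mathrm{InfoQ}(f, X, g) = U\bigl(f(X \mid g)\bigr),
\]
the utility of applying method $f$ to data $X$, conditional on goal
$g$. A designed data collection raises $\mathrm{InfoQ}$ by shaping
$X$ for the stated goal $g$ before any analysis $f$ is run.
\end{document}
```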
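The NetFlow-versus-pcap tradeoff in the abstract can likewise be sketched in a few lines of Python. The per-packet record layout (src, dst, proto, ts, size) is a hypothetical simplification: real NetFlow keys flows on the full 5-tuple and handles ports and flow timeouts, all omitted here.

```python
"""Minimal sketch of flow aggregation versus full packet capture.

Assumption (not from the source): packets arrive as dicts with
src, dst, proto, ts, and size fields; payloads are never stored.
"""
from collections import defaultdict


def aggregate_flows(packets):
    """Collapse per-packet records into NetFlow-like flow summaries."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0,
                                 "first_ts": None, "last_ts": None})
    for p in packets:
        key = (p["src"], p["dst"], p["proto"])
        f = flows[key]
        f["packets"] += 1
        f["bytes"] += p["size"]
        if f["first_ts"] is None:
            f["first_ts"] = p["ts"]
        f["last_ts"] = p["ts"]
    return dict(flows)


if __name__ == "__main__":
    pkts = [
        {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "ts": 0.0, "size": 1500},
        {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "ts": 0.1, "size": 40},
        {"src": "10.0.0.2", "dst": "10.0.0.9", "proto": "udp", "ts": 0.2, "size": 512},
    ]
    for key, summary in aggregate_flows(pkts).items():
        print(key, summary)
```

The summaries keep who-talked-to-whom, volume, and timing while discarding payloads, which is precisely the information loss the abstract says a DDC should weigh against timeliness.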
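Finally, the sufficient-statistic point about culling stored data can be illustrated with Welford's online algorithm: if the retained question only needs a mean and variance (say, under a normal model), the running triple (n, mean, M2) is sufficient and the raw stream can be deleted. The class below is an illustrative sketch under that assumption, not a method from the article.

```python
"""Sketch: retain a sufficient statistic, delete the raw data."""


class RunningMoments:
    """Welford's online algorithm for a numerically stable mean/variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one observation into (n, mean, m2); x need not be kept."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Unbiased sample variance recovered from the retained triple."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


if __name__ == "__main__":
    stats = RunningMoments()
    for x in [10.2, 11.5, 9.8, 10.9]:  # raw values can be discarded after this loop
        stats.update(x)
    print(f"n={stats.n} mean={stats.mean:.2f} var={stats.variance:.2f}")
```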
- W4362697212 created "2023-04-09" @default.
- W4362697212 creator A5059050900 @default.
- W4362697212 date "2023-04-07" @default.
- W4362697212 modified "2023-09-26" @default.
- W4362697212 title "Review: Is design data collection still relevant in the big data era? With extensions to machine learning" @default.
- W4362697212 cites W2024309672 @default.
- W4362697212 cites W2117825401 @default.
- W4362697212 cites W2135713994 @default.
- W4362697212 cites W2138752966 @default.
- W4362697212 cites W2302501749 @default.
- W4362697212 cites W3125555312 @default.
- W4362697212 cites W3195339943 @default.
- W4362697212 cites W3196973870 @default.
- W4362697212 cites W3210046735 @default.
- W4362697212 cites W4210763918 @default.
- W4362697212 cites W4282541642 @default.
- W4362697212 doi "https://doi.org/10.1002/qre.3341" @default.
- W4362697212 hasPublicationYear "2023" @default.
- W4362697212 type Work @default.
- W4362697212 citedByCount "0" @default.
- W4362697212 crossrefType "journal-article" @default.
- W4362697212 hasAuthorship W4362697212A5059050900 @default.
- W4362697212 hasBestOaLocation W43626972121 @default.
- W4362697212 hasConcept C105795698 @default.
- W4362697212 hasConcept C119857082 @default.
- W4362697212 hasConcept C124101348 @default.
- W4362697212 hasConcept C133462117 @default.
- W4362697212 hasConcept C154945302 @default.
- W4362697212 hasConcept C2522767166 @default.
- W4362697212 hasConcept C33923547 @default.
- W4362697212 hasConcept C41008148 @default.
- W4362697212 hasConcept C75684735 @default.
- W4362697212 hasConceptScore W4362697212C105795698 @default.
- W4362697212 hasConceptScore W4362697212C119857082 @default.
- W4362697212 hasConceptScore W4362697212C124101348 @default.
- W4362697212 hasConceptScore W4362697212C133462117 @default.
- W4362697212 hasConceptScore W4362697212C154945302 @default.
- W4362697212 hasConceptScore W4362697212C2522767166 @default.
- W4362697212 hasConceptScore W4362697212C33923547 @default.
- W4362697212 hasConceptScore W4362697212C41008148 @default.
- W4362697212 hasConceptScore W4362697212C75684735 @default.
- W4362697212 hasIssue "4" @default.
- W4362697212 hasLocation W43626972121 @default.
- W4362697212 hasOpenAccess W4362697212 @default.
- W4362697212 hasPrimaryLocation W43626972121 @default.
- W4362697212 hasRelatedWork W1039292361 @default.
- W4362697212 hasRelatedWork W2397053934 @default.
- W4362697212 hasRelatedWork W2617449561 @default.
- W4362697212 hasRelatedWork W2808989540 @default.
- W4362697212 hasRelatedWork W2944507549 @default.
- W4362697212 hasRelatedWork W2961085424 @default.
- W4362697212 hasRelatedWork W3014300295 @default.
- W4362697212 hasRelatedWork W4226104445 @default.
- W4362697212 hasRelatedWork W4306674287 @default.
- W4362697212 hasRelatedWork W4322629366 @default.
- W4362697212 hasVolume "39" @default.
- W4362697212 isParatext "false" @default.
- W4362697212 isRetracted "false" @default.
- W4362697212 workType "article" @default.