Matches in SemOpenAlex for { <https://semopenalex.org/work/W2052805042> ?p ?o ?g. }
Showing items 1 to 73 of 73, with 100 items per page.
- W2052805042 endingPage "112" @default.
- W2052805042 startingPage "110" @default.
- W2052805042 abstract "Big DataVol. 1, No. 2 Original ArticlesOpen AccessBring the Noise: Embracing Randomness Is the Key to Scaling Up Machine Learning AlgorithmsBrian DalessandroBrian DalessandroSearch for more papers by this authorPublished Online:18 Jun 2013https://doi.org/10.1089/big.2013.0010AboutSectionsPDF/EPUB Permissions & CitationsPermissionsDownload CitationsTrack CitationsAdd to favorites Back To Publication ShareShare onFacebookTwitterLinked InRedditEmail Despite the progress made in scaling up data processing and storage, major bottlenecks still exist when using big data for statistical inference and predictive modeling. Many classic algorithms used for predictive modeling with big data are constrained by the very math that defines them. These algorithms tend to have complexities that do not scale linearly, so doubling your data often means quadrupling the time it takes to learn models. Many of today's statistical modelers have been trained using the algorithms and software that have largely been optimized for a “small data” paradigm. New tools and new ways of approaching many machine learning problems are now called for in order to scale up industrial-sized applications. Fortunately, new theory, intuition, and tools are emerging to solve these problems. In this article, we introduce the works of Leon Bottou and John Langford (both coincidentally at Microsoft Research), two leading researchers in the field whose research missions have been to scale up machine learning. The work by Bottou, Langford, and their colleagues shows us that by adopting algorithmic strategies that classically would be considered inferior, we can paradoxically both scale up machine learning and end up with better predictive performance. From a predictive modeling perspective, more data certainly is better, but only if you can use it. The strategies presented here offer practitioners promising options to making that happen.Understand Your ErrorBottou's and Langford's2,4 research efforts are particularly applicable to the types of predictive modeling that are heavily utilized in industrial applications. The branch of machine learning that governs most predictive modeling methods is called supervised learning. In supervised learning, one uses the features (or variables) available to describe an entity as a basis to predict some other event, item, or feature of interest. Some classic and often-used supervised learning methods are linear and logistic regression, decision trees, and support vector machines. All supervised learning methods are characterized, and in some sense defined, by a loss function. The loss function measures how well the specific model predicts the target variable of interest. The most intuitive of such loss functions is accuracy, but others exist, and each one has specific properties that make it more or less suitable for certain problems over others.If we think of the loss function generally as the total amount of “error” in our predictions, we can actually decompose this error into different components, where each component is driven by certain decisions we make when training the model. One classic decomposition breaks the error down to approximation and estimation error. Estimation error comes from the fact that when building models, we often only have a finite sample of the underlying data at our disposal. If infinite data were available, we may be able to learn exactly how feature inputs should map to the prediction. 
In several articles, Bottou [1,2] has introduced a quantity called optimization error to the decomposition of a model's overall loss. The intuition behind optimization error is that some algorithms are better than others at finding the minimum of the loss function being optimized, and the error induced by using an inferior algorithm is the optimization error. An interesting question follows: why choose an algorithm that is expected to perform worse? The answer, and a central theme of Bottou's work, is that by deliberately choosing an inferior algorithm we are likely to get better results from a prediction point of view. The reason is that our design choices enable us to trade one type of error for another, and done correctly, we end up better off.

The star algorithm of Bottou's work is stochastic gradient descent (SGD). SGD comes from a family of optimization algorithms called gradient descent, and the more advanced (and complex) versions of the algorithm are used within the native functions of statistical packages such as R, SAS, and Matlab. As data scales up (particularly in the number of features), packaged statistical learning programs become extraordinarily slow. This puts a natural constraint on the amount of data that can be used, which in turn increases both approximation and estimation errors. SGD, on the other hand, is defined in such a way that it is perfectly amenable to larger datasets. It searches for the minimum (or maximum) one data point at a time and takes advantage of data sparsity when it exists. The reduced memory dependence and linear scaling with respect to feature dimensionality make it much more practical in big data settings. The “stochastic” qualifier comes from the fact that SGD approximates the gradient over the entire dataset one example at a time. The search for the solution is therefore more erratic, which is why the algorithm has classically been considered inferior. So by choosing SGD, one introduces more optimization error into the model, but being able to use more data reduces both estimation and approximation errors. If the data is big enough, the trade-off favors the modeler.

In Bottou's words:*

“Stochastic optimization algorithms were invented in the fifties and have been used very successfully in many fields such as adaptive signal processing. However, they were long regarded with suspicion in machine learning because they can be finicky and often perform poorly on small [data] problems. The last five years have been marked by the realization that these algorithms perform very well on large scale problems (which could seem counter-intuitive at first glance).”
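To ground the idea, the following is a minimal Python sketch of SGD for logistic regression over a stream of sparse examples. It is not the authors' implementation; the dict-based sparse representation, the decaying 1/sqrt(t) step size, and the toy data generator are illustrative assumptions. The point to notice is that each update reads one example and touches only the weights of the features present in it, which is why memory use and per-step cost stay small no matter how large the stream grows.

```python
import math
import random

def sgd_logistic(stream, n_features, lr=0.5, epochs=1):
    """Plain SGD for logistic regression over a stream of (sparse_x, y) pairs.

    sparse_x is a dict {feature_index: value}; y is 0 or 1.  Each update
    touches only the weights of features present in that one example,
    which is what makes the pass cheap on sparse data.
    """
    w = [0.0] * n_features
    t = 0
    for _ in range(epochs):
        for sparse_x, y in stream:
            t += 1
            step = lr / math.sqrt(t)                  # decaying step size (one common choice)
            z = sum(w[j] * v for j, v in sparse_x.items())
            p = 1.0 / (1.0 + math.exp(-z))            # predicted probability of y = 1
            g = p - y                                 # gradient of the log loss w.r.t. z
            for j, v in sparse_x.items():
                w[j] -= step * g * v                  # update only the active features
    return w

# Toy usage: 1,000 one-hot examples where features 0-24 signal the positive class.
random.seed(0)
data = []
for _ in range(1000):
    j = random.randrange(50)
    y = int(j < 25)
    if random.random() < 0.1:       # 10% label noise
        y = 1 - y
    data.append(({j: 1.0}, y))
weights = sgd_logistic(data, n_features=50, epochs=3)
```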
When Features Collide

Many breakthroughs happen by embracing what seems counterintuitive at first. Just as Bottou has shown that using an inferior optimization algorithm can actually improve predictive performance, John Langford and colleagues have introduced a dimensionality reduction technique that, despite its simplicity, has proven highly effective in big data contexts [3]. The method, called feature hashing, reduces feature dimensionality by randomly assigning each feature in the original space to a dimension in a lower-dimensional space. This is done by simply hashing the ID of the original feature and using the resulting hashed integer as the new feature ID. The works by Langford et al. give clearly written algorithms for implementing this and describe the contexts in which it should work best [3]. The technique creates a trade-off because, with high probability, multiple features will be hashed to the same new feature ID. Generally, all dimensionality reduction techniques degrade the data to some degree, but most have been designed to preserve the geometric qualities of the data as well as possible. Feature hashing, on the other hand, makes no effort to preserve data quality. In the right context, though, the cited work shows that the degradation induced by feature hashing is minimal, and its practical benefits more than outweigh the costs. Specifically, feature hashing scales linearly, uses very simple preprocessing, and preserves data sparsity when it exists. These properties make feature hashing the natural (or only) choice when the data scales to millions or billions of features.

The power of the feature hashing method is well illustrated in the seminal application given by Langford and colleagues [4]. E-mail spam filtering is a good example of a modern machine learning problem: (1) the data is large yet sparse, (2) the data is generated in a streaming process, (3) the solution needs to scale to real-time deployment, and (4) the solution benefits from mass customization. In the feature hashing work, words within e-mails are used as features, and mass customization is achieved by creating interaction terms between each user and each word. Such a data design creates a feature dimensionality proportional to the number of words times the total number of users. Even though this data matrix is extremely sparse, its size could create major problems in the production system that trains and deploys the model. Feature hashing reduces the dimensionality from about 1 billion to 1 million. This is still large, but not so large that the solution becomes intractable. The trade-off again is to degrade the data slightly (by inducing random collisions of the features) in exchange for the ability to use more features (i.e., the user-term interactions). From the point of view of error trade-offs, allowing a more complex and feature-rich model reduces the approximation error, and this reduction is more than enough to offset the error caused by the collisions.
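The core of the hashing trick fits in a few lines. Below is a minimal Python sketch of hashed user-word interaction features in the spirit of the spam-filtering example; the 2**20 bucket count, the CRC32 hash, and the feature-string format are illustrative assumptions rather than details taken from the cited papers (which, among other refinements, describe a signed variant that uses a second hash to reduce the bias introduced by collisions).

```python
import zlib
from collections import defaultdict

N_BUCKETS = 2 ** 20   # roughly 1 million hashed features, as in the spam example

def hash_features(raw_features, n_buckets=N_BUCKETS):
    """Map string feature IDs to a fixed-size sparse vector via hashing.

    raw_features is an iterable of (feature_id, value) pairs.  Collisions
    (two IDs landing in the same bucket) are allowed by design; that is
    the controlled degradation the article describes.
    """
    hashed = defaultdict(float)
    for feature_id, value in raw_features:
        bucket = zlib.crc32(feature_id.encode("utf-8")) % n_buckets
        hashed[bucket] += value
    return dict(hashed)

def spam_example(user_id, tokens):
    # Global word features plus per-user interaction features
    # (the mass-customization design from the spam-filtering application).
    raw = [("word=" + t, 1.0) for t in tokens]
    raw += [("user={}^word={}".format(user_id, t), 1.0) for t in tokens]
    return hash_features(raw)

x = spam_example("user_42", ["cheap", "pills", "click", "here"])
print(len(x), "active buckets out of", N_BUCKETS)
```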
Toward Scaling Up

Machine learning methods used in modern industrial applications need to be able to scale up in every aspect of their use. The three main functions in deployed machine learning applications are data collection/processing, model training, and model execution. Parallelized systems have enabled data processing and retention to happen at unprecedented scale, leaving the bottlenecks in the learning and deployment of predictive models. Fortunately, research groups that bridge the academic and industrial divide have been addressing these challenges directly. The type of research exemplified by Langford, Bottou, and their colleagues offers practical solutions to increasingly common problems in industrial machine learning applications. Their work is more than just mathematical theory; strong empirical results as well as increasing rates of adoption support the proposition that applications of machine learning are indeed scaling up. SGD enables the training of prediction models with minimal memory requirements, exploits data sparsity, and even supports a truly streaming approach to model training.

Feature hashing brings many benefits to an industrial production system. The reduction of a very large and potentially shifting feature space to a smaller, fixed feature space enables the use of more complex feature representations while keeping the memory constraints of the working system fixed. Additionally, feature hashing enables a more privacy-friendly way to store and use personal data. In both of these methods, the research suggests that what at face value might seem like a shortcut is actually an enhancement as the datasets become bigger and bigger. Nothing, of course, comes free (or at least without necessary caution). Both of these methods involve embracing a little chaos, and in doing so, one risks having chaos in the results. When using SGD and feature hashing, one ought to have a clear understanding of the underlying algorithms and proceed with careful experimentation and testing before deploying. Embraced methodically and properly, though, these methods are likely to become de rigueur tools in the big data machine learning arsenal.

Author Disclosure Statement

No competing financial interests exist.

References

1. Bottou, L., and Bousquet, O. The tradeoffs of large-scale learning. Optimization for Machine Learning 2011; 351.
2. Bottou, L. Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT’2010. Physica-Verlag HD, 2010; 177–186.
3. Shi, Q., Petterson, J., Dror, G., et al. Hash kernels for structured data. The Journal of Machine Learning Research 2009; 10:2615–2637.
4. Weinberger, K., Dasgupta, A., Langford, J., et al. Feature hashing for large scale multitask learning. Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.

* Direct correspondence with the author.
Copyright 2013, Mary Ann Liebert, Inc. Published under a Creative Commons license in Big Data, Volume 1, Issue 2, June 18, 2013, pp. 110–112. https://doi.org/10.1089/big.2013.0010" @default.
- W2052805042 created "2016-06-24" @default.
- W2052805042 creator A5012872167 @default.
- W2052805042 date "2013-06-01" @default.
- W2052805042 modified "2023-09-23" @default.
- W2052805042 title "Bring the Noise: Embracing Randomness Is the Key to Scaling Up Machine Learning Algorithms" @default.
- W2052805042 cites W114517082 @default.
- W2052805042 cites W2070996757 @default.
- W2052805042 cites W2113651538 @default.
- W2052805042 doi "https://doi.org/10.1089/big.2013.0010" @default.
- W2052805042 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/27442066" @default.
- W2052805042 hasPublicationYear "2013" @default.
- W2052805042 type Work @default.
- W2052805042 sameAs 2052805042 @default.
- W2052805042 citedByCount "16" @default.
- W2052805042 countsByYear W20528050422014 @default.
- W2052805042 countsByYear W20528050422015 @default.
- W2052805042 countsByYear W20528050422016 @default.
- W2052805042 countsByYear W20528050422017 @default.
- W2052805042 countsByYear W20528050422019 @default.
- W2052805042 countsByYear W20528050422020 @default.
- W2052805042 countsByYear W20528050422021 @default.
- W2052805042 countsByYear W20528050422023 @default.
- W2052805042 crossrefType "journal-article" @default.
- W2052805042 hasAuthorship W2052805042A5012872167 @default.
- W2052805042 hasConcept C105795698 @default.
- W2052805042 hasConcept C11413529 @default.
- W2052805042 hasConcept C115961682 @default.
- W2052805042 hasConcept C119857082 @default.
- W2052805042 hasConcept C125112378 @default.
- W2052805042 hasConcept C154945302 @default.
- W2052805042 hasConcept C2524010 @default.
- W2052805042 hasConcept C26517878 @default.
- W2052805042 hasConcept C33923547 @default.
- W2052805042 hasConcept C38652104 @default.
- W2052805042 hasConcept C41008148 @default.
- W2052805042 hasConcept C99498987 @default.
- W2052805042 hasConcept C99844830 @default.
- W2052805042 hasConceptScore W2052805042C105795698 @default.
- W2052805042 hasConceptScore W2052805042C11413529 @default.
- W2052805042 hasConceptScore W2052805042C115961682 @default.
- W2052805042 hasConceptScore W2052805042C119857082 @default.
- W2052805042 hasConceptScore W2052805042C125112378 @default.
- W2052805042 hasConceptScore W2052805042C154945302 @default.
- W2052805042 hasConceptScore W2052805042C2524010 @default.
- W2052805042 hasConceptScore W2052805042C26517878 @default.
- W2052805042 hasConceptScore W2052805042C33923547 @default.
- W2052805042 hasConceptScore W2052805042C38652104 @default.
- W2052805042 hasConceptScore W2052805042C41008148 @default.
- W2052805042 hasConceptScore W2052805042C99498987 @default.
- W2052805042 hasConceptScore W2052805042C99844830 @default.
- W2052805042 hasIssue "2" @default.
- W2052805042 hasLocation W20528050421 @default.
- W2052805042 hasLocation W20528050422 @default.
- W2052805042 hasOpenAccess W2052805042 @default.
- W2052805042 hasPrimaryLocation W20528050421 @default.
- W2052805042 hasRelatedWork W2157060681 @default.
- W2052805042 hasRelatedWork W2329452785 @default.
- W2052805042 hasRelatedWork W2356380379 @default.
- W2052805042 hasRelatedWork W2592098600 @default.
- W2052805042 hasRelatedWork W2961085424 @default.
- W2052805042 hasRelatedWork W3029113222 @default.
- W2052805042 hasRelatedWork W4225307033 @default.
- W2052805042 hasRelatedWork W4286629047 @default.
- W2052805042 hasRelatedWork W4306674287 @default.
- W2052805042 hasRelatedWork W4224009465 @default.
- W2052805042 hasVolume "1" @default.
- W2052805042 isParatext "false" @default.
- W2052805042 isRetracted "false" @default.
- W2052805042 magId "2052805042" @default.
- W2052805042 workType "article" @default.