Matches in SemOpenAlex for { <https://semopenalex.org/work/W1592090717> ?p ?o ?g. }
- W1592090717 startingPage "25" @default.
- W1592090717 abstract "Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main memory buffer and requires at most a single database scan. Data resolution is preserved to the extent possible based upon the size of the main memory buffer and the fit of the current clustering model to the data. We extend the method to efficiently update multiple models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based approaches – the straightforward alternatives to “scaling” traditional in-memory implementations to large databases.

1 Preliminaries and Motivation

Data clustering is important in many fields, including data mining [FPSU96], statistical data analysis [KR89,BR93], compression [ZRL97], and vector quantization [DH73].
Applications include data analysis and modeling [FDW97,FHS96], image segmentation, marketing, fraud detection, predictive modeling, data summarization, general data reporting tasks, data cleaning, and exploratory data analysis [B*96]. Clustering is a crucial data mining step, and performing this task over large databases is essential.

Scaling EM Clustering to Large Databases — Bradley, Fayyad, and Reina

A general view of clustering places it in the framework of density estimation [S86, S92, A73]. Clustering can be viewed as identifying the dense regions of the data source. An efficient representation of the probability density function is the mixture model, which asserts that the data is a combination of k individual component densities, corresponding to the k clusters. Basically, the problem is this: given data records (observations), identify a set of k populations in the data, and provide a model (density distribution) of each of the populations. Since the model assumes a mixture of populations, it is often referred to as a mixture model. The Expectation-Maximization (EM) algorithm [DLR77, CS96] is an effective and popular technique for estimating the mixture model parameters, or fitting the model to the database. The EM algorithm iteratively refines an initial cluster model to better fit the data and terminates at a solution which is locally optimal or a saddle point of the underlying clustering criterion [DLR77, B95]. The objective function is the log-likelihood of the data given the model, which measures how well the probabilistic model fits the data. Other similar iterative refinement clustering methods include the popular K-Means-type algorithms [M67,DH73,F90,BMS97,SI84]. While these approaches have received attention in the database and data mining literature [NH94,ZRL97,BFR98], they are limited in their ability to compute correct statistical models of the data.
The K-Means algorithm minimizes the sum of squared Euclidean distances between the data records in a cluster and the cluster’s mean vector. This assignment criterion implicitly assumes that clusters are represented by spherical Gaussian distributions located at the k cluster means [BB95, B95]. Since the K-Means algorithm utilizes the Euclidean metric, it does not generalize to the problem of clustering discrete or categorical data. The K-Means algorithm also uses a membership function which assigns each data record to exactly one cluster. This hard assignment does not allow for uncertainty in the membership of a data record in a cluster. The mixture model framework relaxes these assumptions. Due to the probabilistic nature of the mixture model, arbitrarily shaped (i.e., non-spherical) clusters can be effectively represented by the choice of suitable component density functions (e.g., Poisson, non-spherical Gaussians). Categorical or discrete data is similarly handled by associating discrete distributions with these attributes (e.g., Multinomial, Binomial). Consider a simple example with data consisting of 2 attributes: age and income. One may choose to model the data as a single cluster and report that the average age over the data records is 41 years and the average income is $26K/year (with associated variances). However, this may be rather deceptive and uninformative. The data may be a mixture of working people, retired people, and children. A more informative summary might identify these subsets or clusters, and report the cluster parameters. Such results are shown in Table 1.1:

Table 1.1: Sample data summary by segment
“name” (not given) | Size | Average Age | Average Income
“working”          | 45%  | 38          | $45K
“retired”          | 30%  | 72          | $20K
“children”         | 20%  | 12          | $0" @default.
- W1592090717 created "2016-06-24" @default.
- W1592090717 creator A5025248878 @default.
- W1592090717 creator A5079602288 @default.
- W1592090717 creator A5090941829 @default.
- W1592090717 date "1998-11-01" @default.
- W1592090717 modified "2023-09-27" @default.
- W1592090717 title "Scaling EM (Expectation Maximization) Clustering to Large Databases" @default.
- W1592090717 cites W147860157 @default.
- W1592090717 cites W1493454437 @default.
- W1592090717 cites W1524704912 @default.
- W1592090717 cites W1554663460 @default.
- W1592090717 cites W1569279788 @default.
- W1592090717 cites W1575476631 @default.
- W1592090717 cites W1582937231 @default.
- W1592090717 cites W1602329118 @default.
- W1592090717 cites W1611682757 @default.
- W1592090717 cites W1746680969 @default.
- W1592090717 cites W1969357402 @default.
- W1592090717 cites W1977496278 @default.
- W1592090717 cites W1981038597 @default.
- W1592090717 cites W2041674806 @default.
- W1592090717 cites W204885769 @default.
- W1592090717 cites W2049633694 @default.
- W1592090717 cites W2068289711 @default.
- W1592090717 cites W2073308541 @default.
- W1592090717 cites W2073849744 @default.
- W1592090717 cites W2082503527 @default.
- W1592090717 cites W2105535594 @default.
- W1592090717 cites W2115665694 @default.
- W1592090717 cites W2118587067 @default.
- W1592090717 cites W2126751256 @default.
- W1592090717 cites W2127218421 @default.
- W1592090717 cites W2129905273 @default.
- W1592090717 cites W2131687179 @default.
- W1592090717 cites W2135346934 @default.
- W1592090717 cites W2138745909 @default.
- W1592090717 cites W2141245797 @default.
- W1592090717 cites W2148141518 @default.
- W1592090717 cites W2567948266 @default.
- W1592090717 cites W2914866334 @default.
- W1592090717 cites W3017143921 @default.
- W1592090717 cites W370143576 @default.
- W1592090717 hasPublicationYear "1998" @default.
- W1592090717 type Work @default.
- W1592090717 sameAs 1592090717 @default.
- W1592090717 citedByCount "67" @default.
- W1592090717 countsByYear W15920907172012 @default.
- W1592090717 countsByYear W15920907172013 @default.
- W1592090717 countsByYear W15920907172014 @default.
- W1592090717 countsByYear W15920907172015 @default.
- W1592090717 countsByYear W15920907172016 @default.
- W1592090717 countsByYear W15920907172017 @default.
- W1592090717 countsByYear W15920907172019 @default.
- W1592090717 countsByYear W15920907172020 @default.
- W1592090717 crossrefType "journal-article" @default.
- W1592090717 hasAuthorship W1592090717A5025248878 @default.
- W1592090717 hasAuthorship W1592090717A5079602288 @default.
- W1592090717 hasAuthorship W1592090717A5090941829 @default.
- W1592090717 hasConcept C104047586 @default.
- W1592090717 hasConcept C11413529 @default.
- W1592090717 hasConcept C124101348 @default.
- W1592090717 hasConcept C154945302 @default.
- W1592090717 hasConcept C193143536 @default.
- W1592090717 hasConcept C27964816 @default.
- W1592090717 hasConcept C33704608 @default.
- W1592090717 hasConcept C41008148 @default.
- W1592090717 hasConcept C48044578 @default.
- W1592090717 hasConcept C73555534 @default.
- W1592090717 hasConcept C77088390 @default.
- W1592090717 hasConcept C94641424 @default.
- W1592090717 hasConceptScore W1592090717C104047586 @default.
- W1592090717 hasConceptScore W1592090717C11413529 @default.
- W1592090717 hasConceptScore W1592090717C124101348 @default.
- W1592090717 hasConceptScore W1592090717C154945302 @default.
- W1592090717 hasConceptScore W1592090717C193143536 @default.
- W1592090717 hasConceptScore W1592090717C27964816 @default.
- W1592090717 hasConceptScore W1592090717C33704608 @default.
- W1592090717 hasConceptScore W1592090717C41008148 @default.
- W1592090717 hasConceptScore W1592090717C48044578 @default.
- W1592090717 hasConceptScore W1592090717C73555534 @default.
- W1592090717 hasConceptScore W1592090717C77088390 @default.
- W1592090717 hasConceptScore W1592090717C94641424 @default.
- W1592090717 hasLocation W15920907171 @default.
- W1592090717 hasOpenAccess W1592090717 @default.
- W1592090717 hasPrimaryLocation W15920907171 @default.
- W1592090717 hasRelatedWork W147860157 @default.
- W1592090717 hasRelatedWork W1575476631 @default.
- W1592090717 hasRelatedWork W1634005169 @default.
- W1592090717 hasRelatedWork W1673310716 @default.
- W1592090717 hasRelatedWork W1971784203 @default.
- W1592090717 hasRelatedWork W1977496278 @default.
- W1592090717 hasRelatedWork W1992419399 @default.
- W1592090717 hasRelatedWork W1999668761 @default.
- W1592090717 hasRelatedWork W2049633694 @default.
- W1592090717 hasRelatedWork W2095897464 @default.
- W1592090717 hasRelatedWork W2117853077 @default.
- W1592090717 hasRelatedWork W2121328882 @default.
- W1592090717 hasRelatedWork W2127218421 @default.
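
The contrast the abstract draws between K-Means-style hard assignment and EM's soft membership can be illustrated with a minimal sketch. This is a 1-D, two-component Gaussian-mixture EM; the function name, the min/max initialization, and all parameters are illustrative assumptions, not the paper's scalable implementation:

```python
import math

def em_gmm_1d(data, iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture.

    Unlike K-Means' hard assignment, the E-step gives every record a
    soft membership probability (responsibility) in each cluster; the
    M-step re-estimates weights, means, and variances from those
    responsibilities.
    """
    # Illustrative initialization: spread the two means across the data range.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each record.
        resp = []
        for x in data:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update parameters from responsibility-weighted statistics.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(1e-6, sum(r[j] * (x - means[j]) ** 2
                                         for r, x in zip(resp, data)) / nj)
    return weights, means, variances
```

Note that each full pass over `data` inside the loop corresponds to one database scan; the abstract's point is that a naive implementation repeats this scan every iteration, which the single-scan decomposition avoids.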
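
The abstract's claim that the method rests on "a decomposition of the basic statistics the algorithm needs" can be made concrete: a compressible region of 1-D records can be replaced by the triple (count, sum, sum of squares), from which the mean and variance that a full pass over the raw records would produce are recovered exactly. The function below is an illustrative sketch under that reading, not the paper's implementation:

```python
def merge_sufficient_stats(regions):
    """Recover mean and variance from compressed regions.

    Each region is summarized as (n, s, ss): record count, sum, and
    sum of squares. Because these statistics are additive, compressed
    regions can stand in for their raw records when re-estimating
    model parameters, without rescanning the database.
    """
    n = sum(r[0] for r in regions)
    s = sum(r[1] for r in regions)
    ss = sum(r[2] for r in regions)
    mean = s / n
    variance = ss / n - mean ** 2  # population variance: E[x^2] - E[x]^2
    return mean, variance
```

For example, the records [1, 2] and [3, 4] compress to (2, 3, 5) and (2, 7, 25); merging yields mean 2.5 and variance 1.25, identical to a direct computation over all four records.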