Matches in SemOpenAlex for { <https://semopenalex.org/work/W2024887626> ?p ?o ?g. }
Showing items 1 to 77 of 77 with 100 items per page.
- W2024887626 endingPage "5" @default.
- W2024887626 startingPage "3" @default.
- W2024887626 abstract "Recent advances in high-throughput flow cytometry (FCM) technology require the theoretical development and the efficient computational implementation of new methods for automated identification of cell populations. Compared with manual gating methods, the current de facto gold standard, these automated methods are expected not only to be faster, but also to increase the reproducibility of data analysis pipelines. According to a very recent comprehensive survey of FCM data analysis methods by Bashashati and Brinkman (1), automated gating methods can be used to identify both known and unknown cell populations, with the latter including the case of subpopulations that cannot be easily identified using two-dimensional manual gating methods. To be able to properly perform the unsupervised automated gating of FCM data, general clustering methods need to fulfill several criteria, such as computational efficiency (to handle the commonly encountered very large data sets in a practically reasonable amount of time), robustness to the shape of the clusters (from spherical to concave cell populations such as “banana-shaped” populations) or the density of the clusters (from very sparse to very dense cell populations, depending on the type of gating to be performed), and ability to identify the (generally unknown) number of populations (1). The article by Aghaeepour et al. (2) published in this issue makes an important contribution to the field by introducing flowMeans, a fast automated gating method based on an extension of k-means clustering. It is important to note that the new method has been specifically developed to address several problematic aspects relating to the application of the k-means clustering to FCM data. 
These limitations include the identification of the number of populations (k), the sensitivity of the clustering results to the initial values, and implicit restriction to spherical cell populations, with the last limitation particularly relevant to FCM data. The flowMeans method solves the two problems of identifying k and dealing with concave cell populations by starting with a larger number of clusters (by using a “reasonable” upper bound for k) and merging them to allow multiple overlapping clusters to represent the same subpopulation. To go into the specific details of the method, flowMeans starts by estimating the number of modes for each one-dimensional projection of the FCM data, using an approach based on kernel density estimation as described by Duong et al. (3). The total number of modes across all dimensions is used as a maximum for k, given that this sum is an overestimate of the number of subpopulations from the multidimensional space. Because there are more clusters than needed, the resulting clusters have to be further merged to determine the number of subpopulations, and the process is iterated by alternating between calculating the distances between pairs of clusters and merging the closest pair of clusters. The underlying empirical principle is that keeping track of the minimum distance between pairs of clusters will allow us to recognize when we are moving from poorly separated clusters to well separated clusters. A segmented regression algorithm is used to determine k by detecting the change point in the distance between the merged clusters. Given that the flowMeans method is an extension of the k-means clustering specifically designed to deal with FCM data, we will provide further details about k-means and related methods, based on a recent comprehensive review on the topic (4). 
Although it was proposed about 55 years ago, the k-means algorithm is still one of the most popular clustering algorithms used today, due in large part to its simplicity. The algorithm involves the minimization of the sum (over all k clusters) of the squared distances between the points in the cluster and the cluster mean, hence the name k-means. To be able to perform the required minimization, the algorithm requires the advance specification of the number of clusters k, an initial partition into k clusters, and a distance measure. The use of the Euclidean distance generates spherical clusters, whereas the use of the Mahalanobis distance generates ellipsoidal clusters. It is important to note that ensemble clustering methods can be used as an alternative to specifying the number of clusters k and an initial partition into k clusters (5). Multiple data partitions are first obtained by changing the values of k and using several random partitions into k clusters. They are subsequently combined into a final clustering by using the co-occurrence matrix, i.e., the matrix that records the number of times two data points co-occur in the same cluster across the multiple data partitions. Notable extensions of the k-means algorithm include fuzzy c-means, bisecting k-means, X-means, k-medoids, and kernel k-means (4). The last algorithm is of particular importance to FCM data because it can detect clusters of arbitrary shape by being able to describe data distributions more complicated than the Gaussian distribution. Kernel k-means requires the choice of a kernel similarity function, and it performs clustering by maximizing within-cluster similarity (6). In parallel with advances in FCM technology, advances in other high-throughput technologies, such as DNA microarray technology, have also generated large amounts of high-dimensional data requiring automated bioinformatics methods to provide faster and more reproducible data analysis pipelines. 
Among the methods found to be successful in clustering these types of data are methods based on nonnegative matrix factorization (NMF). Although NMF has been primarily applied for unsupervised clustering in the area of image and natural language processing, NMF-based methods have also been used successfully for molecular pattern discovery, class comparison and prediction, cross-platform and cross-species analysis, functional characterization of genes, and biomedical informatics (see Ref. 7 for a recent review on the use of NMF in computational biology). As the name implies, NMF is a matrix factorization method applicable to matrices containing only nonnegative values. It was introduced as a parts-based learning paradigm by Lee and Seung (8). Given a nonnegative n × m matrix X, the original method introduced by Lee and Seung uses a multiplicative updates algorithm to find two nonnegative matrices, W and H, of dimensions n × k and k × m, respectively, with k < n, m, such that the matrix product W*H is close to X (with respect to a specified metric). If X is an n × m gene expression matrix consisting of observations on n genes from m samples, each column of W defines a “metagene,” and each column of H represents the metagene “expression pattern” of the corresponding sample (9). NMF can be used for clustering by exploiting the sparseness of the matrix H and using the expression patterns to assign observations to the k components. Further sparsity constraints on the matrix H within the NMF objective function (involving the comparison of X with W*H) reveal the natural clustering aspect of NMF because NMF is essentially equivalent to the k-means algorithm in the limiting case when there is only one nonzero entry in each column of H. 
Treating the objective function of the k-means method as the objective function of a lower-rank approximation with special constraints allows us to obtain the NMF formulation from the k-means formulation by relaxing some of the constraints. Related to the previous discussion regarding extensions of k-means, it is important to note that NMF clustering and kernel k-means clustering are closely related because they are different formulations of the same problem with slightly different constraints (10). As such, NMF-based clustering should be able to deal with arbitrarily shaped subpopulations as well. NMF-based clustering methods have performed better than k-means clustering methods when applied to both synthetic and real data (11). These experiments show that an extension of the original NMF method, namely sparse NMF, gives much better and more consistent clustering solutions than k-means clustering. A fast alternating nonnegative least squares algorithm was used to obtain NMF and sparse NMF for these comparisons. It is important to note that even without imposing additional sparsity constraints, NMF can still give competitive clustering results (11). Having reviewed k-means and NMF-based clustering methods, we are now in a position to suggest possible modifications of the flowMeans method. One proposal is to replace k-means clustering with sparse NMF-based clustering; the other one is to use robust versions of k-means clustering. To our knowledge, neither of these methods has been used before to cluster FCM data. Regarding the first suggestion, it should be noted that the smallest dimension of the data matrix, the minimum of n and m, provides an upper bound for the number of clusters that can be identified. For FCM data, because the number of cells is very large, this minimum is the number of cell characteristics. 
If the number of measured characteristics is larger than the expected number of populations (based on biological knowledge), NMF-based clustering methods can be used directly. If not, a two-stage approach may be considered instead. First, NMF can be used as a dimension reduction method to reduce the dimensionality of the data to two or three dimensions, and, then, recently developed nonparametric methods (12) can be used to cluster the resulting data without any restrictions on the shape of the cell populations. Although its current computational implementation is restricted to three-dimensional data, the curvHDR method is a nonparametric density-based approach that uses the concepts of high negative curvature region and high-density region to construct automated gates for FCM data (12). According to the previously referenced review by Bashashati and Brinkman (1), the identification and removal of outliers from subsequent analyses of FCM data is a crucial step in the data analysis pipeline. Here, the outliers are generically defined as observations that are different from the rest of the data, with cell debris, dead cells, and doublets often given as typical examples of outliers for FCM data. Because flowMeans is not designed to be robust to outliers, one possibility is to replace the k-means clustering method with robust variants. One such robust method is the trimmed k-means clustering method (13), where a known fraction of outliers is trimmed off, and the remaining observations are clustered into k groups. In the absence of a known percent of outliers in the data (a problem similar to the unknown number of clusters), several fractions of trimming can be tried, and the resulting data partitions can be combined using ensemble clustering methods. 
There is a definite need for the development of more robust methods that will allow the estimation of the fraction of outliers from the data (as opposed to being assumed known) in the absence of strong modeling assumptions. Although the proposed flowMeans method has been shown to be faster and competitive against several alternative methods for automated gating, the use of robust versions of k-means, such as trimmed k-means clustering (13), may provide the needed protection against the undue influence of outliers present in FCM data. Already proven to be a strong competitor of k-means clustering for data generated from other high-throughput technologies, sparse NMF-based clustering methods deserve further investigation with respect to their usefulness in automated gating of FCM data." @default.
- W2024887626 created "2016-06-24" @default.
- W2024887626 creator A5043915305 @default.
- W2024887626 date "2010-12-22" @default.
- W2024887626 modified "2023-10-17" @default.
- W2024887626 title "On extensions of k-means clustering for automated gating of flow cytometry data" @default.
- W2024887626 cites W1902027874 @default.
- W2024887626 cites W1985622505 @default.
- W2024887626 cites W2011430131 @default.
- W2024887626 cites W2114124472 @default.
- W2024887626 cites W2130167132 @default.
- W2024887626 cites W2132692097 @default.
- W2024887626 cites W2136787567 @default.
- W2024887626 cites W2139280638 @default.
- W2024887626 cites W2140095548 @default.
- W2024887626 cites W2145274698 @default.
- W2024887626 cites W2145713585 @default.
- W2024887626 doi "https://doi.org/10.1002/cyto.a.20988" @default.
- W2024887626 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/21182177" @default.
- W2024887626 hasPublicationYear "2010" @default.
- W2024887626 type Work @default.
- W2024887626 sameAs 2024887626 @default.
- W2024887626 citedByCount "12" @default.
- W2024887626 countsByYear W20248876262012 @default.
- W2024887626 countsByYear W20248876262013 @default.
- W2024887626 countsByYear W20248876262014 @default.
- W2024887626 countsByYear W20248876262015 @default.
- W2024887626 countsByYear W20248876262016 @default.
- W2024887626 countsByYear W20248876262023 @default.
- W2024887626 crossrefType "journal-article" @default.
- W2024887626 hasAuthorship W2024887626A5043915305 @default.
- W2024887626 hasConcept C12554922 @default.
- W2024887626 hasConcept C153911025 @default.
- W2024887626 hasConcept C154945302 @default.
- W2024887626 hasConcept C185592680 @default.
- W2024887626 hasConcept C194544171 @default.
- W2024887626 hasConcept C207968372 @default.
- W2024887626 hasConcept C2780339063 @default.
- W2024887626 hasConcept C41008148 @default.
- W2024887626 hasConcept C553184892 @default.
- W2024887626 hasConcept C70721500 @default.
- W2024887626 hasConcept C73555534 @default.
- W2024887626 hasConcept C86803240 @default.
- W2024887626 hasConceptScore W2024887626C12554922 @default.
- W2024887626 hasConceptScore W2024887626C153911025 @default.
- W2024887626 hasConceptScore W2024887626C154945302 @default.
- W2024887626 hasConceptScore W2024887626C185592680 @default.
- W2024887626 hasConceptScore W2024887626C194544171 @default.
- W2024887626 hasConceptScore W2024887626C207968372 @default.
- W2024887626 hasConceptScore W2024887626C2780339063 @default.
- W2024887626 hasConceptScore W2024887626C41008148 @default.
- W2024887626 hasConceptScore W2024887626C553184892 @default.
- W2024887626 hasConceptScore W2024887626C70721500 @default.
- W2024887626 hasConceptScore W2024887626C73555534 @default.
- W2024887626 hasConceptScore W2024887626C86803240 @default.
- W2024887626 hasIssue "1" @default.
- W2024887626 hasLocation W20248876261 @default.
- W2024887626 hasLocation W20248876262 @default.
- W2024887626 hasOpenAccess W2024887626 @default.
- W2024887626 hasPrimaryLocation W20248876261 @default.
- W2024887626 hasRelatedWork W1572687853 @default.
- W2024887626 hasRelatedWork W1966652204 @default.
- W2024887626 hasRelatedWork W2002900699 @default.
- W2024887626 hasRelatedWork W2030130210 @default.
- W2024887626 hasRelatedWork W2090588873 @default.
- W2024887626 hasRelatedWork W2176701861 @default.
- W2024887626 hasRelatedWork W2538538514 @default.
- W2024887626 hasRelatedWork W2758463337 @default.
- W2024887626 hasRelatedWork W3009159775 @default.
- W2024887626 hasRelatedWork W4251194061 @default.
- W2024887626 hasVolume "79A" @default.
- W2024887626 isParatext "false" @default.
- W2024887626 isRetracted "false" @default.
- W2024887626 magId "2024887626" @default.
- W2024887626 workType "article" @default.
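
Illustrative note (not part of the SemOpenAlex record or of the article itself): the abstract above describes NMF as finding nonnegative W (n × k) and H (k × m) such that W*H is close to X, and then assigning each sample to one of the k components using the columns of H. The following is a minimal Python sketch of that idea, assuming the Frobenius norm as the closeness metric and the Lee and Seung multiplicative updates. The function names, the toy matrix, the iteration count, and the small constant `eps` are all hypothetical choices made for illustration. This is plain NMF, not the sparse NMF or the alternating nonnegative least squares algorithm mentioned in the abstract.

```python
import numpy as np


def nmf_multiplicative(X, k, n_iter=1000, eps=1e-9, seed=0):
    """Approximate a nonnegative X (n x m) as W @ H with W (n x k), H (k x m).

    Uses multiplicative updates for the Frobenius-norm objective.
    eps is an illustrative guard against division by zero.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps  # random positive start (assumption)
    H = rng.random((k, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Elementwise rescaling keeps every entry of H and W nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))
    return W, H, errors


def cluster_from_H(H):
    """Assign each sample (column of H) to its largest component."""
    return np.argmax(H, axis=0)


# Toy data: 4 features x 6 samples with two obvious sample groups.
X = np.array(
    [
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1, 1],
    ],
    dtype=float,
)

W, H, errors = nmf_multiplicative(X, k=2)
labels = cluster_from_H(H)
print("labels:", labels)
print("first / last reconstruction error:", errors[0], errors[-1])
```

In the limiting case mentioned in the abstract, where each column of H has a single nonzero entry, the argmax assignment reduces to a hard partition of the samples, which is the connection to k-means. Which component index ends up attached to which group depends on the random start, so only the grouping itself is meaningful.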