kmeans - K-Means Clustering
The k-means algorithm finds centers of clusters and groups input samples around the clusters. It’s one of the most popular clustering algorithms.

Function Signature

double cv::kmeans(InputArray data, int K, InputOutputArray bestLabels, TermCriteria criteria, int attempts, int flags, OutputArray centers = noArray())
Parameters
data: Data for clustering. An array of N-dimensional points with float coordinates. Examples:
- Mat points(count, 2, CV_32F) - 2D points as rows
- Mat points(count, 1, CV_32FC2) - 2D points as a two-channel column
- Mat points(1, count, CV_32FC2) - 2D points as a two-channel row
- std::vector<cv::Point2f> points(sampleCount) - vector of points
K: Number of clusters to split the set by. Must be at least 2.
bestLabels: Input/output integer array that stores the cluster indices for every sample. Each element is in the range [0, K-1], indicating which cluster the sample belongs to.
criteria: The algorithm termination criteria: maximum number of iterations and/or desired accuracy. The accuracy is specified as criteria.epsilon; the algorithm stops when each cluster center moves by less than epsilon. Example: TermCriteria(TermCriteria::MAX_ITER + TermCriteria::EPS, 100, 0.01)
attempts: Number of times the algorithm is executed using different initial labellings. The algorithm returns the labels that yield the best compactness. Use at least 3 attempts for better results.
flags: Flag specifying the method for center initialization. Flags:
- KMEANS_RANDOM_CENTERS (0): Select random initial centers in each attempt
- KMEANS_PP_CENTERS (2): Use kmeans++ center initialization (recommended)
- KMEANS_USE_INITIAL_LABELS (1): Use user-supplied labels for the first attempt
centers: Output matrix of the cluster centers, one row per cluster center. Size: K × dimensions
Returns
Type: double
The function returns the compactness measure, computed as the sum of squared distances from each sample to its assigned center:
compactness = Σᵢ ‖samplesᵢ − centers(labelsᵢ)‖²
The best (minimum) compactness value is chosen among all attempts, and the corresponding labels and cluster centers are returned.
Example Usage
For best results, use the KMEANS_PP_CENTERS flag, which implements the kmeans++ initialization algorithm by Arthur and Vassilvitskii. This provides better initial centers than random selection.

EM - Expectation Maximization
The Expectation Maximization algorithm implements Gaussian Mixture Models (GMM) for clustering. It models data as a mixture of multiple Gaussian distributions.

Creating an EM Model
An EM model is created with the static factory method cv::ml::EM::create(), which returns a smart pointer (Ptr<EM>) to the model.
Key Methods
setClustersNumber - Sets the number of mixture components. Parameters:
val (int): Number of clusters/mixtures (default: 5)
setCovarianceMatrixType - Sets the constraint on covariance matrices. Parameters:
val (int): Type of covariance matrices
- EM::COV_MAT_SPHERICAL (0): A scaled identity matrix μ_k * I
- EM::COV_MAT_DIAGONAL (1): A diagonal matrix with positive diagonal elements (recommended)
- EM::COV_MAT_GENERIC (2): A symmetric positive definite matrix
setTermCriteria - Sets the termination criteria of the EM algorithm. Parameters:
val (TermCriteria): Criteria for max iterations or likelihood change. The algorithm stops when either:
- Maximum number of iterations (M-steps) is reached, OR
- Relative change of the likelihood logarithm is less than epsilon
Example: TermCriteria(TermCriteria::MAX_ITER + TermCriteria::EPS, 100, epsilon)

trainEM - Estimates Gaussian mixture parameters from a sample set, starting with the Expectation step. Parameters:
- samples (InputArray): Samples from which the GMM will be estimated (CV_64F, or converted internally)
- logLikelihoods (OutputArray): Optional output matrix of likelihood logarithm values
- labels (OutputArray): Optional output “class label” for each sample
- probs (OutputArray): Optional posterior probabilities matrix (nsamples × nclusters)
trainE - Estimates Gaussian mixture parameters with initial means provided. Parameters:
- samples (InputArray): Training samples matrix
- means0 (InputArray): Initial means of mixture components (nclusters × dims)
- covs0 (InputArray): Optional initial covariance matrices
- weights0 (InputArray): Optional initial weights of mixture components
- logLikelihoods (OutputArray): Optional likelihood logarithm output
- labels (OutputArray): Optional cluster labels output
- probs (OutputArray): Optional posterior probabilities output
trainM - Estimates Gaussian mixture parameters starting with the Maximization step. Parameters:
- samples (InputArray): Training samples
- probs0 (InputArray): Initial probabilities
- logLikelihoods (OutputArray): Optional likelihood output
- labels (OutputArray): Optional labels output
- probs (OutputArray): Optional probabilities output
predict - Returns posterior probabilities for the provided samples. Parameters:
- samples (InputArray): Input samples matrix
- results (OutputArray): Optional output matrix of results (nsamples × nclusters)
- flags (int): Optional flags (ignored)
predict2 - Returns the likelihood logarithm value and the index of the most probable mixture component for a single sample. Parameters:
- sample (InputArray): A sample for classification (1 × dims or dims × 1)
- probs (OutputArray): Optional posterior probabilities (1 × nclusters, CV_64FC1)
The method returns a two-element vector (Vec2d):
- Element [0]: Likelihood logarithm value
- Element [1]: Index of most probable mixture component
getWeights - Returns the weights of the mixtures. Returns: Mat - vector with the number of elements equal to the number of mixtures
getMeans - Returns the cluster centers (means of the Gaussian mixture). Returns: Mat - matrix with rows = number of mixtures, cols = space dimensionality
getCovs - Returns the covariance matrices. Parameters:
covs (std::vector<Mat>&): Output vector of covariance matrices
Example Usage
When to Use EM vs K-Means
Use EM when:
- Clusters have different shapes and sizes
- You need probabilistic cluster assignments
- Data follows Gaussian distributions
- You want to model uncertainty in cluster membership

Use K-Means when:
- Clusters are roughly spherical and similar in size
- You need hard cluster assignments
- Speed is critical (K-means is faster)
- You have very large datasets
The EM algorithm is more flexible than k-means as it can model elliptical clusters with different orientations and sizes. However, it’s more computationally expensive and requires more samples for reliable estimation.
See Also
- Classification Algorithms - SVM, KNN, Decision Trees, and more
- Regression Algorithms - Linear and logistic regression methods
- TrainData Class - Managing training data
