
KMeans

K-Means partitions $n$ samples into $K$ clusters by minimizing the within-cluster sum of squared distances (inertia).

Objective Function

$$\min_{\{C_k, \mu_k\}} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|_2^2$$

where $\mu_k = \frac{1}{|C_k|}\sum_{x_i \in C_k} x_i$ is the centroid of cluster $C_k$.

Lloyd's Algorithm

K-Means uses Lloyd's algorithm, which alternates two steps until convergence:

  1. Assignment: Assign each sample to the cluster with the nearest centroid: $C_k = \{x_i : \|x_i - \mu_k\| \le \|x_i - \mu_j\| \;\forall\, j\}$
  2. Update: Recompute each centroid as the mean of its assigned samples: $\mu_k \leftarrow \frac{1}{|C_k|}\sum_{x_i \in C_k} x_i$

The algorithm terminates when assignments no longer change or max_iter is reached. The result is a local minimum — n_init independent runs with different initializations are performed, and the solution with the lowest inertia is kept.

k-means++ Initialization

To avoid poor local minima, Skigen uses the k-means++ initialization strategy:

  1. Choose the first centroid $\mu_1$ uniformly at random from the data.
  2. For each subsequent centroid $\mu_k$, sample a data point $x_i$ with probability proportional to $D(x_i)^2$, where $D(x_i)$ is the distance from $x_i$ to its nearest existing centroid.

This initialization spreads the initial centroids apart, leading to faster convergence and better final solutions (Arthur & Vassilvitskii, 2007).
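The seeding procedure can be sketched as follows. This is a standalone illustration on 1-D data, using `std::discrete_distribution` for the $D(x_i)^2$-weighted draw; it is not Skigen's actual implementation, which operates on Eigen matrices:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// k-means++ seeding for 1-D data (illustrative sketch).
std::vector<double> kmeanspp_init(const std::vector<double>& X,
                                  std::size_t K, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<double> centroids;

    // Step 1: first centroid uniformly at random from the data.
    std::uniform_int_distribution<std::size_t> uni(0, X.size() - 1);
    centroids.push_back(X[uni(rng)]);

    // Step 2: draw each remaining centroid with probability
    // proportional to D(x)^2, the squared distance to the
    // nearest already-chosen centroid.
    while (centroids.size() < K) {
        std::vector<double> d2(X.size());
        for (std::size_t i = 0; i < X.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (double c : centroids) {
                double d = X[i] - c;
                best = std::min(best, d * d);
            }
            d2[i] = best;  // weight 0 for points already chosen
        }
        std::discrete_distribution<std::size_t> dist(d2.begin(), d2.end());
        centroids.push_back(X[dist(rng)]);
    }
    return centroids;
}
```

Because an already-chosen point has $D(x_i)^2 = 0$, it can never be drawn again (for distinct data), so the seeds are automatically distinct.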

Computational Complexity

Each iteration of Lloyd's algorithm costs $O(nKp)$, where $n$ is the number of samples, $K$ is the number of clusters, and $p$ is the number of features.

Mirrors sklearn.cluster.KMeans.

Constructor

```cpp
Skigen::KMeans<Scalar> km(int n_clusters = 8, int max_iter = 300,
                          int n_init = 10, unsigned int random_state = 42);
```
| Parameter | Default | Description |
| --- | --- | --- |
| n_clusters | 8 | Number of clusters $K$ |
| max_iter | 300 | Maximum Lloyd iterations per run |
| n_init | 10 | Number of independent initializations (best inertia kept) |
| random_state | 42 | Random seed for reproducibility |

Methods

| Method | Description |
| --- | --- |
| fit(X) | Compute K-Means clustering |
| predict(X) | Assign each sample to the nearest centroid |
| transform(X) | Transform to cluster-distance space ($n \times K$ matrix) |
| fit_predict(X) | Fit and return cluster labels |

Fitted Attributes

| Accessor | Type | Description |
| --- | --- | --- |
| cluster_centers() | MatrixType | Centroid coordinates ($K \times p$) |
| labels() | IndexVector | Cluster labels of training data |
| inertia() | Scalar | Sum of squared distances to centroids |
| n_iter() | int | Number of iterations in the best run |

Example

```cpp
#include <Skigen/Cluster>
#include <Eigen/Dense>
#include <iostream>

int main() {
    // 200 random samples with 3 features each
    Eigen::MatrixXd X = Eigen::MatrixXd::Random(200, 3);

    Skigen::KMeans<double> km(4);
    km.fit(X);

    std::cout << "Inertia: " << km.inertia() << "\n";
    std::cout << "Centers:\n" << km.cluster_centers() << "\n";

    auto labels = km.predict(X);
}
```

References

  • Arthur, D. and Vassilvitskii, S. (2007). "k-means++: The advantages of careful seeding." Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035.