MiniBatchKMeans

#include <Skigen/Cluster>

template <typename Scalar = double>
class Skigen::MiniBatchKMeans(n_clusters=8, batch_size=100, max_iter=100, random_state=42)

Mini-Batch K-Means clustering.

Alternative online implementation of KMeans that uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function.

Mirrors sklearn.cluster.MiniBatchKMeans.


Parameters:

  • n_clusters : int, default=8 The number of clusters to form.

  • batch_size : int, default=100 Number of samples in each mini-batch.

  • max_iter : int, default=100 Maximum number of iterations.

  • random_state : unsigned int, default=42 Seed for the random number generator.


Attributes:

  • n_clusters : int The number of clusters.

  • cluster_centers : MatrixType Cluster centers (n_clusters × n_features).

  • labels : IndexVector Labels of each training point.

  • inertia : Scalar Sum of squared distances to closest cluster center.


Methods

fit(X)

Fit the MiniBatchKMeans model.

Uses k-means++ initialization on the first batch, then performs mini-batch stochastic updates to cluster centers.

Parameters:

  • X Training data of shape (n_samples, n_features).

Returns:

  • result Reference to the fitted estimator (*this).

Throws:

  • std::invalid_argument — if n_samples < n_clusters.
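The k-means++ initialization mentioned above can be sketched in standard C++ as follows. This is a minimal illustration of the general technique, not Skigen's implementation: the names Point, squared_distance and kmeans_pp_indices are hypothetical, and the fallback for a batch with fewer than k distinct points is an assumption made for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <stdexcept>
#include <vector>

using Point = std::vector<double>;  // hypothetical dense row type

// Squared Euclidean distance between two points of equal dimension.
double squared_distance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        const double d = a[j] - b[j];
        sum += d * d;
    }
    return sum;
}

// Hypothetical helper: picks k row indices of `batch` by k-means++ seeding.
// Throws when the batch has fewer than k rows, like the documented check.
std::vector<std::size_t> kmeans_pp_indices(const std::vector<Point>& batch,
                                           std::size_t k, unsigned seed) {
    if (batch.size() < k) {
        throw std::invalid_argument("n_samples < n_clusters");
    }
    std::vector<std::size_t> chosen;
    if (k == 0) return chosen;

    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> uniform(0, batch.size() - 1);
    chosen.push_back(uniform(rng));  // first center: uniformly at random

    // d2[i] = squared distance from row i to its nearest chosen center.
    std::vector<double> d2(batch.size(), std::numeric_limits<double>::max());
    while (chosen.size() < k) {
        const Point& last = batch[chosen.back()];
        double total = 0.0;
        for (std::size_t i = 0; i < batch.size(); ++i) {
            d2[i] = std::min(d2[i], squared_distance(batch[i], last));
            total += d2[i];
        }
        if (total == 0.0) {
            // Assumption: every row coincides with a center, so any row will do.
            chosen.push_back(uniform(rng));
            continue;
        }
        // Sample the next center with probability proportional to d2[i].
        std::discrete_distribution<std::size_t> pick(d2.begin(), d2.end());
        chosen.push_back(pick(rng));
    }
    return chosen;
}
```

Rows far from every existing center are more likely to be picked, which tends to spread the initial centers across the batch.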

fit_predict(X)

Fit mini-batch k-means and return labels for the training data.


partial_fit(X)

Online update of the cluster centers from a single batch.

Mirrors sklearn's MiniBatchKMeans.partial_fit. The first call initialises centers via k-means++ on the supplied batch (which must therefore contain at least n_clusters samples); subsequent calls perform a single streaming pass over X, updating each assigned center's running mean.

Unlike fit, partial_fit does not populate labels_ or inertia_ (matching sklearn behaviour — those attributes refer to the last fit call only).

Parameters:

  • X : MatrixType Batch of training data, shape (n_samples_batch, n_features).

Returns:

  • result : MiniBatchKMeans Reference to the fitted estimator (*this).

Throws:

  • std::invalid_argument — on first call if n_samples_batch < n_clusters, or on subsequent calls if the feature count differs.
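The running-mean update described for partial_fit can be illustrated with a small standalone sketch. It assumes a per-center sample counter, which is the role cluster_counts_ appears to play; RunningCenter and update_running_mean are hypothetical names and not part of the Skigen API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical state for one cluster: its center and the number of
// samples assigned to it so far.
struct RunningCenter {
    std::vector<double> center;
    std::size_t count = 0;
};

// Moves the center towards x with a per-center step of 1/count, so the
// center stays equal to the mean of every sample assigned to it.
void update_running_mean(RunningCenter& c, const std::vector<double>& x) {
    assert(c.center.size() == x.size());
    c.count += 1;
    const double eta = 1.0 / static_cast<double>(c.count);
    for (std::size_t j = 0; j < x.size(); ++j) {
        c.center[j] += eta * (x[j] - c.center[j]);
    }
}
```

Because the step size shrinks as the count grows, early samples move a center a lot and later samples only nudge it, which is what lets the update run in a single streaming pass.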

partial_fit(X)

Online update from a sparse mini-batch (see dense overload).

Mirrors sklearn's MiniBatchKMeans.partial_fit on sparse input. Each row's nearest centroid is found via the $\|x\|^2 + \|c\|^2 - 2\,x \cdot c$ expansion (sparse dot, dense centroid); the assigned centroid is then updated by a running mean using cluster_counts_.
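As a rough illustration of why the expansion suits sparse input, the sketch below computes the squared distance while touching only the non-zero entries of the row. The SparseRow representation (index/value pairs) and the function names are assumptions made for this example; Skigen's actual sparse matrix type is not shown in this page.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sparse row: (column index, value) pairs for the non-zeros.
using SparseRow = std::vector<std::pair<std::size_t, double>>;

// Squared Euclidean norm of a dense centroid; can be cached per centroid.
double squared_norm(const std::vector<double>& c) {
    double sum = 0.0;
    for (double v : c) sum += v * v;
    return sum;
}

// Squared distance between sparse x and dense c, computed as
// ||x||^2 + ||c||^2 - 2 x.c. Only the non-zeros of x are visited.
double sparse_squared_distance(const SparseRow& x,
                               const std::vector<double>& c,
                               double c_squared_norm) {
    double x_squared_norm = 0.0;
    double dot = 0.0;
    for (const auto& entry : x) {
        assert(entry.first < c.size());
        x_squared_norm += entry.second * entry.second;
        dot += entry.second * c[entry.first];
    }
    return x_squared_norm + c_squared_norm - 2.0 * dot;
}
```

The cost per row is proportional to its number of non-zeros rather than to n_features. In floating point the result can come out slightly negative for nearly identical vectors, so callers that take a square root may want to clamp it at zero.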


predict(X)

Predict the closest cluster each sample belongs to.

Parameters:

  • X : MatrixType New data of shape (n_samples, n_features).

Returns:

  • result : IndexVector Index of the closest cluster for each sample.

Throws:

  • std::runtime_error — if the model has not been fitted.
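The assignment step behind predict, and the inertia value it relates to, can be sketched without the library. This is a brute-force illustration under assumed names (Row, nearest_center, predict_labels, inertia_of); raising std::runtime_error on an empty center list stands in for the documented not-fitted check.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

using Row = std::vector<double>;  // hypothetical dense row type

// Returns the index of the center closest to x; writes the squared
// distance to *best_d2 when a pointer is supplied.
std::size_t nearest_center(const Row& x, const std::vector<Row>& centers,
                           double* best_d2 = nullptr) {
    if (centers.empty()) {
        throw std::runtime_error("model has not been fitted");
    }
    std::size_t best = 0;
    double best_val = std::numeric_limits<double>::max();
    for (std::size_t k = 0; k < centers.size(); ++k) {
        assert(centers[k].size() == x.size());
        double d2 = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double d = x[j] - centers[k][j];
            d2 += d * d;
        }
        if (d2 < best_val) {
            best_val = d2;
            best = k;
        }
    }
    if (best_d2 != nullptr) *best_d2 = best_val;
    return best;
}

// Label of the closest center for every row of X.
std::vector<std::size_t> predict_labels(const std::vector<Row>& X,
                                        const std::vector<Row>& centers) {
    std::vector<std::size_t> labels;
    labels.reserve(X.size());
    for (const Row& x : X) labels.push_back(nearest_center(x, centers));
    return labels;
}

// Inertia: sum of squared distances from each row to its closest center.
double inertia_of(const std::vector<Row>& X, const std::vector<Row>& centers) {
    double total = 0.0;
    for (const Row& x : X) {
        double d2 = 0.0;
        nearest_center(x, centers, &d2);
        total += d2;
    }
    return total;
}
```

Ties are resolved in favour of the lower cluster index here; whether Skigen does the same is not stated in this page.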

Example

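// MiniBatchKMeans: faster than full-batch KMeans on large datasets.
// Assumes <Skigen/Cluster> and <iostream> are included and that X is an
// already populated matrix of shape (n_samples, n_features).
Skigen::MiniBatchKMeans<double> mbk(/*n_clusters=*/3, /*batch_size=*/30, /*max_iter=*/100, /*random_state=*/42);
mbk.fit(X);  // throws std::invalid_argument if n_samples < n_clusters

std::cout << "=== MiniBatchKMeans (k=3, batch=30) ===\n";
std::cout << "Inertia: " << mbk.inertia() << "\n";          // sum of squared distances to closest center
std::cout << "Centers:\n" << mbk.cluster_centers() << "\n"; // n_clusters x n_features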