sklekmeans.MiniBatchEKMeans

class sklekmeans.MiniBatchEKMeans(n_clusters=8, *, metric='euclidean', alpha=0.5, scale=2.0, batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)

Mini-batch Equilibrium K-Means.

Scalable mini-batch optimisation of the equilibrium k-means objective. Two update schemes are supported: an accumulation scheme (learning_rate=None) and an online exponential moving average scheme (learning_rate set to a positive float).

Parameters:
  • n_clusters (int, default=8) – The number of clusters to form / centers to learn.

  • metric ({'euclidean', 'manhattan'}, default='euclidean') – Distance metric used for batch distance computations. Manhattan can be more robust to certain outliers but is slower than vectorised Euclidean.

  • alpha (float or {'dvariance'}, default=0.5) – Equilibrium weighting parameter controlling the sharpness of the soft membership distribution prior to equilibrium correction. If 'dvariance', a heuristic value is derived from a subsample of the data (see init_size) as scale / mean(d^2), where d^2 are the squared distances to the subsample mean.

  • scale (float, default=2.0) – Multiplicative factor in the 'dvariance' heuristic. Larger values produce a larger effective alpha, leading to crisper initial memberships.

  • batch_size (int, default=256) – Number of samples per mini-batch. A larger batch size reduces variance of updates but increases per-step cost and memory.

  • max_epochs (int, default=10) – Maximum number of full passes (epochs) over the training data.

  • n_init (int, default=1) – Number of random initialisations. The algorithm will run mini-batch optimisation n_init times with different seeds (derived from random_state) and keep the run with the lowest internal equilibrium objective (evaluated on the full dataset), which improves robustness to local minima.

  • init ({'k-means++', 'random'} or ndarray of shape (n_clusters, n_features), default='k-means++') – Initialization method.

    • 'k-means++': probabilistic seeding adapted for the chosen metric.

    • 'random': choose n_clusters observations without replacement.

    • ndarray: user-specified initial centers.

  • init_size (int or None, default=None) – Subsample size used to estimate the 'dvariance' heuristic. If None, a size based on max(10 * n_clusters, batch_size) is used (capped at n_samples). Ignored when alpha is a numeric value.

  • shuffle (bool, default=True) – Whether to shuffle sample order at the beginning of each epoch. Recommended for i.i.d. data to decorrelate batches.

  • learning_rate (float or None, default=None) –

    If None, use accumulation mode (centers are the weighted average of all processed batches). If a positive float, perform online exponential moving average updates:

    C_k <- (1 - lr) * C_k + lr * xbar_k
    

    where xbar_k is the weighted mean of cluster k in the current batch.

  • tol (float, default=1e-4) – Convergence tolerance on the Frobenius norm of the center change, scaled by the average feature variance of the dataset (computed once).

  • reassignment_ratio (float, default=0.0) – Minimum fraction of batch weight a cluster must receive to be updated. Clusters not meeting the threshold accumulate a patience counter (see reassign_patience).

  • reassign_patience (int, default=3) – Number of consecutive batches a cluster can fail the reassignment_ratio threshold before it is forcibly reassigned to a far point in the current batch.

  • verbose (int, default=0) – Verbosity level. 0 is silent; higher values print epoch diagnostics every print_every epochs.

  • monitor_size (int or None, default=1024) – Size of the subsample used to compute an approximate objective for monitoring (stored in objective_approx_). If None, the full dataset is used (higher cost).

  • print_every (int, default=1) – Frequency (in epochs) at which progress messages are printed when verbose > 0.

  • use_numba (bool, default=False) – If True and numba is installed, use a JIT-compiled kernel for the equilibrium weight computation.

  • numba_threads (int or None, default=None) – Number of threads to request from numba’s threading layer (if available). Ignored when numba is not installed or use_numba=False.

  • random_state (int, RandomState instance or None, default=None) – Controls reproducibility of center initialisation, alpha heuristic subsampling and shuffling. Pass an int for deterministic behaviour.
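The 'dvariance' heuristic described for alpha, scale and init_size can be sketched with NumPy alone. This is an illustration of the formula scale / mean(d^2), not the library's implementation: the helper name dvariance_alpha is hypothetical, squared Euclidean distances are assumed, and the default subsample size is taken literally as max(10 * n_clusters, batch_size) capped at n_samples.

```python
import numpy as np

def dvariance_alpha(X, scale=2.0, init_size=None, n_clusters=8,
                    batch_size=256, random_state=None):
    """Hypothetical helper: alpha = scale / mean(d^2) on a random subsample."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    if init_size is None:
        # Assumed default, following the init_size description.
        init_size = max(10 * n_clusters, batch_size)
    init_size = min(init_size, n_samples)  # capped at n_samples

    rng = np.random.default_rng(random_state)
    idx = rng.choice(n_samples, size=init_size, replace=False)
    sub = X[idx]

    # Squared Euclidean distances to the subsample mean (assumption).
    d2 = ((sub - sub.mean(axis=0)) ** 2).sum(axis=1)
    return scale / d2.mean()

# Four corners of a square: every point has d^2 = 2 to the mean (1, 1).
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
alpha = dvariance_alpha(X, scale=2.0)
print(alpha)  # 1.0
```

Doubling scale doubles the resulting alpha, which matches the note that larger scale values lead to crisper memberships.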

cluster_centers_

Final centers.

Type:

ndarray of shape (n_clusters, n_features)

labels_

Hard assignment labels for the training data (available only after calling fit()). Not updated by partial_fit().

Type:

ndarray of shape (n_samples,)

alpha_

Resolved alpha value.

Type:

float

objective_approx_

Epoch-wise approximate objectives.

Type:

list of float

counts_

Accumulated weights (accumulation mode).

Type:

ndarray of shape (n_clusters,)

sums_

Accumulated weighted sums (accumulation mode).

Type:

ndarray of shape (n_clusters, n_features)
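The two update schemes selected by learning_rate can be contrasted in a small NumPy sketch. It assumes that, in accumulation mode, the centers are recovered as sums_ / counts_, and that the online mode applies C_k <- (1 - lr) * C_k + lr * xbar_k as stated for learning_rate. The helper batch_stats and the toy weight matrix W are hypothetical, and no guard against a zero total weight is included.

```python
import numpy as np

def batch_stats(X_batch, W):
    """Per-cluster weight totals and weighted sums for one batch.

    W has shape (n_samples, n_clusters); this helper is illustrative only.
    """
    counts = W.sum(axis=0)   # shape (n_clusters,)
    sums = W.T @ X_batch     # shape (n_clusters, n_features)
    return counts, sums

X1 = np.array([[0.0, 0.0], [2.0, 2.0]])
X2 = np.array([[2.0, 0.0], [4.0, 2.0]])
W = np.eye(2)  # toy weights: sample i belongs entirely to cluster i

# Accumulation mode (learning_rate=None): running totals over all batches.
counts_ = np.zeros(2)
sums_ = np.zeros((2, 2))
for X_batch in (X1, X2):
    counts, sums = batch_stats(X_batch, W)
    counts_ += counts
    sums_ += sums
acc_centers = sums_ / counts_[:, None]

# Online mode (learning_rate=0.1): exponential moving average of batch means.
lr = 0.1
ema_centers = X1.copy()  # pretend these are the centers after the first batch
counts, sums = batch_stats(X2, W)
xbar = sums / counts[:, None]
ema_centers = (1.0 - lr) * ema_centers + lr * xbar

print(acc_centers)  # [[1. 0.] [3. 2.]]
print(ema_centers)  # [[0.2 0. ] [2.2 2. ]]
```

Accumulation weighs every processed batch equally, whereas the moving average forgets old batches at a rate set by lr, which is usually the better fit for drifting data streams.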

W_, U_

Final weights and memberships for the full training data (available only after calling fit()).

Type:

ndarrays

n_features_in_

Number of features seen during the first call to fit() or partial_fit(). Ensures consistent dimensionality across incremental updates and predictions.

Type:

int

fit(X, y=None)

Run full mini-batch training until convergence or max epochs.

partial_fit(X_batch, y=None)

Update model parameters using a single mini-batch.

predict(X)

Return hard cluster labels for samples.

transform(X)

Return distances from samples to cluster centers.

membership(X)

Compute soft membership (row-normalized responsibilities).

fit_predict(X, y=None)

Fit the model and return hard labels for X.

fit_membership(X, y=None)

Fit the model and return the membership matrix for the training data.

(`_init_centers`, `_resolve_alpha`, `_calc_weight` and `_approx_objective` are internal helpers and implementation details.)

Notes

The approximate objective is stored in objective_approx_, with one value per epoch. It is evaluated on a monitoring subsample of size monitor_size, or on the full dataset when monitor_size is None.
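The monitoring idea can be illustrated as follows. This sketch only shows how a subsample might be chosen and reused for a per-epoch diagnostic. The function names are hypothetical, and placeholder_objective is a stand-in (mean squared distance to the nearest center), not the equilibrium objective the estimator actually tracks. Whether the library redraws the subsample each epoch is not stated here, so a fixed subsample is assumed.

```python
import numpy as np

def monitor_indices(n_samples, monitor_size=1024, random_state=None):
    """Row indices used for monitoring; all rows when monitor_size is None."""
    if monitor_size is None or monitor_size >= n_samples:
        return np.arange(n_samples)
    rng = np.random.default_rng(random_state)
    return rng.choice(n_samples, size=monitor_size, replace=False)

def placeholder_objective(X, centers):
    # Stand-in diagnostic only, NOT the equilibrium k-means objective.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])

idx = monitor_indices(len(X), monitor_size=1024, random_state=0)
# One value would be appended per epoch; a single evaluation is shown here.
objective_approx = [placeholder_objective(X[idx], centers)]
```

Evaluating on 1024 rows instead of all 5000 keeps the per-epoch monitoring cost roughly constant as the dataset grows, at the price of a noisier estimate.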

Examples

>>> from sklekmeans import MiniBatchEKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 0], [4, 4],
...               [4, 5], [0, 1], [2, 2],
...               [3, 2], [5, 5], [1, -1]])
>>> # manually fit on batches
>>> ekmeans = MiniBatchEKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6)
>>> ekmeans = ekmeans.partial_fit(X[0:6,:])
>>> ekmeans = ekmeans.partial_fit(X[6:12,:])
>>> ekmeans.cluster_centers_
array([[3.47914144, 3.02885195],
       [0.73800796, 0.61514045]])
>>> ekmeans.predict([[0, 0], [4, 4]])
array([1, 0])
>>> # fit on the whole data
>>> ekmeans = MiniBatchEKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          max_epochs=10).fit(X)
>>> ekmeans.cluster_centers_
array([[3.51549642, 4.53433897],
       [1.98848002, 0.97403648]])
>>> ekmeans.predict([[0, 0], [4, 4]])
array([1, 0])
__init__(n_clusters=8, *, metric='euclidean', alpha=0.5, scale=2.0, batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)

Methods

__init__([n_clusters, metric, alpha, scale, ...])

fit(X[, y])

Train the mini-batch equilibrium k-means estimator.

fit_membership(X[, y])

Fit to X and return the final membership matrix for training data.

fit_predict(X[, y])

Fit to X and return hard assignments.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

membership(X)

Compute soft membership matrix for samples in X.

partial_fit(X_batch[, y])

Incrementally update the model with a single mini-batch.

predict(X)

Assign each sample in X to the closest learned center.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

set_partial_fit_request(*[, X_batch])

Configure whether metadata should be requested to be passed to the partial_fit method.

transform(X)

Compute distances from samples to cluster centers.

fit(X, y=None)

Train the mini-batch equilibrium k-means estimator.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (Ignored) – For API consistency.

Returns:

self – Fitted estimator.

Return type:

object
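How batch_size, max_epochs and shuffle interact during fit can be pictured with a small index generator. This is a sketch under assumptions, not the estimator's training loop: the name iter_minibatches is hypothetical, the last batch of an epoch is allowed to be smaller than batch_size, and the early stop on tol is omitted.

```python
import numpy as np

def iter_minibatches(n_samples, batch_size=256, max_epochs=10,
                     shuffle=True, random_state=None):
    """Yield (epoch, index_array) pairs that cover the data once per epoch."""
    rng = np.random.default_rng(random_state)
    for epoch in range(max_epochs):
        # Reshuffle at the beginning of each epoch when shuffle=True.
        order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
        for start in range(0, n_samples, batch_size):
            yield epoch, order[start:start + batch_size]

batches = list(iter_minibatches(n_samples=10, batch_size=4,
                                max_epochs=2, random_state=0))
sizes = [len(idx) for _, idx in batches]
print(sizes)  # [4, 4, 2, 4, 4, 2]
```

Each yielded index array would select the rows passed to one center update. A real loop would also check the convergence criterion after each epoch and stop before max_epochs if it is met.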

fit_membership(X, y=None)

Fit to X and return the final membership matrix for training data.

fit_predict(X, y=None)

Fit to X and return hard assignments.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

membership(X)

Compute soft membership matrix for samples in X.
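The exact membership formula is not given on this page beyond the fact that rows are normalized. As an illustration only, the sketch below assumes memberships are a softmax of -alpha * distance. The function name softmax_membership is hypothetical, and the library may compute memberships differently.

```python
import numpy as np

def softmax_membership(D, alpha=0.5):
    """Row-normalized memberships from a distance matrix (assumed softmax form).

    D has shape (n_samples, n_clusters).
    """
    logits = -alpha * np.asarray(D, dtype=float)
    # Subtract the row maximum for numerical stability; rows are unchanged
    # after normalization.
    logits = logits - logits.max(axis=1, keepdims=True)
    U = np.exp(logits)
    return U / U.sum(axis=1, keepdims=True)

D = np.array([[0.0, 2.0],   # sample 0 is much closer to cluster 0
              [1.0, 1.0]])  # sample 1 is equidistant from both clusters
U = softmax_membership(D, alpha=0.5)
```

Under this assumption, a larger alpha pushes each row toward a one-hot vector, which is the sense in which alpha controls the sharpness of the memberships.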

partial_fit(X_batch, y=None)

Incrementally update the model with a single mini-batch.

Parameters:
  • X_batch (array-like of shape (batch_size, n_features)) – Mini-batch of samples.

  • y (Ignored) – For API consistency.

Returns:

self – Updated estimator.

Return type:

object
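Since updates happen one batch at a time, the bookkeeping described for reassignment_ratio and reassign_patience can be sketched per batch. The function name update_patience is hypothetical, and the comparison operators (>= in both places) are assumptions. The actual choice of a far point for the reassigned center is left out.

```python
import numpy as np

def update_patience(batch_weight, patience,
                    reassignment_ratio=0.0, reassign_patience=3):
    """Return (update_mask, new_patience, reassign_mask) for one batch.

    batch_weight: total weight each cluster received in this batch.
    patience: consecutive-failure counters, one per cluster.
    """
    share = batch_weight / batch_weight.sum()
    update_mask = share >= reassignment_ratio            # clusters that get updated
    new_patience = np.where(update_mask, 0, patience + 1)
    reassign_mask = new_patience >= reassign_patience    # assumed threshold test
    new_patience = np.where(reassign_mask, 0, new_patience)  # reset after reassignment
    return update_mask, new_patience, reassign_mask

patience = np.zeros(3, dtype=int)

# Batch 1: clusters 1 and 2 each receive only 5% of the weight.
m1, p1, r1 = update_patience(np.array([9.0, 0.5, 0.5]), patience,
                             reassignment_ratio=0.1, reassign_patience=2)

# Batch 2: cluster 1 recovers, cluster 2 fails a second time in a row.
m2, p2, r2 = update_patience(np.array([6.0, 3.5, 0.5]), p1,
                             reassignment_ratio=0.1, reassign_patience=2)
```

In this toy run only cluster 2 is flagged for reassignment after the second batch, because cluster 1's counter was reset when it met the threshold again.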

predict(X)

Assign each sample in X to the closest learned center.
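Assigning each sample to its closest center amounts to an argmin over the sample-to-center distance matrix. The NumPy sketch below shows this for the Euclidean case. The function name predict_labels and the hand-picked centers are illustrative and not taken from a fitted estimator.

```python
import numpy as np

def predict_labels(X, centers):
    """Index of the nearest center for each row of X (Euclidean case)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared distances suffice for the argmin, so the square root is skipped.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

centers = np.array([[3.5, 3.0], [0.7, 0.6]])  # hand-picked toy centers
labels = predict_labels([[0.0, 0.0], [4.0, 4.0]], centers)
print(labels)  # [1 0]
```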

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_partial_fit_request(*, X_batch: bool | None | str = '$UNCHANGED$') → MiniBatchEKMeans

Configure whether metadata should be requested to be passed to the partial_fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to partial_fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

X_batch (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_batch parameter in partial_fit.

Returns:

self – The updated object.

Return type:

object

transform(X)

Compute distances from samples to cluster centers.
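The shape and meaning of such a distance matrix can be illustrated for both supported metric values. This is a NumPy sketch with a hypothetical helper name. Whether the estimator returns plain or squared Euclidean distances is not stated here, so plain distances are assumed.

```python
import numpy as np

def pairwise_distances_to_centers(X, centers, metric="euclidean"):
    """Distance matrix of shape (n_samples, n_clusters) for a given metric."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unsupported metric: {metric!r}")

X = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])

D_euc = pairwise_distances_to_centers(X, centers, metric="euclidean")
D_man = pairwise_distances_to_centers(X, centers, metric="manhattan")
print(D_euc)  # [[0. 3.] [5. 4.]]
print(D_man)  # [[0. 3.] [7. 4.]]
```

The broadcasted difference array has shape (n_samples, n_clusters, n_features), so for large inputs a chunked computation would be needed to keep memory bounded.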