sklekmeans.MiniBatchEKMeans

class sklekmeans.MiniBatchEKMeans(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)

Mini-batch Equilibrium K-Means.

Scalable mini-batch optimisation of the equilibrium k-means objective. Two update schemes are supported: an accumulation scheme (learning_rate=None) and an online exponential moving average scheme (learning_rate set to a positive float).
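As a rough illustration of the online exponential moving average scheme, the sketch below applies the rule C_k <- (1 - lr) * C_k + lr * xbar_k to one batch. The function name `ema_update` and the `weights` array (one column of per-sample weights per cluster) are hypothetical stand-ins, not part of the sklekmeans API; the estimator derives its own equilibrium weights internally.

```python
import numpy as np

def ema_update(centers, X_batch, weights, lr):
    """Blend each center toward its weighted batch mean (hypothetical helper)."""
    centers = np.asarray(centers, dtype=float).copy()
    X_batch = np.asarray(X_batch, dtype=float)
    weights = np.asarray(weights, dtype=float)    # shape (n_batch, n_clusters)
    mass = weights.sum(axis=0)                    # total weight per cluster
    for k in range(centers.shape[0]):
        if mass[k] > 0:                           # skip clusters with no weight in this batch
            xbar_k = weights[:, k] @ X_batch / mass[k]   # weighted mean of cluster k
            centers[k] = (1.0 - lr) * centers[k] + lr * xbar_k
    return centers

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
X_batch = np.array([[2.0, 0.0], [8.0, 10.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0]])      # hard weights, chosen only for readability
new_centers = ema_update(centers, X_batch, weights, lr=0.5)
print(new_centers)
```

With lr=0.5 each center moves halfway toward its batch mean; smaller values make the centers react more slowly to any single batch.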

Parameters:

n_clusters (int, default=8) – The number of clusters to form / centers to learn.
metric ({'euclidean', 'manhattan'}, default='euclidean') – Distance metric used for batch distance computations. Manhattan can be more robust to certain outliers but is slower than vectorised Euclidean.
alpha (float or {'dvariance'}, default='dvariance') – Equilibrium weighting parameter controlling the sharpness of the soft membership distribution prior to equilibrium correction. If 'dvariance', a heuristic value is derived from a subsample of the data (see init_size) as scale / mean(d^2), where d^2 are the squared distances to the subsample mean.
scale (float, default=2.0) – Multiplicative factor in the 'dvariance' heuristic. Larger values produce larger effective alpha leading to crisper initial memberships.
batch_size (int, default=256) – Number of samples per mini-batch. A larger batch size reduces variance of updates but increases per-step cost and memory.
max_epochs (int, default=10) – Maximum number of full passes (epochs) over the training data.
n_init (int, default=1) – Number of random initialisations. The algorithm will run mini-batch optimisation n_init times with different seeds (derived from random_state) and keep the run with the lowest internal equilibrium objective (evaluated on the full dataset), which improves robustness to local minima.
init ({'k-means', 'k-means++', 'random'} or ndarray of shape (n_clusters, n_features), default='k-means++') –
Initialization method.
- 'k-means': run a short standard k-means (Euclidean) to obtain initial centers. With a non-Euclidean metric, this serves only as heuristic seeding.
- 'k-means++': probabilistic seeding adapted to the chosen metric.
- 'random': choose n_clusters observations without replacement.
- ndarray: user-specified initial centers of shape (n_clusters, n_features).
init_size (int or None, default=None) – Subsample size used to estimate the 'dvariance' heuristic. If None a size based on max(10 * n_clusters, batch_size) is used (capped at n_samples). Ignored when alpha is a numeric value.
shuffle (bool, default=True) – Whether to shuffle sample order at the beginning of each epoch. Recommended for i.i.d. data to decorrelate batches.
learning_rate (float or None, default=None) –
If None use accumulation mode (centers are the weighted average of all processed batches). If a positive float, perform online exponential moving average updates:
```
C_k <- (1 - lr) * C_k + lr * xbar_k
```
where xbar_k is the weighted mean of cluster k in the current batch.
tol (float, default=1e-4) – Convergence tolerance on Frobenius norm of center change scaled by average feature variance of the dataset (computed once).
reassignment_ratio (float, default=0.0) – Minimum fraction of batch weight a cluster must receive to be updated. Clusters not meeting the threshold accumulate a patience counter (see reassign_patience).
reassign_patience (int, default=3) – Number of consecutive batches a cluster can fail the reassignment_ratio threshold before it is forcibly reassigned to a far point in the current batch.
verbose (int, default=0) – Verbosity level. 0 is silent; higher values print epoch diagnostics every print_every epochs.
monitor_size (int or None, default=1024) – Size of a subsample used to compute an approximate objective for monitoring (stored in objective_approx_). If None the full dataset is used (higher cost).
print_every (int, default=1) – Frequency (in epochs) at which progress messages are printed when verbose > 0.
use_numba (bool, default=False) – If True and numba is installed use a JIT-compiled kernel for the equilibrium weight computation.
numba_threads (int or None, default=None) – Number of threads to request from numba’s threading layer (if available). Ignored when numba is not installed or use_numba=False.
random_state (int, RandomState instance or None, default=None) – Controls reproducibility of center initialisation, alpha heuristic subsampling and shuffling. Pass an int for deterministic behaviour.
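The 'dvariance' heuristic described under alpha can be sketched in a few lines of NumPy. This is a minimal illustration, assuming squared Euclidean distances; the function `dvariance_alpha` and its `seed` argument are hypothetical, and the sketch takes the subsample size directly rather than reproducing the max(10 * n_clusters, batch_size) default.

```python
import numpy as np

def dvariance_alpha(X, scale=2.0, init_size=None, seed=0):
    """Estimate alpha as scale / mean(d^2) on a subsample (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    size = n if init_size is None else min(init_size, n)   # cap at n_samples
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=size, replace=False)           # subsample without replacement
    sub = X[idx]
    d2 = ((sub - sub.mean(axis=0)) ** 2).sum(axis=1)        # squared distance to subsample mean
    return scale / d2.mean()

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
alpha = dvariance_alpha(X, scale=2.0)
print(alpha)
```

Because alpha is inversely proportional to the mean squared spread, rescaling the features changes the resolved value; passing a numeric alpha bypasses the heuristic entirely.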

cluster_centers_

Final centers.

Type: ndarray of shape (n_clusters, n_features)

labels_

Hard assignment labels for the training data (available only after calling fit()). Not updated by partial_fit().

Type: ndarray of shape (n_samples,)

alpha_

Resolved alpha value.

Type: float

objective_approx_

Epoch-wise approximate objectives.

Type: list of float

counts_

Accumulated weights (accumulation mode).

Type: ndarray of shape (n_clusters,)

sums_

Accumulated weighted sums (accumulation mode).

Type: ndarray of shape (n_clusters, n_features)
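In accumulation mode the centers are described as the weighted average of all processed batches, which suggests they can be recovered as the accumulated sums divided by the accumulated weights. The sketch below illustrates that bookkeeping under this assumption; the `accumulate` helper and the hard `weights` arrays are hypothetical and do not reproduce the estimator's internal weight computation.

```python
import numpy as np

def accumulate(sums, counts, X_batch, weights):
    """Add one batch to the running weighted sums and weights (hypothetical helper)."""
    sums = sums + weights.T @ X_batch          # (n_clusters, n_features)
    counts = counts + weights.sum(axis=0)      # (n_clusters,)
    return sums, counts

n_clusters, n_features = 2, 2
sums = np.zeros((n_clusters, n_features))
counts = np.zeros(n_clusters)

# Two batches with hard weights, chosen only to keep the arithmetic readable.
batches = [
    (np.array([[0.0, 0.0], [4.0, 4.0]]), np.array([[1.0, 0.0], [0.0, 1.0]])),
    (np.array([[2.0, 0.0], [6.0, 4.0]]), np.array([[1.0, 0.0], [0.0, 1.0]])),
]
for X_batch, weights in batches:
    sums, counts = accumulate(sums, counts, X_batch, weights)

centers = sums / counts[:, None]               # weighted average over all batches seen
print(centers)
```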

W_, U_

Final weights and memberships for the full training data (available only after calling fit()).

Type: ndarrays

n_features_in_

Number of features seen during the first call to fit() or partial_fit(). Ensures consistent dimensionality across incremental updates and predictions.

Type: int

Notes

The approximate objective is tracked on a monitoring subset when monitor_size is not None and stored in objective_approx_.

Examples

>>> from sklekmeans import MiniBatchEKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 0], [4, 4],
...               [4, 5], [0, 1], [2, 2],
...               [3, 2], [5, 5], [1, -1]])
>>> # manually fit on batches
>>> ekmeans = MiniBatchEKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6)
>>> ekmeans = ekmeans.partial_fit(X[0:6,:])
>>> ekmeans = ekmeans.partial_fit(X[6:12,:])
>>> ekmeans.cluster_centers_
array([[3.52095093, 3.04647593],
       [0.74811989, 0.65697575]])
>>> ekmeans.predict([[0, 0], [4, 4]])
array([1, 0])
>>> # fit on the whole data
>>> ekmeans = MiniBatchEKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          max_epochs=10).fit(X)
>>> ekmeans.cluster_centers_
array([[1.53817648, 0.5856779],
       [3.45670363, 3.73923965]])
>>> ekmeans.predict([[0, 0], [4, 4]])
array([0, 1])

__init__(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)

Methods

`__init__`([n_clusters, metric, alpha, scale, ...])
`fit`(X[, y])	Train the mini-batch equilibrium k-means estimator.
`fit_membership`(X[, y])	Fit to `X` and return the final membership matrix for training data.
`fit_predict`(X[, y])	Fit to `X` and return hard assignments.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`membership`(X)	Compute soft membership matrix for samples in `X`.
`partial_fit`(X_batch[, y])	Incrementally update the model with a single mini-batch.
`predict`(X)	Assign each sample in `X` to the closest learned center.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`set_partial_fit_request`(*[, X_batch])	Configure whether metadata should be requested to be passed to the `partial_fit` method.
`transform`(X)	Compute distances from samples to cluster centers.

fit(X, y=None)

Train the mini-batch equilibrium k-means estimator.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – For API consistency.

Returns:

self – Fitted estimator.

Return type:

object

fit_membership(X, y=None)

Fit to X and return the final membership matrix for training data.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (Ignored) – Present for API consistency.

Returns:

U – Membership matrix U_ computed on the training data.

Return type:

ndarray of shape (n_samples, n_clusters)

fit_predict(X, y=None)

Fit to X and return hard assignments.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Present for API consistency.

Returns:

labels – Hard cluster assignments for the input samples.

Return type:

ndarray of shape (n_samples,)

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns: routing – A MetadataRequest encapsulating routing information.
Return type: MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

membership(X)

Compute soft membership matrix for samples in X.

Memberships use the fitted alpha_ and a row-wise normalization of exp(-alpha * d^2_shift), where d^2_shift is the squared distance with each row's minimum subtracted for numerical stability.

Parameters: X (array-like of shape (n_samples, n_features)) – Samples for which to compute memberships.
Returns: U – Row-stochastic membership matrix (rows sum to 1).
Return type: ndarray of shape (n_samples, n_clusters)
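The normalization described for membership() amounts to a numerically stable softmax over negative scaled squared distances. The following is a minimal NumPy sketch of that formula, assuming squared Euclidean distances; `soft_membership` is a hypothetical helper and not the library's implementation.

```python
import numpy as np

def soft_membership(X, centers, alpha):
    """Row-normalized exp(-alpha * d^2_shift) (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distances, shape (n_samples, n_clusters).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Subtract each row's minimum so the largest exponent is exp(0) = 1.
    d2_shift = d2 - d2.min(axis=1, keepdims=True)
    E = np.exp(-alpha * d2_shift)
    return E / E.sum(axis=1, keepdims=True)

centers = np.array([[0.0, 0.0], [4.0, 4.0]])
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
U = soft_membership(X, centers, alpha=0.5)
print(U)
```

The shift leaves the normalized result unchanged, since the common factor cancels in each row, but it avoids underflow of every term when all distances are large.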

partial_fit(X_batch, y=None)

Incrementally update the model with a single mini-batch.

Parameters:

X_batch (array-like of shape (batch_size, n_features)) – Mini-batch of samples.
y (Ignored) – For API consistency.

Returns:

self – Updated estimator.

Return type:

object

predict(X)

Assign each sample in X to the closest learned center.

Parameters: X (array-like of shape (n_samples, n_features)) – New samples to assign.
Returns: labels – Indices of the nearest centers under the configured metric.
Return type: ndarray of shape (n_samples,)
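Because the nearest center is chosen under the configured metric, the same sample can receive different labels for 'euclidean' and 'manhattan'. The sketch below shows this with a hypothetical `nearest_center` helper written in plain NumPy; it illustrates the idea and is not the estimator's code path.

```python
import numpy as np

def nearest_center(X, centers, metric="euclidean"):
    """Index of the closest center per sample (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]        # (n_samples, n_clusters, n_features)
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=2))
    elif metric == "manhattan":
        dist = np.abs(diff).sum(axis=2)
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    return dist.argmin(axis=1)

centers = np.array([[3.0, 0.0], [2.0, 2.0]])
X = np.array([[0.0, 0.0]])
labels_euclidean = nearest_center(X, centers, metric="euclidean")
labels_manhattan = nearest_center(X, centers, metric="manhattan")
print(labels_euclidean, labels_manhattan)
```

Here the Euclidean distances are 3 and about 2.83, so the second center wins, while the Manhattan distances are 3 and 4, so the first center wins.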

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

set_partial_fit_request(*, X_batch: bool | None | str = '$UNCHANGED$') → MiniBatchEKMeans

Configure whether metadata should be requested to be passed to the partial_fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: X_batch (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_batch parameter in partial_fit.
Returns: self – The updated object.
Return type: object

transform(X)

Compute distances from samples to cluster centers.

Parameters: X (array-like of shape (n_samples, n_features)) – Samples to transform.
Returns: distances – Pairwise distances to cluster_centers_ using the estimator's metric ('euclidean' or 'manhattan'). For Euclidean, distances are non-squared.
Return type: ndarray of shape (n_samples, n_clusters)
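To make the shape and scale of the transform() output concrete, the sketch below computes the sample-to-center distance matrix by hand for both metrics. The helper `center_distances` is hypothetical and only mirrors the description above (non-squared Euclidean, or Manhattan).

```python
import numpy as np

def center_distances(X, centers, metric="euclidean"):
    """Distance from every sample to every center (hypothetical helper)."""
    diff = np.asarray(X, dtype=float)[:, None, :] - np.asarray(centers, dtype=float)[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))   # non-squared Euclidean distance
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)           # sum of absolute coordinate differences
    raise ValueError(f"unsupported metric: {metric!r}")

centers = np.array([[0.0, 0.0], [3.0, 4.0]])
X = np.array([[0.0, 0.0], [3.0, 0.0]])
D_euclidean = center_distances(X, centers, metric="euclidean")
D_manhattan = center_distances(X, centers, metric="manhattan")
print(D_euclidean)
print(D_manhattan)
```

Each row corresponds to one sample and each column to one center, so the result has shape (n_samples, n_clusters).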
