sklekmeans.MiniBatchEKMeans
- class sklekmeans.MiniBatchEKMeans(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)
Mini-batch Equilibrium K-Means.
Scalable mini-batch optimisation of the equilibrium k-means objective, supporting both an accumulation scheme (learning_rate=None) and an online exponential moving average update scheme.
- Parameters:
n_clusters (int, default=8) – The number of clusters to form / centers to learn.
metric ({'euclidean', 'manhattan'}, default='euclidean') – Distance metric used for batch distance computations. Manhattan can be more robust to certain outliers but is slower than vectorised Euclidean.
alpha (float or {'dvariance'}, default='dvariance') – Equilibrium weighting parameter controlling sharpness of the soft membership distribution prior to equilibrium correction. If 'dvariance', a heuristic value is derived from a subsample of the data (see init_size) using scale / mean(d^2), where d^2 are squared distances to the subsample mean.
scale (float, default=2.0) – Multiplicative factor in the 'dvariance' heuristic. Larger values produce a larger effective alpha, leading to crisper initial memberships.
batch_size (int, default=256) – Number of samples per mini-batch. A larger batch size reduces variance of updates but increases per-step cost and memory.
max_epochs (int, default=10) – Maximum number of full passes (epochs) over the training data.
n_init (int, default=1) – Number of random initialisations. The algorithm runs mini-batch optimisation n_init times with different seeds (derived from random_state) and keeps the run with the lowest internal equilibrium objective (evaluated on the full dataset), which improves robustness to local minima.
init ({'k-means', 'k-means++', 'random'} or ndarray of shape (n_clusters, n_features), default='k-means++') – Initialization method.
- 'k-means': run a short standard k-means (Euclidean) to obtain initial centers. When using a non-Euclidean metric, this serves as a heuristic seeding.
- 'k-means++': probabilistic seeding adapted for the chosen metric.
- 'random': choose n_clusters observations without replacement.
- ndarray: user-specified initial centers with shape (n_clusters, n_features).
init_size (int or None, default=None) – Subsample size used to estimate the 'dvariance' heuristic. If None, a size based on max(10 * n_clusters, batch_size) is used (capped at n_samples). Ignored when alpha is a numeric value.
shuffle (bool, default=True) – Whether to shuffle sample order at the beginning of each epoch. Recommended for i.i.d. data to decorrelate batches.
learning_rate (float or None, default=None) – If None, use accumulation mode (centers are the weighted average of all processed batches). If a positive float, perform online exponential moving average updates:
C_k <- (1 - lr) * C_k + lr * xbar_k
where xbar_k is the weighted mean of cluster k in the current batch.
tol (float, default=1e-4) – Convergence tolerance on the Frobenius norm of the center change, scaled by the average feature variance of the dataset (computed once).
reassignment_ratio (float, default=0.0) – Minimum fraction of batch weight a cluster must receive to be updated. Clusters not meeting the threshold accumulate a patience counter (see reassign_patience).
reassign_patience (int, default=3) – Number of consecutive batches a cluster can fail the reassignment_ratio threshold before it is forcibly reassigned to a far point in the current batch.
verbose (int, default=0) – Verbosity level. 0 is silent; higher values print epoch diagnostics every print_every epochs.
monitor_size (int or None, default=1024) – Size of a subsample used to compute an approximate objective for monitoring (stored in objective_approx_). If None, the full dataset is used (higher cost).
print_every (int, default=1) – Frequency (in epochs) at which progress messages are printed when verbose > 0.
use_numba (bool, default=False) – If True and numba is installed, use a JIT-compiled kernel for the equilibrium weight computation.
numba_threads (int or None, default=None) – Number of threads to request from numba's threading layer (if available). Ignored when numba is not installed or use_numba=False.
random_state (int, RandomState instance or None, default=None) – Controls reproducibility of center initialisation, alpha heuristic subsampling and shuffling. Pass an int for deterministic behaviour.
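The 'dvariance' heuristic described for alpha, scale and init_size can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function name dvariance_alpha is hypothetical, and the use of squared Euclidean distances and of sampling without replacement are guesses about details the parameter descriptions leave open.

```python
import numpy as np


def dvariance_alpha(X, scale=2.0, init_size=None, n_clusters=8,
                    batch_size=256, random_state=None):
    """Hypothetical helper: alpha = scale / mean(d^2) on a subsample of X."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]

    # Documented default: max(10 * n_clusters, batch_size), capped at n_samples.
    if init_size is None:
        init_size = max(10 * n_clusters, batch_size)
    init_size = min(init_size, n_samples)

    # Assumption: the subsample is drawn uniformly without replacement.
    rng = np.random.default_rng(random_state)
    idx = rng.choice(n_samples, size=init_size, replace=False)
    sub = X[idx]

    # Assumption: d^2 is the squared Euclidean distance to the subsample mean.
    d2 = ((sub - sub.mean(axis=0)) ** 2).sum(axis=1)
    return scale / d2.mean()


X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
# Every point lies at squared distance 2 from the mean (1, 1), so alpha = 2 / 2.
print(dvariance_alpha(X))  # 1.0
```

Doubling scale doubles the returned alpha, which matches the description of larger scale values producing crisper initial memberships.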
- cluster_centers_
Final centers.
- Type:
ndarray of shape (n_clusters, n_features)
- labels_
Hard assignment labels for the training data (available only after calling fit()). Not updated by partial_fit().
- Type:
ndarray of shape (n_samples,)
- counts_
Accumulated weights (accumulation mode).
- Type:
ndarray of shape (n_clusters,)
- sums_
Accumulated weighted sums (accumulation mode).
- Type:
ndarray of shape (n_clusters, n_features)
- W_, U_
Final weights and memberships for the full training data (available after calling fit()).
- Type:
ndarrays
- n_features_in_
Number of features seen during the first call to fit() or partial_fit(). Ensures consistent dimensionality across incremental updates and predictions.
- Type:
Notes
The approximate objective is tracked on a monitoring subset when monitor_size is not None and stored in objective_approx_.
Examples
>>> from sklekmeans import MiniBatchEKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 0], [4, 4],
...               [4, 5], [0, 1], [2, 2],
...               [3, 2], [5, 5], [1, -1]])
>>> # manually fit on batches
>>> ekmeans = MiniBatchEKMeans(n_clusters=2,
...                            random_state=0,
...                            batch_size=6)
>>> ekmeans = ekmeans.partial_fit(X[0:6,:])
>>> ekmeans = ekmeans.partial_fit(X[6:12,:])
>>> ekmeans.cluster_centers_
array([[3.52095093, 3.04647593],
       [0.74811989, 0.65697575]])
>>> ekmeans.predict([[0, 0], [4, 4]])
array([1, 0])
>>> # fit on the whole data
>>> ekmeans = MiniBatchEKMeans(n_clusters=2,
...                            random_state=0,
...                            batch_size=6,
...                            max_epochs=10).fit(X)
>>> ekmeans.cluster_centers_
array([[1.53817648, 0.5856779],
       [3.45670363, 3.73923965]])
>>> ekmeans.predict([[0, 0], [4, 4]])
array([1, 0])
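The two center-update schemes selected by learning_rate can also be written out directly. The sketch below is an illustration only, not the library's code: the function names are hypothetical, W stands for a (batch_size, n_clusters) weight matrix supplied by the caller (how the estimator computes its equilibrium weights is not shown), and the reading of counts_ and sums_ as running totals whose ratio gives the centers is an assumption based on their descriptions.

```python
import numpy as np


def accumulate_update(sums, counts, X_batch, W):
    """Accumulation mode (learning_rate=None), as assumed here:
    keep running weighted sums and weights; centers are their ratio."""
    sums = sums + W.T @ X_batch        # shape (n_clusters, n_features)
    counts = counts + W.sum(axis=0)    # shape (n_clusters,)
    # No guard against zero counts; a real implementation would need one.
    centers = sums / counts[:, None]
    return sums, counts, centers


def ema_update(centers, X_batch, W, lr):
    """Online mode: C_k <- (1 - lr) * C_k + lr * xbar_k."""
    # xbar_k is the weighted mean of cluster k in the current batch.
    xbar = (W.T @ X_batch) / W.sum(axis=0)[:, None]
    return (1.0 - lr) * centers + lr * xbar


X_batch = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
# Hard 0/1 weights keep the arithmetic easy to follow.
W = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

sums, counts, centers = accumulate_update(np.zeros((2, 2)), np.zeros(2), X_batch, W)
print(centers)   # rows: [1, 0] and [11, 10]

stepped = ema_update(np.array([[0.0, 0.0], [10.0, 10.0]]), X_batch, W, lr=0.5)
print(stepped)   # rows: [0.5, 0] and [10.5, 10]
```

In accumulation mode every batch seen so far contributes to the centers, so later batches move them less and less. With a fixed lr, the moving average keeps responding to recent batches, which may suit drifting data streams better.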
- __init__(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)
Methods
__init__([n_clusters, metric, alpha, scale, ...])
fit(X[, y]) – Train the mini-batch equilibrium k-means estimator.
fit_membership(X[, y]) – Fit to X and return the final membership matrix for training data.
fit_predict(X[, y]) – Fit to X and return hard assignments.
fit_transform(X[, y]) – Fit to data, then transform it.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
membership(X) – Compute soft membership matrix for samples in X.
partial_fit(X_batch[, y]) – Incrementally update the model with a single mini-batch.
predict(X) – Assign each sample in X to the closest learned center.
set_output(*[, transform]) – Set output container.
set_params(**params) – Set the parameters of this estimator.
set_partial_fit_request(*[, X_batch]) – Configure whether metadata should be requested to be passed to the partial_fit method.
transform(X) – Compute distances from samples to cluster centers.
- fit(X, y=None)
Train the mini-batch equilibrium k-means estimator.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – For API consistency.
- Returns:
self – Fitted estimator.
- Return type:
- fit_membership(X, y=None)
Fit to X and return the final membership matrix for training data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (Ignored) – Present for API consistency.
- Returns:
U – Membership matrix U_ computed on the training data.
- Return type:
ndarray of shape (n_samples, n_clusters)
- fit_predict(X, y=None)
Fit to X and return hard assignments.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Present for API consistency.
- Returns:
labels – Hard cluster assignments for the input samples.
- Return type:
ndarray of shape (n_samples,)
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray of shape (n_samples, n_features_new)
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- membership(X)
Compute soft membership matrix for samples in X.
Memberships use the fitted alpha_ and a row-wise normalization of exp(-alpha * d^2_shift), where d^2_shift subtracts each row's minimum for numerical stability.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples for which to compute memberships.
- Returns:
U – Row-stochastic membership matrix (rows sum to 1).
- Return type:
ndarray of shape (n_samples, n_clusters)
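The membership formula above is a softmax over negative scaled squared distances, stabilised by subtracting each row's minimum. A minimal NumPy sketch, assuming squared Euclidean distances and a hypothetical function name soft_membership (the estimator's own handling of the manhattan metric is not covered here):

```python
import numpy as np


def soft_membership(X, centers, alpha):
    """Row-wise normalisation of exp(-alpha * d^2_shift)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)

    # Assumption: d^2 is the squared Euclidean distance, shape (n_samples, n_clusters).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

    # Subtracting the row minimum makes the largest exponent exactly 0,
    # so at least one entry per row is exp(0) = 1 and nothing underflows to 0/0.
    d2_shift = d2 - d2.min(axis=1, keepdims=True)

    U = np.exp(-alpha * d2_shift)
    return U / U.sum(axis=1, keepdims=True)


centers = np.array([[0.0, 0.0], [4.0, 4.0]])
U = soft_membership([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]], centers, alpha=1.0)
print(U.sum(axis=1))  # each row sums to 1
print(U[2])           # the midpoint is split evenly: [0.5 0.5]
```

The shift does not change the result, because the common factor exp(alpha * min) cancels in the normalisation; it only keeps the exponentials in a safe numeric range when alpha or the distances are large.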
- partial_fit(X_batch, y=None)
Incrementally update the model with a single mini-batch.
- Parameters:
X_batch (array-like of shape (batch_size, n_features)) – Mini-batch of samples.
y (Ignored) – For API consistency.
- Returns:
self – Updated estimator.
- Return type:
- predict(X)
Assign each sample in X to the closest learned center.
- Parameters:
X (array-like of shape (n_samples, n_features)) – New samples to assign.
- Returns:
labels – Indices of the nearest centers under the configured metric.
- Return type:
ndarray of shape (n_samples,)
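Nearest-center assignment under either metric reduces to an argmin over a distance matrix. The sketch below is an assumption about what such an assignment looks like, not the estimator's code; nearest_center is a hypothetical name. The centers are copied from the partial_fit example in the class docstring.

```python
import numpy as np


def nearest_center(X, centers, metric="euclidean"):
    """Return, for each row of X, the index of the closest center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]

    if metric == "euclidean":
        # Squared distances have the same argmin as true distances.
        d = (diff ** 2).sum(axis=2)
    elif metric == "manhattan":
        d = np.abs(diff).sum(axis=2)
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    return d.argmin(axis=1)


centers = np.array([[3.52095093, 3.04647593],
                    [0.74811989, 0.65697575]])
print(nearest_center([[0, 0], [4, 4]], centers))  # [1 0]
```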
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) – Configure output of transform and fit_transform.
- "default": Default output format of a transformer
- "pandas": DataFrame output
- "polars": Polars output
- None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_partial_fit_request(*, X_batch: bool | None | str = '$UNCHANGED$') → MiniBatchEKMeans
Configure whether metadata should be requested to be passed to the partial_fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- transform(X)
Compute distances from samples to cluster centers.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to transform.
- Returns:
distances – Pairwise distances to cluster_centers_ using the estimator's metric ('euclidean' or 'manhattan'). For Euclidean, distances are non-squared.
- Return type:
ndarray of shape (n_samples, n_clusters)
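The distance matrix described above can be reproduced with plain NumPy broadcasting. This is an illustrative sketch rather than the estimator's implementation, and center_distances is a hypothetical name. It returns non-squared distances for the Euclidean case, as the Returns entry states.

```python
import numpy as np


def center_distances(X, centers, metric="euclidean"):
    """Pairwise sample-to-center distances, shape (n_samples, n_clusters)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]

    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))   # non-squared L2 distance
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)           # L1 distance
    raise ValueError(f"unsupported metric: {metric!r}")


X = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
print(center_distances(X, centers))               # rows: [0, 3] and [5, 4]
print(center_distances(X, centers, "manhattan"))  # rows: [0, 3] and [7, 4]
```

Broadcasting builds an intermediate array of shape (n_samples, n_clusters, n_features), which is fine for a demonstration but can be memory-heavy for large inputs.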