sklekmeans.MiniBatchEKMeans#
- class sklekmeans.MiniBatchEKMeans(n_clusters=8, *, metric='euclidean', alpha=0.5, scale=2.0, batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)#
Mini-batch Equilibrium K-Means.
Scalable mini-batch optimisation of the equilibrium k-means objective, supporting both an accumulation scheme (`learning_rate=None`) and an online exponential-moving-average update scheme.
- Parameters:
n_clusters (int, default=8) – The number of clusters to form / centers to learn.
metric ({'euclidean', 'manhattan'}, default='euclidean') – Distance metric used for batch distance computations. Manhattan can be more robust to certain outliers but is slower than vectorised Euclidean.
alpha (float or {'dvariance'}, default=0.5) – Equilibrium weighting parameter controlling the sharpness of the soft membership distribution prior to equilibrium correction. If `'dvariance'`, a heuristic value is derived from a subsample of the data (see `init_size`) as `scale / mean(d^2)`, where `d^2` are the squared distances to the subsample mean (see the sketch after this list).
scale (float, default=2.0) – Multiplicative factor in the `'dvariance'` heuristic. Larger values produce a larger effective `alpha`, leading to crisper initial memberships.
batch_size (int, default=256) – Number of samples per mini-batch. A larger batch size reduces the variance of updates but increases per-step cost and memory.
max_epochs (int, default=10) – Maximum number of full passes (epochs) over the training data.
n_init (int, default=1) – Number of random initialisations. The algorithm runs mini-batch optimisation `n_init` times with different seeds (derived from `random_state`) and keeps the run with the lowest internal equilibrium objective (evaluated on the full dataset), which improves robustness to local minima.
init ({'k-means++', 'random'} or ndarray of shape (n_clusters, n_features), default='k-means++') – Initialization method. 'k-means++': probabilistic seeding adapted to the chosen metric. 'random': choose `n_clusters` observations without replacement. ndarray: user-specified initial centers.
init_size (int or None, default=None) – Subsample size used to estimate the `'dvariance'` heuristic. If `None`, a size based on `max(10 * n_clusters, batch_size)` is used (capped at `n_samples`). Ignored when `alpha` is a numeric value.
shuffle (bool, default=True) – Whether to shuffle sample order at the beginning of each epoch. Recommended for i.i.d. data to decorrelate batches.
learning_rate (float or None, default=None) – If `None`, use accumulation mode (centers are the weighted average of all processed batches). If a positive float, perform online exponential-moving-average updates `C_k <- (1 - lr) * C_k + lr * xbar_k`, where `xbar_k` is the weighted mean of cluster `k` in the current batch (see the sketch after this list).
tol (float, default=1e-4) – Convergence tolerance on the Frobenius norm of the center change, scaled by the average feature variance of the dataset (computed once).
reassignment_ratio (float, default=0.0) – Minimum fraction of batch weight a cluster must receive to be updated. Clusters not meeting the threshold accumulate a patience counter (see `reassign_patience`).
reassign_patience (int, default=3) – Number of consecutive batches a cluster can fail the `reassignment_ratio` threshold before it is forcibly reassigned to a far point in the current batch.
verbose (int, default=0) – Verbosity level. `0` is silent; higher values print epoch diagnostics every `print_every` epochs.
monitor_size (int or None, default=1024) – Size of a subsample used to compute an approximate objective for monitoring (stored in `objective_approx_`). If `None`, the full dataset is used (higher cost).
print_every (int, default=1) – Frequency (in epochs) at which progress messages are printed when `verbose > 0`.
use_numba (bool, default=False) – If `True` and `numba` is installed, use a JIT-compiled kernel for the equilibrium weight computation.
numba_threads (int or None, default=None) – Number of threads to request from numba's threading layer (if available). Ignored when numba is not installed or `use_numba=False`.
random_state (int, RandomState instance or None, default=None) – Controls reproducibility of center initialisation, alpha heuristic subsampling and shuffling. Pass an int for deterministic behaviour.
- cluster_centers_#
Final centers.
- Type:
ndarray of shape (n_clusters, n_features)
- labels_#
Hard assignment labels for the training data (available only after calling `fit()`). Not updated by `partial_fit()`.
- Type:
ndarray of shape (n_samples,)
- counts_#
Accumulated weights (accumulation mode).
- Type:
ndarray of shape (n_clusters,)
- sums_#
Accumulated weighted sums (accumulation mode).
- Type:
ndarray of shape (n_clusters, n_features)
- W_, U_#
Final weights and memberships for the full training data (available after calling `fit()`).
- Type:
ndarrays
- n_features_in_#
Number of features seen during the first call to `fit()` or `partial_fit()`. Ensures consistent dimensionality across incremental updates and predictions.
- Type:
int
- fit(X, y=None)#
Run full mini-batch training until convergence or max epochs.
- partial_fit(X_batch, y=None)#
Update model parameters using a single mini-batch.
- predict(X)#
Return hard cluster labels for samples.
- transform(X)#
Return distances from samples to cluster centers.
- membership(X)#
Compute soft membership (row-normalized responsibilities).
- fit_predict(X, y=None)#
Fit the model and return hard labels for X.
- fit_membership(X, y=None)#
Fit the model and return the membership matrix for the training data.
- (Internal helpers: `_init_centers`, `_resolve_alpha`, `_calc_weight`, `_approx_objective` are internal implementation details.)
Notes
The approximate objective is tracked on a monitoring subset when `monitor_size` is not `None` and stored in `objective_approx_`.
Examples
>>> from sklekmeans import MiniBatchEKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 0], [4, 4],
...               [4, 5], [0, 1], [2, 2],
...               [3, 2], [5, 5], [1, -1]])
>>> # manually fit on batches
>>> ekmeans = MiniBatchEKMeans(n_clusters=2,
...                            random_state=0,
...                            batch_size=6)
>>> ekmeans = ekmeans.partial_fit(X[0:6, :])
>>> ekmeans = ekmeans.partial_fit(X[6:12, :])
>>> ekmeans.cluster_centers_
array([[3.47914144, 3.02885195],
       [0.73800796, 0.61514045]])
>>> ekmeans.predict([[0, 0], [4, 4]])
array([1, 0])
>>> # fit on the whole data
>>> ekmeans = MiniBatchEKMeans(n_clusters=2,
...                            random_state=0,
...                            batch_size=6,
...                            max_epochs=10).fit(X)
>>> ekmeans.cluster_centers_
array([[3.51549642, 4.53433897],
       [1.98848002, 0.97403648]])
>>> ekmeans.predict([[0, 0], [4, 4]])
array([1, 0])
- __init__(n_clusters=8, *, metric='euclidean', alpha=0.5, scale=2.0, batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)#
Methods
__init__([n_clusters, metric, alpha, scale, ...])
fit(X[, y]) – Train the mini-batch equilibrium k-means estimator.
fit_membership(X[, y]) – Fit to `X` and return the final membership matrix for training data.
fit_predict(X[, y]) – Fit to `X` and return hard assignments.
fit_transform(X[, y]) – Fit to data, then transform it.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
membership(X) – Compute soft membership matrix for samples in `X`.
partial_fit(X_batch[, y]) – Incrementally update the model with a single mini-batch.
predict(X) – Assign each sample in `X` to the closest learned center.
set_output(*[, transform]) – Set output container.
set_params(**params) – Set the parameters of this estimator.
set_partial_fit_request(*[, X_batch]) – Configure whether metadata should be requested to be passed to the `partial_fit` method.
transform(X) – Compute distances from samples to cluster centers.
- fit(X, y=None)#
Train the mini-batch equilibrium k-means estimator.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – For API consistency.
- Returns:
self – Fitted estimator.
- Return type:
MiniBatchEKMeans
- fit_membership(X, y=None)#
Fit to `X` and return the final membership matrix for training data.
- fit_predict(X, y=None)#
Fit to `X` and return hard assignments.
- fit_transform(X, y=None, **fit_params)#
Fit to data, then transform it.
Fits transformer to `X` and `y` with optional parameters `fit_params` and returns a transformed version of `X`.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray of shape (n_samples, n_features_new); here n_features_new equals n_clusters (distances to centers).
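Because `transform` yields distances to the learned centers, the estimator can serve as a feature transformer inside a scikit-learn pipeline. A sketch assuming standard pipeline compatibility (which the transformer API above implies); the classifier choice is illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklekmeans import MiniBatchEKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Distances to the learned centers become input features for the classifier.
pipe = make_pipeline(
    MiniBatchEKMeans(n_clusters=5, random_state=0),
    LogisticRegression(),
)
pipe.fit(X, y)
print(pipe.predict(X[:3]))
```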
- get_metadata_routing()#
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A `MetadataRequest` encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)#
Get parameters for this estimator.
- membership(X)#
Compute soft membership matrix for samples in `X`.
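A small usage sketch; since memberships are row-normalized responsibilities, each row should sum to one:

```python
import numpy as np
from sklekmeans import MiniBatchEKMeans

X = np.random.default_rng(0).normal(size=(100, 2))
ekmeans = MiniBatchEKMeans(n_clusters=3, random_state=0).fit(X)
U = ekmeans.membership(X)               # shape (n_samples, n_clusters)
print(np.allclose(U.sum(axis=1), 1.0))  # rows are normalized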
- partial_fit(X_batch, y=None)#
Incrementally update the model with a single mini-batch.
- Parameters:
X_batch (array-like of shape (batch_size, n_features)) – Mini-batch of samples.
y (Ignored) – For API consistency.
- Returns:
self – Updated estimator.
- Return type:
MiniBatchEKMeans
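A minimal out-of-core sketch: stream fixed-size chunks through `partial_fit` when the full dataset does not fit in memory. The random chunks stand in for batches read from disk. Note that `labels_` is not refreshed by `partial_fit`; call `predict` to obtain labels:

```python
import numpy as np
from sklekmeans import MiniBatchEKMeans

rng = np.random.default_rng(0)
ekmeans = MiniBatchEKMeans(n_clusters=4, batch_size=256, random_state=0)

for _ in range(50):                    # e.g. one pass over 50 chunks
    chunk = rng.normal(size=(256, 8))  # stand-in for data read from disk
    ekmeans.partial_fit(chunk)

labels = ekmeans.predict(rng.normal(size=(10, 8)))
```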
- predict(X)#
Assign each sample in `X` to the closest learned center.
- set_output(*, transform=None)#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of
transform
andfit_transform
."default"
: Default output format of a transformer"pandas"
: DataFrame output"polars"
: Polars outputNone
: Transform configuration is unchanged
Added in version 1.4:
"polars"
option was added.- Returns:
self – Estimator instance.
- Return type:
estimator instance
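For example, to get distance matrices back as a pandas DataFrame (assumes pandas is installed):

```python
import numpy as np
from sklekmeans import MiniBatchEKMeans

X = np.random.default_rng(0).normal(size=(50, 3))
ekmeans = MiniBatchEKMeans(n_clusters=2, random_state=0)
ekmeans.set_output(transform="pandas")
D = ekmeans.fit_transform(X)  # DataFrame of distances to the two centers
```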
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as `Pipeline`). The latter have parameters of the form `<component>__<parameter>` so that it's possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
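Typical usage, including the nested `<component>__<parameter>` form inside a `Pipeline` (the step name assumed here follows `make_pipeline`'s lowercased-class-name convention):

```python
from sklearn.pipeline import make_pipeline
from sklekmeans import MiniBatchEKMeans

est = MiniBatchEKMeans(n_clusters=8)
est.set_params(n_clusters=4, alpha='dvariance', batch_size=512)

# Nested form: address the step inside a pipeline by name.
pipe = make_pipeline(MiniBatchEKMeans(n_clusters=8))
pipe.set_params(minibatchekmeans__n_clusters=4)
```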
- set_partial_fit_request(*, X_batch: bool | None | str = '$UNCHANGED$') → MiniBatchEKMeans#
Configure whether metadata should be requested to be passed to the `partial_fit` method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with `enable_metadata_routing=True` (see `sklearn.set_config()`). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- `True`: metadata is requested, and passed to `partial_fit` if provided. The request is ignored if metadata is not provided.
- `False`: metadata is not requested and the meta-estimator will not pass it to `partial_fit`.
- `None`: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- `str`: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (`sklearn.utils.metadata_routing.UNCHANGED`) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- transform(X)#
Compute distances from samples to cluster centers.
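The returned matrix has one column per center, so taking the argmin over columns should reproduce the hard assignments from `predict`:

```python
import numpy as np
from sklekmeans import MiniBatchEKMeans

X = np.random.default_rng(0).normal(size=(60, 2))
ekmeans = MiniBatchEKMeans(n_clusters=3, random_state=0).fit(X)
D = ekmeans.transform(X)  # shape (n_samples, n_clusters)
print(np.array_equal(D.argmin(axis=1), ekmeans.predict(X)))
```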