sklekmeans.EKMeans#
- class sklekmeans.EKMeans(n_clusters=8, *, metric='euclidean', alpha=0.5, scale=2.0, max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)#
Equilibrium K-Means clustering.
A robust variant of k-means designed for imbalanced datasets. The method uses an equilibrium weighting scheme parameterised by alpha. For alpha='dvariance', a heuristic based on the data variance is used.
- Parameters:
n_clusters (int, default=8) – The number of clusters to form as well as the number of centroids to generate.
metric ({'euclidean', 'manhattan'}, default='euclidean') – Distance metric used both to assign points to clusters and to update centers. Manhattan distance can be more robust to outliers in some settings but increases cost relative to vectorised squared Euclidean computations.
alpha (float or {'dvariance'}, default=0.5) – Equilibrium weighting parameter. If set to the string 'dvariance', a heuristic value scale / mean(d^2) is computed, where d^2 are the squared distances to the global mean.
scale (float, default=2.0) – Multiplicative factor applied in the 'dvariance' heuristic. Higher values yield a larger effective alpha, resulting in crisper assignments.
max_iter (int, default=300) – Maximum number of EM-like update iterations for a single initialisation.
tol (float, default=1e-4) – Relative tolerance (scaled by the average feature variance of the data) on the Frobenius norm of the change in cluster_centers_, used to declare convergence.
n_init (int, default=1) – Number of random initialisations to perform. The run with the lowest internal equilibrium objective is retained. Increasing n_init improves robustness to local minima at additional computational cost.
init ({'k-means++', 'random'} or ndarray of shape (n_clusters, n_features), default='k-means++') – Method for initialization.
* 'k-means++' : use a probabilistic seeding adapted for the chosen metric.
* 'random' : choose n_clusters observations at random.
* ndarray : user-provided initial centers.
random_state (int, RandomState instance or None, default=None) – Controls the randomness of initial center selection and the heuristic alpha sampling (when applicable). Pass an int for reproducible results.
use_numba (bool, default=False) – If True and numba is installed (see the [speed] extra), use a JIT-compiled kernel for weight computation.
numba_threads (int or None, default=None) – If provided, sets the number of threads used by numba parallel sections. Ignored if numba is unavailable or use_numba is False.
verbose (int, default=0) – Verbosity level. 0 is silent; higher values print progress each iteration.
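Two of the parameters above are defined by small formulas: the 'dvariance' heuristic for alpha and the variance-scaled tolerance tol. The following is a minimal NumPy sketch of both, not the library's internal code. The helper names dvariance_alpha and has_converged are hypothetical, squared Euclidean distances are assumed for d^2, and it is an assumption that the unsquared Frobenius norm is compared against the threshold.

```python
import numpy as np


def dvariance_alpha(X, scale=2.0):
    """Hypothetical helper: alpha = scale / mean(d^2).

    d^2 is the squared distance of each sample to the global mean
    (squared Euclidean distance is assumed here).
    """
    X = np.asarray(X, dtype=float)
    d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    return scale / d2.mean()


def has_converged(old_centers, new_centers, X, tol=1e-4):
    """Hypothetical helper: variance-scaled convergence test.

    The threshold is tol times the average per-feature variance of X;
    the shift is the Frobenius norm of the change in the centers.
    """
    X = np.asarray(X, dtype=float)
    threshold = tol * X.var(axis=0).mean()
    shift = np.linalg.norm(
        np.asarray(new_centers, dtype=float) - np.asarray(old_centers, dtype=float)
    )
    return bool(shift <= threshold)


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centers = np.array([[10.0, 2.0], [1.0, 2.0]])

alpha = dvariance_alpha(X, scale=2.0)
print(alpha)
print(has_converged(centers, centers + 1e-5, X))  # tiny shift
print(has_converged(centers, centers + 0.1, X))   # large shift
```

Because the heuristic divides by the mean squared spread of the data, doubling scale doubles the resulting alpha, and rescaling the features changes both alpha and the convergence threshold.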
- cluster_centers_#
Final cluster centers.
- Type:
ndarray of shape (n_clusters, n_features)
- labels_#
Hard assignment labels for training data.
- Type:
ndarray of shape (n_samples,)
- W_#
Equilibrium weights after fitting.
- Type:
ndarray of shape (n_samples, n_clusters)
- U_#
Membership matrix (soft assignments based on exp(-alpha * d^2)). Each row sums to 1.
- Type:
ndarray of shape (n_samples, n_clusters)
- n_features_in_#
Number of features seen during fit(). Set by the first call to fit() and used for input validation in subsequent operations.
- Type:
int
- fit(X, y=None)#
Fit the model and learn cluster centers.
- predict(X)#
Return the hard cluster label (nearest center) for each sample.
- transform(X)#
Return matrix of distances from samples to cluster centers.
- fit_predict(X, y=None)#
Fit the model and return training labels in one pass.
- membership(X)#
Compute soft membership (row-normalized responsibilities).
- fit_membership(X, y=None)#
Fit the model and return the training membership matrix.
- (The helpers `_resolve_alpha`, `_init_centers`, `_calc_weight` and `_objective` are internal and not part of the public API.)
Notes
The average complexity is roughly \(O(k^2 n T)\) due to the weight update per iteration, where k is the number of clusters, n the number of samples and T the number of iterations. The algorithm can fall into local minima; using n_init>1 is recommended.
Examples
>>> from sklekmeans import EKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> ekmeans = EKMeans(n_clusters=2, random_state=0, n_init=1).fit(X)
>>> ekmeans.labels_
array([1, 1, 1, 0, 0, 0])
>>> ekmeans.predict([[0, 0], [12, 3]])
array([1, 0])
>>> ekmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
- __init__(n_clusters=8, *, metric='euclidean', alpha=0.5, scale=2.0, max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)#
Methods
__init__([n_clusters, metric, alpha, scale, ...])
fit(X[, y]) – Compute Equilibrium K-Means clustering.
fit_membership(X[, y]) – Fit the model and return the membership matrix for training data.
fit_predict(X[, y]) – Fit the model to X and return cluster indices.
fit_transform(X[, y]) – Fit to data, then transform it.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
membership(X) – Return membership (soft assignment) matrix.
predict(X) – Predict the closest cluster index for each sample in X.
set_output(*[, transform]) – Set output container.
set_params(**params) – Set the parameters of this estimator.
transform(X) – Compute distances of samples to each cluster center.
- fit(X, y=None)#
Compute Equilibrium K-Means clustering.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training instances.
y (Ignored) – Present for API consistency.
- Returns:
self – Fitted estimator.
- Return type:
EKMeans
- fit_membership(X, y=None)#
Fit the model and return the membership matrix for training data.
- fit_predict(X, y=None)#
Fit the model to X and return cluster indices.
Equivalent to calling fit(X) followed by predict(X) but more efficient.
- fit_transform(X, y=None, **fit_params)#
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray of shape (n_samples, n_features_new)
- get_metadata_routing()#
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)#
Get parameters for this estimator.
- membership(X)#
Return membership (soft assignment) matrix.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
- Returns:
U – Row-stochastic soft assignment matrix (rows sum to 1).
- Return type:
ndarray of shape (n_samples, n_clusters)
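The U_ attribute is described as soft assignments based on exp(-alpha * d^2) with rows summing to 1. A minimal NumPy sketch of that normalisation follows. The function name soft_membership is hypothetical, a numeric alpha and squared Euclidean distances are assumed, and the library may compute the same quantity differently (for example under the Manhattan metric).

```python
import numpy as np


def soft_membership(X, centers, alpha=0.5):
    """Hypothetical helper: row-normalised exp(-alpha * d^2)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every sample to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -alpha * d2
    # Subtract the row maximum before exponentiating to avoid underflow
    # in every entry; this does not change the normalised result.
    logits -= logits.max(axis=1, keepdims=True)
    U = np.exp(logits)
    return U / U.sum(axis=1, keepdims=True)


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centers = np.array([[10.0, 2.0], [1.0, 2.0]])
U = soft_membership(X, centers, alpha=0.5)
print(U.shape)
print(U.sum(axis=1))
```

Larger alpha pushes each row of U towards a one-hot vector, which matches the description of larger alpha giving crisper assignments.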
- predict(X)#
Predict the closest cluster index for each sample in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – New samples.
- Returns:
labels – Index of the closest learned cluster center for each sample.
- Return type:
ndarray of shape (n_samples,)
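Hard prediction as described above amounts to picking the index of the nearest center. The sketch below reproduces that rule in plain NumPy for the Euclidean case. The function nearest_center is a hypothetical stand-in for illustration, not the library's predict, and the centers are taken from the class example.

```python
import numpy as np


def nearest_center(X, centers):
    """Hypothetical helper: index of the closest center per sample."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distances; argmin is the same as for unsquared ones.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


centers = np.array([[10.0, 2.0], [1.0, 2.0]])
labels = nearest_center([[0, 0], [12, 3]], centers)
print(labels)
```

With these centers, [0, 0] is closest to center 1 and [12, 3] to center 0, which agrees with the predict output shown in the class example.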
- set_output(*, transform=None)#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) – Configure output of transform and fit_transform.
* "default" : Default output format of a transformer
* "pandas" : DataFrame output
* "polars" : Polars output
* None : Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
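The <component>__<parameter> naming can be illustrated without any estimator at all. The sketch below only shows how such keys separate into an estimator's own parameters and per-component parameters; it is not the actual set_params implementation. The function split_nested_params and the component name "cluster" are hypothetical.

```python
def split_nested_params(params):
    """Hypothetical helper: split keys on the first double underscore."""
    own, nested = {}, {}
    for key, value in params.items():
        name, delim, sub_key = key.partition("__")
        if delim:
            # "cluster__alpha" -> component "cluster", parameter "alpha"
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params(
    {"verbose": 1, "cluster__n_clusters": 3, "cluster__alpha": "dvariance"}
)
print(own)
print(nested)
```

In a pipeline whose clustering step is named "cluster", keys such as cluster__n_clusters would be forwarded to that step, while plain keys apply to the outer object.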
- transform(X)#
Compute distances of samples to each cluster center.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to transform.
- Returns:
distances – Pairwise distances to cluster_centers_ using the configured metric.
- Return type:
ndarray of shape (n_samples, n_clusters)
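The shape and meaning of the transform output can be sketched in NumPy for the two documented metrics. The function center_distances is hypothetical, and it is an assumption that the Euclidean case returns unsquared distances; the documentation above only says "distances".

```python
import numpy as np


def center_distances(X, centers, metric="euclidean"):
    """Hypothetical helper: (n_samples, n_clusters) distance matrix."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unsupported metric: {metric!r}")


centers = np.array([[10.0, 2.0], [1.0, 2.0]])
X_new = np.array([[0, 0], [12, 3]])
D_euclid = center_distances(X_new, centers, metric="euclidean")
D_manhattan = center_distances(X_new, centers, metric="manhattan")
print(D_euclid)
print(D_manhattan)
```

Each row corresponds to one sample and each column to one center, so taking the argmin along axis 1 recovers the hard labels.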