sklekmeans.EKMeans
- class sklekmeans.EKMeans(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)
Equilibrium K-Means clustering.
A robust variant of k-means designed for imbalanced datasets. The method uses an equilibrium weighting scheme parameterised by alpha. For alpha='dvariance', a heuristic based on the data variance is used.
- Parameters:
n_clusters (int, default=8) – The number of clusters to form as well as the number of centroids to generate.
metric ({'euclidean', 'manhattan'}, default='euclidean') – Distance metric used both to assign points to clusters and to update centers. Manhattan distance can be more robust to outliers in some settings but increases cost relative to vectorised squared Euclidean computations.
alpha (float or {'dvariance'}, default='dvariance') – Equilibrium weighting parameter. If set to the string 'dvariance', a heuristic value scale / mean(d^2) is computed, where d^2 are the squared distances to the global mean.
scale (float, default=2.0) – Multiplicative factor applied in the 'dvariance' heuristic. Higher values yield a larger effective alpha, resulting in crisper assignments.
max_iter (int, default=300) – Maximum number of EM-like update iterations for a single initialisation.
tol (float, default=1e-4) – Relative tolerance (scaled by the average feature variance of the data) on the Frobenius norm of the change in cluster_centers_, used to declare convergence.
n_init (int, default=1) – Number of random initialisations to perform. The run with the lowest internal equilibrium objective is retained. Increasing n_init improves robustness to local minima at additional computational cost.
init ({'k-means', 'k-means++', 'random'} or ndarray of shape (n_clusters, n_features), default='k-means++') –
Method for initialization.
- 'k-means': run a short standard k-means (Euclidean) to obtain initial centers. When using a non-Euclidean metric, this serves as a heuristic seeding.
- 'k-means++': probabilistic seeding adapted for the chosen metric.
- 'random': choose n_clusters observations at random.
- ndarray: user-provided initial centers with shape (n_clusters, n_features).
random_state (int, RandomState instance or None, default=None) – Controls the randomness of initial center selection and the heuristic alpha sampling (when applicable). Pass an int for reproducible results.
use_numba (bool, default=False) – If True and numba is installed (see the [speed] extra), use a JIT-compiled kernel for weight computation.
numba_threads (int or None, default=None) – If provided, sets the number of threads used by numba parallel sections. Ignored if numba is unavailable or use_numba is False.
verbose (int, default=0) – Verbosity level. 0 is silent; higher values print progress each iteration.
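As an illustration of the 'dvariance' heuristic described for alpha and scale, the sketch below computes scale / mean(d^2) with NumPy. The helper name dvariance_alpha is hypothetical and not part of sklekmeans. It assumes squared Euclidean distances and uses every sample, whereas the library may subsample the data or treat other metrics differently.

```python
import numpy as np

def dvariance_alpha(X, scale=2.0):
    """Hypothetical helper: alpha = scale / mean squared distance to the global mean."""
    X = np.asarray(X, dtype=float)
    # Squared Euclidean distance of every sample to the global mean.
    d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    return scale / d2.mean()

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
alpha_default = dvariance_alpha(X)            # scale=2.0
alpha_crisper = dvariance_alpha(X, scale=4.0)  # larger scale gives a larger alpha
```

Because alpha is inversely proportional to the spread of the data, this choice keeps alpha * d^2 on a comparable scale when the features are rescaled.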
- cluster_centers_
Final cluster centers.
- Type:
ndarray of shape (n_clusters, n_features)
- labels_
Hard assignment labels for training data.
- Type:
ndarray of shape (n_samples,)
- W_
Equilibrium weights after fitting.
- Type:
ndarray of shape (n_samples, n_clusters)
- U_
Membership matrix (soft assignments based on exp(-alpha * d^2)). Each row sums to 1.
- Type:
ndarray of shape (n_samples, n_clusters)
- n_features_in_
Number of features seen during fit(). Set by the first call to fit() and used for input validation in subsequent operations.
- Type:
int
Notes
The average complexity is roughly \(O(k^2 n T)\) due to the weight update per iteration, where k is the number of clusters, n the number of samples, and T the number of iterations. The algorithm can fall into local minima; using n_init > 1 is recommended.
Examples
>>> from sklekmeans import EKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> ekmeans = EKMeans(n_clusters=2, random_state=0, n_init=1).fit(X)
>>> ekmeans.labels_
array([1, 1, 1, 0, 0, 0])
>>> ekmeans.predict([[0, 0], [12, 3]])
array([1, 0])
>>> ekmeans.cluster_centers_
array([[10.04, 2.],
       [ 0.96, 2.]])
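The Notes recommend n_init > 1 because a single run can stop in a local minimum. The sketch below shows the general best-of-restarts pattern only. It swaps in plain k-means (Lloyd iterations) and its inertia as a stand-in for the equilibrium objective, and lloyd_once and best_of_restarts are hypothetical names, not sklekmeans functions.

```python
import numpy as np

def lloyd_once(X, k, rng, n_iter=20):
    """One plain k-means run from a random start; returns (inertia, centers)."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum(), centers

def best_of_restarts(X, k, n_init, seed=0):
    """Run n_init random restarts and keep the run with the lowest objective."""
    rng = np.random.default_rng(seed)
    best_obj, best_centers = np.inf, None
    for _ in range(n_init):
        obj, centers = lloyd_once(X, k, rng)
        if obj < best_obj:
            best_obj, best_centers = obj, centers
    return best_obj, best_centers

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
obj_1, _ = best_of_restarts(X, k=2, n_init=1)
obj_5, centers_5 = best_of_restarts(X, k=2, n_init=5)
```

With the same seed, the first restart is identical in both calls, so the five-restart objective can never be worse than the single-run one. The cost grows linearly with n_init.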
- __init__(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)
Methods
__init__([n_clusters, metric, alpha, scale, ...])
fit(X[, y]) – Compute Equilibrium K-Means clustering.
fit_membership(X[, y]) – Fit the model and return the membership matrix for training data.
fit_predict(X[, y]) – Fit the model to X and return cluster indices.
fit_transform(X[, y]) – Fit to data, then transform it.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
membership(X) – Return membership (soft assignment) matrix.
predict(X) – Predict the closest cluster index for each sample in X.
set_output(*[, transform]) – Set output container.
set_params(**params) – Set the parameters of this estimator.
transform(X) – Compute distances of samples to each cluster center.
- fit(X, y=None)
Compute Equilibrium K-Means clustering.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training instances.
y (Ignored) – Present for API consistency.
- Returns:
self – Fitted estimator.
- Return type:
- fit_membership(X, y=None)
Fit the model and return the membership matrix for training data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (Ignored) – Present for API consistency.
- Returns:
U – Row-stochastic soft assignment matrix (rows sum to 1).
- Return type:
ndarray of shape (n_samples, n_clusters)
- fit_predict(X, y=None)
Fit the model to X and return cluster indices.
Equivalent to calling fit(X) followed by predict(X), but may be more efficient.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Present for API consistency.
- Returns:
labels – Hard cluster assignments for the input samples.
- Return type:
ndarray of shape (n_samples,)
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray of shape (n_samples, n_features_new)
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- membership(X)
Return membership (soft assignment) matrix.
Memberships are computed from distances using the fitted alpha_ via a row-wise normalization of exp(-alpha * d^2_shift), where d^2_shift = d^2 - min(d^2) per row for numerical stability.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
- Returns:
U – Row-stochastic soft assignment matrix (rows sum to 1).
- Return type:
ndarray of shape (n_samples, n_clusters)
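The row-wise normalization with the per-row shift can be sketched in NumPy as follows. The function soft_membership and the sample distance matrix are hypothetical and only illustrate the formula above; this is not the library's implementation.

```python
import numpy as np

def soft_membership(d2, alpha):
    """Row-normalized exp(-alpha * (d2 - row_min)) for squared distances d2."""
    d2 = np.asarray(d2, dtype=float)
    # Subtracting the row minimum leaves the normalized result unchanged
    # but guarantees at least one exponent equals exp(0) = 1 in every row.
    shifted = d2 - d2.min(axis=1, keepdims=True)
    e = np.exp(-alpha * shifted)
    return e / e.sum(axis=1, keepdims=True)

d2 = np.array([[1.0, 81.0], [100.0, 4.0]])   # (n_samples, n_clusters)
U = soft_membership(d2, alpha=0.1)

# Without the shift, exp(-1e6) underflows to 0 and the row would be 0/0.
U_far = soft_membership(np.array([[1e6, 1e6 + 1.0]]), alpha=1.0)
```

The shift cancels in the ratio, so it changes nothing mathematically; it only prevents every term in a row from underflowing to zero when all distances are large.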
- predict(X)
Predict the closest cluster index for each sample in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – New samples.
- Returns:
labels – Index of the closest learned cluster center for each sample.
- Return type:
ndarray of shape (n_samples,)
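Choosing the closest center amounts to an argmin over per-center distances. The sketch below assumes squared Euclidean distance and reuses the centers printed in the class example; predict_closest is a hypothetical helper, and the library may use the configured metric instead.

```python
import numpy as np

def predict_closest(X, centers):
    """Index of the nearest center for each sample (squared Euclidean)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcast to shape (n_samples, n_clusters, n_features), then reduce.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

centers = np.array([[10.04, 2.0], [0.96, 2.0]])
labels = predict_closest([[0, 0], [12, 3]], centers)
```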
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of transform and fit_transform.
- "default": Default output format of a transformer
- "pandas": DataFrame output
- "polars": Polars output
- None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- transform(X)
Compute distances of samples to each cluster center.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to transform.
- Returns:
distances – Pairwise distances to cluster_centers_ using the configured metric ('euclidean' or 'manhattan'). For Euclidean, this returns non-squared distances (consistent with scikit-learn's convention).
- Return type:
ndarray of shape (n_samples, n_clusters)
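To make the two metrics concrete, the following NumPy sketch computes the (n_samples, n_clusters) distance matrix for both options. distances_to_centers is a hypothetical stand-in written from the description above, not the library's code.

```python
import numpy as np

def distances_to_centers(X, centers, metric="euclidean"):
    """Pairwise sample-to-center distances, shape (n_samples, n_clusters)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        # Non-squared Euclidean distance.
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        # Sum of absolute coordinate differences.
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unsupported metric: {metric!r}")

X = [[0, 0], [3, 4]]
centers = [[0, 0], [3, 0]]
D_euclidean = distances_to_centers(X, centers, metric="euclidean")
D_manhattan = distances_to_centers(X, centers, metric="manhattan")
```

For the sample [3, 4] and the center [0, 0], the Euclidean distance is 5 while the Manhattan distance is 7, which shows how the two metrics rank the same pair differently.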