sklekmeans.EKMeans#

class sklekmeans.EKMeans(n_clusters=8, *, metric='euclidean', alpha=0.5, scale=2.0, max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)#

Equilibrium K-Means clustering.

A robust variant of k-means designed for imbalanced datasets. The method uses an equilibrium weighting scheme parameterised by alpha. With alpha='dvariance', alpha is instead set by a heuristic based on the data variance.

Parameters:
  • n_clusters (int, default=8) – The number of clusters to form as well as the number of centroids to generate.

  • metric ({'euclidean', 'manhattan'}, default='euclidean') – Distance metric used both to assign points to clusters and to update centers. Manhattan distance can be more robust to outliers in some settings but increases cost relative to vectorised squared Euclidean computations.

  • alpha (float or {'dvariance'}, default=0.5) – Equilibrium weighting parameter. If set to the string 'dvariance', alpha is computed heuristically as scale / mean(d^2), where d^2 denotes the squared distances of the samples to the global mean.

  • scale (float, default=2.0) – Multiplicative factor applied in the 'dvariance' heuristic. Higher values yield larger effective alpha resulting in crisper assignments.

  • max_iter (int, default=300) – Maximum number of EM-like update iterations for a single initialisation.

  • tol (float, default=1e-4) – Relative tolerance (scaled by average feature variance of the data) on the Frobenius norm of the change in cluster_centers_ to declare convergence.

  • n_init (int, default=1) – Number of random initialisations to perform. The run with the lowest internal equilibrium objective is retained. Increasing n_init improves robustness to local minima at additional computational cost.

  • init ({'k-means++', 'random'} or ndarray of shape (n_clusters, n_features), default='k-means++') –

    Method for initialization.

    • 'k-means++' : use a probabilistic seeding adapted for the chosen metric.

    • 'random' : choose n_clusters observations at random.

    • ndarray : user provided initial centers.

  • random_state (int, RandomState instance or None, default=None) – Controls the randomness of initial center selection and the heuristic alpha sampling (when applicable). Pass an int for reproducible results.

  • use_numba (bool, default=False) – If True and numba is installed (see [speed] extra), use a JIT-compiled kernel for weight computation.

  • numba_threads (int or None, default=None) – If provided, sets the number of threads used by numba parallel sections. Ignored if numba is unavailable or use_numba is False.

  • verbose (int, default=0) – Verbosity level. 0 is silent; higher values print progress each iteration.
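The 'dvariance' heuristic described for alpha and scale can be sketched in a few lines of NumPy. This is only an illustration of the formula scale / mean(d^2) as stated above, not the library's code: the function name dvariance_alpha is hypothetical, squared Euclidean distances are assumed, and the full dataset is used even though the library may subsample (see random_state).

```python
import numpy as np


def dvariance_alpha(X, scale=2.0):
    """Hypothetical helper: alpha = scale / mean(d^2).

    d^2 are the squared Euclidean distances of the samples to the
    global mean (an assumption; the library may treat other metrics
    differently or work on a subsample).
    """
    X = np.asarray(X, dtype=float)
    global_mean = X.mean(axis=0)
    d2 = ((X - global_mean) ** 2).sum(axis=1)
    return scale / d2.mean()


X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
alpha = dvariance_alpha(X, scale=2.0)
alpha_doubled = dvariance_alpha(X, scale=4.0)
```

Because scale enters as a plain multiplicative factor, doubling it doubles the resulting alpha, which is what makes the assignments crisper.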

cluster_centers_#

Final cluster centers.

Type:

ndarray of shape (n_clusters, n_features)

labels_#

Hard assignment labels for training data.

Type:

ndarray of shape (n_samples,)

n_iter_#

Number of iterations run for the best initialisation.

Type:

int

objective_#

Objective value of the best run.

Type:

float

alpha_#

Resolved alpha value actually used.

Type:

float

W_#

Equilibrium weights after fitting.

Type:

ndarray of shape (n_samples, n_clusters)

U_#

Membership matrix (soft assignments based on exp(-alpha * d^2)). Each row sums to 1.

Type:

ndarray of shape (n_samples, n_clusters)
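As a rough illustration of the U_ description above, the sketch below computes row-normalised memberships proportional to exp(-alpha * d^2). It assumes squared Euclidean distances and uses a hypothetical function name, soft_membership; the library's own implementation may differ in detail.

```python
import numpy as np


def soft_membership(X, centers, alpha):
    """Hypothetical helper: rows proportional to exp(-alpha * d^2), summing to 1."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distances, shape (n_samples, n_clusters).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the row minimum avoids underflow for large alpha * d^2;
    # the shift cancels in the normalisation.
    U = np.exp(-alpha * (d2 - d2.min(axis=1, keepdims=True)))
    U /= U.sum(axis=1, keepdims=True)
    return U


X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
centers = np.array([[10.0, 2.0], [1.0, 2.0]])
U = soft_membership(X, centers, alpha=0.5)
```

Taking the argmax of each row recovers the hard labels, so for well-separated data U is close to a one-hot encoding of labels_.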

n_features_in_#

Number of features seen during fit(). Set by the first call to fit() and used for input validation in subsequent operations.

Type:

int

fit(X, y=None)#

Fit the model and learn cluster centers.

predict(X)#

Return the hard cluster label (nearest center) for each sample.

transform(X)#

Return matrix of distances from samples to cluster centers.

fit_predict(X, y=None)#

Fit the model and return training labels in one pass.

membership(X)#

Compute soft membership (row-normalized responsibilities).

fit_membership(X, y=None)#

Fit the model and return the training membership matrix.

(The helpers `_resolve_alpha`, `_init_centers`, `_calc_weight` and `_objective` are internal and not part of the public API.)

Notes

The average complexity is roughly \(O(k^2 n T)\) due to the weight update per iteration, where k is the number of clusters, n the number of samples and T the number of iterations. The algorithm can fall into local minima; using n_init>1 is recommended.
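The number of iterations T is bounded by max_iter and, in practice, by the tol stopping rule. A minimal sketch of such a rule, following the description of tol (a tolerance scaled by the average feature variance, applied to the Frobenius norm of the change in centers), might look as follows. The function name has_converged is hypothetical, and whether the library compares the norm or its square is an assumption here.

```python
import numpy as np


def has_converged(centers_old, centers_new, X, tol=1e-4):
    """Hypothetical check: Frobenius norm of the center shift vs. scaled tol."""
    X = np.asarray(X, dtype=float)
    # Scale the relative tolerance by the average per-feature variance.
    scaled_tol = tol * np.var(X, axis=0).mean()
    # For a 2-D array, np.linalg.norm defaults to the Frobenius norm.
    shift = np.linalg.norm(np.asarray(centers_new) - np.asarray(centers_old))
    return bool(shift <= scaled_tol)


X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
centers = np.array([[10.0, 2.0], [1.0, 2.0]])
unchanged = has_converged(centers, centers.copy(), X)
moved = has_converged(centers, centers + 1.0, X)
```

Scaling by the data variance keeps the criterion meaningful when features are measured in very large or very small units.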

Examples

>>> from sklekmeans import EKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> ekmeans = EKMeans(n_clusters=2, random_state=0, n_init=1).fit(X)
>>> ekmeans.labels_
array([1, 1, 1, 0, 0, 0])
>>> ekmeans.predict([[0, 0], [12, 3]])
array([1, 0])
>>> ekmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])

__init__(n_clusters=8, *, metric='euclidean', alpha=0.5, scale=2.0, max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)#

Methods

__init__([n_clusters, metric, alpha, scale, ...])

fit(X[, y])

Compute Equilibrium K-Means clustering.

fit_membership(X[, y])

Fit the model and return the membership matrix for training data.

fit_predict(X[, y])

Fit the model to X and return cluster indices.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

membership(X)

Return membership (soft assignment) matrix.

predict(X)

Predict the closest cluster index for each sample in X.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Compute distances of samples to each cluster center.

fit(X, y=None)#

Compute Equilibrium K-Means clustering.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training instances.

  • y (Ignored) – Present for API consistency.

Returns:

self – Fitted estimator.

Return type:

object

fit_membership(X, y=None)#

Fit the model and return the membership matrix for training data.

fit_predict(X, y=None)#

Fit the model to X and return cluster indices.

Equivalent to calling fit(X) followed by predict(X) but more efficient.

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

get_metadata_routing()#

Get metadata routing of this object.

See the scikit-learn User Guide for how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

membership(X)#

Return membership (soft assignment) matrix.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.

Returns:

U – Row-stochastic soft assignment matrix (rows sum to 1).

Return type:

ndarray of shape (n_samples, n_clusters)

predict(X)#

Predict the closest cluster index for each sample in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – New samples.

Returns:

labels – Index of the closest learned cluster center for each sample.

Return type:

ndarray of shape (n_samples,)
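The nearest-center rule behind predict can be sketched without the library. The function name nearest_center_labels is hypothetical and the Euclidean metric is assumed; the centers are the ones shown in the Examples section.

```python
import numpy as np


def nearest_center_labels(X, centers):
    """Hypothetical helper: index of the closest center for each sample."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distances, shape (n_samples, n_clusters).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


centers = np.array([[10.0, 2.0], [1.0, 2.0]])
labels = nearest_center_labels([[0, 0], [12, 3]], centers)
```

With these centers the two query points receive labels 1 and 0, matching the predict call in the Examples section.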

set_output(*, transform=None)#

Set output container.

See "Introducing the set_output API" in the scikit-learn documentation for an example of how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
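The <component>__<parameter> convention is easiest to see on a pipeline. The sketch below uses scikit-learn's own KMeans as a stand-in clusterer so that it runs without sklekmeans installed; the step names "scale" and "cluster" are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# KMeans is a stand-in here; any estimator following the
# scikit-learn parameter API should behave the same way.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("cluster", KMeans(n_clusters=8)),
])

# Address a nested parameter as <step name>__<parameter name>.
pipe.set_params(cluster__n_clusters=2, scale__with_mean=False)

# get_params(deep=True) reports nested parameters under the same keys.
params = pipe.get_params(deep=True)
```

The same keys are what grid-search utilities expect, which is why estimators are required to expose every constructor argument through get_params and set_params.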

transform(X)#

Compute distances of samples to each cluster center.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to transform.

Returns:

distances – Pairwise distances to cluster_centers_ using the configured metric.

Return type:

ndarray of shape (n_samples, n_clusters)
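A sample-to-center distance matrix of the shape returned by transform can be sketched as below for the two documented metrics. The function name distances_to_centers is hypothetical, and returning plain (not squared) Euclidean distances is an assumption about the library's behaviour.

```python
import numpy as np


def distances_to_centers(X, centers, metric="euclidean"):
    """Hypothetical helper: (n_samples, n_clusters) distance matrix."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcast to shape (n_samples, n_clusters, n_features).
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unsupported metric: {metric!r}")


X = np.array([[0.0, 0.0]])
centers = np.array([[3.0, 4.0], [0.0, 1.0]])
d_euclid = distances_to_centers(X, centers, metric="euclidean")
d_manhattan = distances_to_centers(X, centers, metric="manhattan")
```

The broadcasted difference array costs O(n_samples * n_clusters * n_features) memory, so for large inputs a chunked loop or scipy.spatial.distance.cdist would be the more economical choice.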