sklekmeans.SSEKM

class sklekmeans.SSEKM(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, theta='auto', max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)

Semi-Supervised Equilibrium K-Means (batch).

Parameters:
  • n_clusters (int, default=8)

  • metric ({'euclidean', 'manhattan'}, default='euclidean')

  • alpha (float or {'dvariance'}, default='dvariance') – Equilibrium weighting parameter (same as EKMeans). If 'dvariance', a heuristic based on data variance is used.

  • scale (float, default=2.0) – Multiplicative factor for the 'dvariance' heuristic.

  • theta (float or {'auto'}, default='auto') – Supervision strength for labeled samples.

    • If a float is provided, it is used directly both in the supervised objective term and in the weight update for labeled rows: W = W_ekm + theta * b * (F_norm - W_ekm).

    • If 'auto', theta = |N| / |S|, where |N| is the total number of samples and |S| is the number of labeled samples (rows of the prior with positive sum). When |S| = 0 (no supervision), theta = 0 and the estimator reduces to EKMeans.

  • max_iter (int, default=300)

  • tol (float, default=1e-4)

  • n_init (int, default=1)

  • init ({'k-means', 'k-means++', 'random'} or ndarray, default='k-means++')

  • random_state (int or None, default=None)

  • use_numba (bool, default=False)

  • numba_threads (int or None, default=None)

  • verbose (int, default=0)

cluster_centers_

Final cluster centers.

Type:

ndarray of shape (n_clusters, n_features)

labels_

Hard assignment labels for training data.

Type:

ndarray of shape (n_samples,)

n_iter_

Number of iterations executed for the best initialisation.

Type:

int

objective_

Objective value (including supervised term when provided) of the best run.

Type:

float

alpha_

Resolved numeric alpha used during fitting.

Type:

float

theta_super_

Resolved numeric supervision strength used during fitting (the value computed when theta='auto', otherwise the float passed as theta).

Type:

float

W_

Final equilibrium weights; for labeled rows, incorporates the prior via the mixing with theta.

Type:

ndarray of shape (n_samples, n_clusters)

U_

Membership matrix based on exp-normalized distances before equilibrium correction. Each row sums to 1.

Type:

ndarray of shape (n_samples, n_clusters)

n_features_in_

Number of features seen during fit().

Type:

int

Notes

  • Supervision is provided via a prior matrix F of shape (n_samples, n_clusters). Pass this as prior_matrix=F to fit(). Rows with all zeros indicate unlabeled samples; otherwise values are class probabilities (labeled rows are row-normalized internally when positive).

  • The objective used for selection across initialisations is sum(U * d^2) + theta * sum(b * (F - U) * d^2), where b is the labeled mask and U are the exp-normalized memberships.

  • The average complexity is roughly O(k^2 n T) due to the weight update per iteration, where k is the number of clusters, n the number of samples and T the number of iterations. The algorithm can fall into local minima; using n_init > 1 is recommended.
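The selection objective in the Notes can be evaluated by hand with a few lines of NumPy. The following is only a sketch under assumptions: the helper name ssekm_objective is hypothetical (it is not part of sklekmeans), d2 is taken to be a precomputed matrix of squared distances, and F is assumed to be already row-normalized on its labeled rows.

```python
import numpy as np


def ssekm_objective(d2, U, F, theta):
    """Evaluate sum(U * d^2) + theta * sum(b * (F - U) * d^2).

    Hypothetical helper for illustration; not the library's implementation.
    """
    d2 = np.asarray(d2, dtype=float)
    U = np.asarray(U, dtype=float)
    F = np.asarray(F, dtype=float)
    # b is the labeled mask: 1 for rows of F with positive sum, else 0.
    b = (F.sum(axis=1, keepdims=True) > 0).astype(float)
    unsupervised = np.sum(U * d2)
    supervised = theta * np.sum(b * (F - U) * d2)
    return unsupervised + supervised


# Two samples, two clusters; only the first sample is labeled.
d2 = np.array([[1.0, 9.0], [4.0, 4.0]])
U = np.array([[0.9, 0.1], [0.5, 0.5]])
F = np.array([[1.0, 0.0], [0.0, 0.0]])
value = ssekm_objective(d2, U, F, theta=2.0)
```

With theta=0 the supervised term vanishes and only sum(U * d^2) remains, which matches the statement that the estimator reduces to EKMeans without supervision.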

Examples

>>> from sklekmeans import SSEKM
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> F = np.array([[0, 1], [0, 0], [0, 0],
...               [1, 0], [0, 0], [0, 0]])
>>> ssekm = SSEKM(n_clusters=2, random_state=0, n_init=1).fit(X, prior_matrix=F)
>>> ssekm.labels_
array([1, 1, 1, 0, 0, 0])
>>> ssekm.predict([[0, 0], [12, 3]])
array([1, 0])
>>> ssekm.cluster_centers_
array([[10.,  1.9],
       [ 1.,  2.]])
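The prior matrix F used above can also be derived from partial integer labels. The sketch below assumes a convention in which -1 marks an unlabeled sample; that convention and both helper names (build_prior_matrix, normalize_labeled_rows) are hypothetical and not part of sklekmeans. The second helper only mimics the documented behaviour that labeled rows are row-normalized while all-zero rows stay zero.

```python
import numpy as np


def build_prior_matrix(partial_labels, n_clusters):
    """One-hot prior from integer labels; -1 (assumed convention) means unlabeled."""
    partial_labels = np.asarray(partial_labels)
    F = np.zeros((partial_labels.shape[0], n_clusters), dtype=float)
    labeled = partial_labels >= 0
    # Put a 1 in the column of the known class for each labeled row.
    F[np.flatnonzero(labeled), partial_labels[labeled]] = 1.0
    return F


def normalize_labeled_rows(F):
    """Scale rows with positive sum to sum to 1; leave all-zero rows untouched."""
    F = np.asarray(F, dtype=float)
    row_sums = F.sum(axis=1, keepdims=True)
    out = np.zeros_like(F)
    np.divide(F, row_sums, out=out, where=row_sums > 0)
    return out


# Reproduces the F of the example: samples 0 and 3 are labeled.
F = build_prior_matrix([1, -1, -1, 0, -1, -1], n_clusters=2)
F_soft = normalize_labeled_rows([[2.0, 2.0], [0.0, 0.0]])
```

The resulting F can then be passed as prior_matrix=F to fit().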
__init__(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, theta='auto', max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)

Methods

__init__([n_clusters, metric, alpha, scale, ...])

fit(X[, y, prior_matrix, F])

Fit the semi-supervised estimator on X.

fit_membership(X[, y, prior_matrix, F])

Fit to X and return the final membership matrix for training data.

fit_predict(X[, y, prior_matrix, F])

Fit the model and return hard labels for X.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

membership(X)

Compute soft membership matrix U for samples in X.

predict(X)

Predict the closest cluster index for each sample in X.

set_fit_request(*[, F, prior_matrix])

Configure whether metadata should be requested to be passed to the fit method.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Compute distances from samples to each cluster center.

fit(X, y=None, *, prior_matrix=None, F=None)

Fit the semi-supervised estimator on X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (Ignored) – Present for API consistency.

  • prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Prior probability matrix providing supervision. Prefer prior_matrix; F is kept for backward compatibility. Rows with all zeros denote unlabeled samples; labeled rows are row-normalized internally when positive. Provide either prior_matrix or F (not both).

  • F (array-like of shape (n_samples, n_clusters), optional) – Prior probability matrix providing supervision. Prefer prior_matrix; F is kept for backward compatibility. Rows with all zeros denote unlabeled samples; labeled rows are row-normalized internally when positive. Provide either prior_matrix or F (not both).

Returns:

self – Fitted estimator.

Return type:

SSEKM

Notes

  • Alpha resolution: if alpha='dvariance', a heuristic value is computed as scale / mean(d^2), where d^2 are squared distances to the global mean under the chosen metric.

  • Theta policy: if theta='auto', theta is set to |N| / |S|, where |N| is the number of samples and |S| is the count of labeled rows (rows with positive sum in the prior). This value is used directly in the supervised objective and in the labeled-row weight blending W = W_ekm + theta * b * (F_norm - W_ekm).

  • On success the following attributes are populated: cluster_centers_, labels_, n_iter_, objective_, alpha_, theta_super_, W_ and U_.
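The three rules in these notes can be traced numerically. The sketch below is an illustration under assumptions, not the library's code: all three function names are hypothetical, the alpha heuristic is shown for the Euclidean metric only, and W_ekm is a made-up weight matrix standing in for the unsupervised equilibrium weights.

```python
import numpy as np


def resolve_alpha_dvariance(X, scale=2.0):
    """alpha = scale / mean(d^2), with d^2 the squared Euclidean distance to the global mean."""
    X = np.asarray(X, dtype=float)
    d2 = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    return scale / d2.mean()


def resolve_theta_auto(F):
    """theta = |N| / |S|; 0 when no row of the prior has a positive sum."""
    F = np.asarray(F, dtype=float)
    n_labeled = int(np.count_nonzero(F.sum(axis=1) > 0))
    return F.shape[0] / n_labeled if n_labeled > 0 else 0.0


def blend_labeled_weights(W_ekm, F_norm, theta):
    """W = W_ekm + theta * b * (F_norm - W_ekm); unlabeled rows are unchanged."""
    W_ekm = np.asarray(W_ekm, dtype=float)
    F_norm = np.asarray(F_norm, dtype=float)
    b = (F_norm.sum(axis=1, keepdims=True) > 0).astype(float)
    return W_ekm + theta * b * (F_norm - W_ekm)


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
F = np.array([[0, 1], [0, 0], [0, 0], [1, 0], [0, 0], [0, 0]])
alpha = resolve_alpha_dvariance(X, scale=2.0)
theta = resolve_theta_auto(F)  # 6 samples, 2 labeled rows

# Made-up weights for two samples; only the first row is labeled.
W = blend_labeled_weights([[0.5, 0.5], [0.2, 0.8]], [[1.0, 0.0], [0.0, 0.0]], theta=0.5)
```

In this toy blend the labeled row moves halfway from W_ekm toward its prior, while the unlabeled row keeps its original weights.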

fit_membership(X, y=None, *, prior_matrix=None, F=None)

Fit to X and return the final membership matrix for training data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (Ignored) – Present for API consistency.

  • prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.

  • F (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.

Returns:

U – Membership matrix U_ computed on the training data.

Return type:

ndarray of shape (n_samples, n_clusters)

fit_predict(X, y=None, *, prior_matrix=None, F=None)

Fit the model and return hard labels for X.

This is equivalent to calling fit(X, ...) followed by predict(X) but may be more efficient.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (Ignored) – Present for API consistency.

  • prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.

  • F (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.

Returns:

labels – Hard cluster assignments for the input samples.

Return type:

ndarray of shape (n_samples,)

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

membership(X)

Compute soft membership matrix U for samples in X.

Memberships are computed from distances using the fitted alpha_ via a row-wise normalization of exp(-alpha * d^2_shift) where d^2_shift = d^2 - min(d^2) per row for numerical stability. The prior matrix is not used at prediction time for this quantity.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.

Returns:

U – Row-stochastic membership matrix (rows sum to 1) reflecting the soft responsibility of each sample to each cluster under the current centers and alpha_.

Return type:

ndarray of shape (n_samples, n_clusters)
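The shifted exp-normalization described for membership() can be sketched as follows. This is a minimal NumPy illustration, assuming d2 is a precomputed matrix of squared distances to the cluster centers; the function name soft_membership is hypothetical and not part of sklekmeans.

```python
import numpy as np


def soft_membership(d2, alpha):
    """Row-normalize exp(-alpha * (d^2 - min(d^2))) so that each row sums to 1."""
    d2 = np.asarray(d2, dtype=float)
    # Subtracting the per-row minimum keeps the largest exponent at 0,
    # which avoids underflow to an all-zero row for large distances.
    d2_shift = d2 - d2.min(axis=1, keepdims=True)
    E = np.exp(-alpha * d2_shift)
    return E / E.sum(axis=1, keepdims=True)


d2 = np.array([[1.0, 9.0], [4.0, 4.0]])
U = soft_membership(d2, alpha=0.5)
# Adding the same constant to every entry of a row does not change the result.
U_far = soft_membership(d2 + 1000.0, alpha=0.5)
```

The shift cancels in the normalization, so it changes only the numerical behaviour, not the memberships.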

predict(X)

Predict the closest cluster index for each sample in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – New samples to assign.

Returns:

labels – Index of the closest learned cluster center per sample using the configured distance metric. Ties are broken by the first minimum.

Return type:

ndarray of shape (n_samples,)

set_fit_request(*, F: bool | None | str = '$UNCHANGED$', prior_matrix: bool | None | str = '$UNCHANGED$') -> SSEKM

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • F (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for F parameter in fit.

  • prior_matrix (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for prior_matrix parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

transform(X)

Compute distances from samples to each cluster center.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to transform.

Returns:

distances – Pairwise distances to cluster_centers_ computed with the estimator’s metric (‘euclidean’ or ‘manhattan’). For Euclidean, this returns non-squared distances consistent with scikit-learn’s transform convention.

Return type:

ndarray of shape (n_samples, n_clusters)
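The distance computation behind transform(), and the nearest-center rule described for predict(), can be sketched with plain NumPy. The function name distances_to_centers and the center coordinates below are illustrative assumptions, not values or code taken from sklekmeans.

```python
import numpy as np


def distances_to_centers(X, centers, metric="euclidean"):
    """Return an (n_samples, n_clusters) matrix of distances to each center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcast to shape (n_samples, n_clusters, n_features).
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        # Non-squared distances, as stated for transform().
        return np.sqrt(np.sum(diff ** 2, axis=2))
    if metric == "manhattan":
        return np.sum(np.abs(diff), axis=2)
    raise ValueError(f"unsupported metric: {metric!r}")


centers = np.array([[10.0, 2.0], [1.0, 2.0]])  # illustrative centers
X_new = np.array([[0.0, 0.0], [12.0, 3.0]])

D_euclidean = distances_to_centers(X_new, centers)
D_manhattan = distances_to_centers(X_new, centers, metric="manhattan")
# np.argmin returns the first minimum, so ties go to the lowest cluster index.
nearest = np.argmin(D_euclidean, axis=1)
```

Taking the row-wise argmin of the distance matrix gives hard labels in the same way predict() is described: the index of the closest center, with ties resolved by the first minimum.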