sklekmeans.SSEKM#

class sklekmeans.SSEKM(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, theta='auto', max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)#

Semi-Supervised Equilibrium K-Means (batch).

Parameters:

n_clusters (int, default=8)
metric ({'euclidean', 'manhattan'}, default='euclidean')
alpha (float or {'dvariance'}, default='dvariance') – Equilibrium weighting parameter (same as EKMeans). If 'dvariance', a heuristic based on data variance is used.
scale (float, default=2.0) – Multiplicative factor for the 'dvariance' heuristic.
theta (float or {'auto'}, default='auto') – Supervision strength for labeled samples.
- If a float is provided, it is used directly both in the supervised objective term and in the weight update for labeled rows: W = W_ekm + theta * b * (F_norm - W_ekm).
- If 'auto', theta = |N| / |S|, where |N| is the total number of samples and |S| is the number of labeled samples (rows of the prior with positive sum). When |S| = 0 (no supervision), theta = 0 and the estimator reduces to EKMeans.
max_iter (int, default=300)
tol (float, default=1e-4)
n_init (int, default=1)
init ({'k-means', 'k-means++', 'random'} or ndarray, default='k-means++')
random_state (int or None, default=None)
use_numba (bool, default=False)
numba_threads (int or None, default=None)
verbose (int, default=0)
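The 'auto' rule for theta described above can be sketched in a few lines of NumPy. This is only an illustration of the documented formula; `resolve_theta` is a hypothetical helper, not a function exposed by sklekmeans, and the library's internal implementation may differ.

```python
import numpy as np


def resolve_theta(prior, theta="auto"):
    """Hypothetical helper (not part of sklekmeans) mirroring the documented rule."""
    if not isinstance(theta, str):
        return float(theta)  # explicit float: used as given
    prior = np.asarray(prior, dtype=float)
    n_samples = prior.shape[0]                      # |N|
    n_labeled = int(np.sum(prior.sum(axis=1) > 0))  # |S|: rows with positive sum
    return n_samples / n_labeled if n_labeled else 0.0


F = np.array([[0, 1], [0, 0], [0, 0],
              [1, 0], [0, 0], [0, 0]])
theta_auto = resolve_theta(F)                         # 6 samples, 2 labeled rows
theta_unsupervised = resolve_theta(np.zeros((4, 2)))  # no labeled rows
```

With two labeled rows out of six, the sketch gives theta = 3.0; with no labeled rows it gives 0.0, which corresponds to the purely unsupervised case.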

cluster_centers_#

Final cluster centers.

Type:: ndarray of shape (n_clusters, n_features)

labels_#

Hard assignment labels for training data.

Type:: ndarray of shape (n_samples,)

n_iter_#

Number of iterations executed for the best initialisation.

Type:: int

objective_#

Objective value (including supervised term when provided) of the best run.

Type:: float

alpha_#

Resolved numeric alpha used during fitting.

Type:: float

theta_super_#

Resolved numeric supervision strength used during fitting (the value passed as theta, or the value computed when theta='auto').

Type:: float

W_#

Final equilibrium weights; for labeled rows, incorporates the prior via the mixing with theta.

Type:: ndarray of shape (n_samples, n_clusters)

U_#

Membership matrix based on exp-normalized distances before equilibrium correction. Each row sums to 1.

Type:: ndarray of shape (n_samples, n_clusters)

n_features_in_#

Number of features seen during fit().

Type:: int

Notes

Supervision is provided via a prior matrix F of shape (n_samples, n_clusters). Pass this as prior_matrix=F to fit(). Rows with all zeros indicate unlabeled samples; otherwise values are class probabilities (labeled rows are row-normalized internally when positive).
The objective used for selection across initialisations is sum(U * d^2) + theta * sum(b * (F - U) * d^2), where b is the labeled mask and U are exp-normalized memberships.
The average complexity is roughly $O(k^2 n T)$ due to the weight update per iteration, where k is the number of clusters, n the number of samples and T the number of iterations. The algorithm can fall into local minima; using n_init>1 is recommended.
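As a rough illustration of the notes above, the following NumPy sketch row-normalizes the labeled rows of a prior matrix and evaluates the selection objective on toy values. The helper names `normalize_prior` and `selection_objective` are hypothetical, and treating b as a per-row mask broadcast across clusters is an assumption about how the formula is applied.

```python
import numpy as np


def normalize_prior(F):
    """Row-normalize labeled rows; return (F_norm, b) with b the labeled mask."""
    F = np.asarray(F, dtype=float)
    row_sums = F.sum(axis=1, keepdims=True)
    b = row_sums.ravel() > 0                 # labeled rows have positive sum
    F_norm = np.zeros_like(F)
    F_norm[b] = F[b] / row_sums[b]           # unlabeled rows stay all-zero
    return F_norm, b


def selection_objective(U, d2, F_norm, b, theta):
    """sum(U * d^2) + theta * sum(b * (F - U) * d^2), with b applied per row."""
    unsupervised = np.sum(U * d2)
    supervised = theta * np.sum(b[:, None] * (F_norm - U) * d2)
    return unsupervised + supervised


F_norm, b = normalize_prior([[2, 0], [0, 0]])   # first row labeled, second not
U = np.array([[0.5, 0.5], [0.5, 0.5]])          # toy memberships
d2 = np.array([[1.0, 3.0], [2.0, 2.0]])         # toy squared distances
value = selection_objective(U, d2, F_norm, b, theta=2.0)
```

Only the labeled row contributes to the supervised term; the unlabeled row is masked out by b.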

Examples

>>> from sklekmeans import SSEKM
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> F = np.array([[0, 1], [0, 0], [0, 0],
...               [1, 0], [0, 0], [0, 0]])
>>> ssekm = SSEKM(n_clusters=2, random_state=0, n_init=1).fit(X, prior_matrix=F)
>>> ssekm.labels_
array([1, 1, 1, 0, 0, 0])
>>> ssekm.predict([[0, 0], [12, 3]])
array([1, 0])
>>> ssekm.cluster_centers_
array([[10.,  1.9],
       [ 1.,  2.]])

__init__(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, theta='auto', max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)#

Methods

`__init__`([n_clusters, metric, alpha, scale, ...])
`fit`(X[, y, prior_matrix, F])	Fit the semi-supervised estimator on X.
`fit_membership`(X[, y, prior_matrix, F])	Fit to `X` and return the final membership matrix for training data.
`fit_predict`(X[, y, prior_matrix, F])	Fit the model and return hard labels for X.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`membership`(X)	Compute soft membership matrix U for samples in X.
`predict`(X)	Predict the closest cluster index for each sample in X.
`set_fit_request`(*[, F, prior_matrix])	Configure whether metadata should be requested to be passed to the `fit` method.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Compute distances from samples to each cluster center.

fit(X, y=None, *, prior_matrix=None, F=None)#

Fit the semi-supervised estimator on X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Present for API consistency.
prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Prior probability matrix providing supervision. Prefer prior_matrix; F is kept for backward compatibility. Rows with all zeros denote unlabeled samples; labeled rows are row-normalized internally when positive. Provide either prior_matrix or F (not both).
F (array-like of shape (n_samples, n_clusters), optional) – Backward-compatible alias for prior_matrix, with the same semantics: rows with all zeros denote unlabeled samples, and labeled rows are row-normalized internally when positive. Provide either prior_matrix or F (not both).

Returns:

self – Fitted estimator.

Return type:

SSEKM

Notes

Alpha resolution: if alpha='dvariance', a heuristic value is computed as scale / mean(d^2), where d^2 are squared distances to the global mean under the chosen metric.
Theta policy: if theta='auto', theta = |N|/|S|, where |N| is the number of samples and |S| is the count of labeled rows (rows with positive sum in the prior). This value is used directly in the supervised objective and in the labeled-row weight blending W = W_ekm + theta * b * (F_norm - W_ekm).
On success the following attributes are populated: cluster_centers_, labels_, n_iter_, objective_, alpha_, theta_super_, W_ and U_.
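The alpha heuristic and the labeled-row blending described in these notes can be sketched as follows. Both helpers are hypothetical names introduced for illustration; the sketch assumes the Euclidean metric for the distance to the global mean and assumes b is a per-row mask broadcast across clusters.

```python
import numpy as np


def dvariance_alpha(X, scale=2.0):
    """scale / mean(d^2), d^2 = squared Euclidean distance to the global mean."""
    X = np.asarray(X, dtype=float)
    d2 = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    return scale / d2.mean()


def blend_weights(W_ekm, F_norm, b, theta):
    """W = W_ekm + theta * b * (F_norm - W_ekm); only labeled rows change."""
    return W_ekm + theta * b[:, None] * (F_norm - W_ekm)


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
alpha = dvariance_alpha(X)

W_ekm = np.array([[0.2, 0.8], [0.6, 0.4]])   # toy unsupervised weights
F_norm = np.array([[1.0, 0.0], [0.0, 0.0]])  # row 0 labeled, row 1 unlabeled
b = np.array([True, False])
W = blend_weights(W_ekm, F_norm, b, theta=0.5)
```

With theta = 0.5 the labeled row moves halfway from its unsupervised weights toward the prior, while the unlabeled row is left unchanged.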

fit_membership(X, y=None, *, prior_matrix=None, F=None)#

Fit to X and return the final membership matrix for training data.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (Ignored) – Present for API consistency.
prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.
F (array-like of shape (n_samples, n_clusters), optional) – Backward-compatible alias for prior_matrix. Provide only one of them.

Returns:

U – Membership matrix U_ computed on the training data.

Return type:

ndarray of shape (n_samples, n_clusters)

fit_predict(X, y=None, *, prior_matrix=None, F=None)#

Fit the model and return hard labels for X.

This is equivalent to calling fit(X, ...) followed by predict(X) but may be more efficient.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Present for API consistency.
prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.
F (array-like of shape (n_samples, n_clusters), optional) – Backward-compatible alias for prior_matrix. Provide only one of them.

Returns:

labels – Hard cluster assignments for the input samples.

Return type:

ndarray of shape (n_samples,)

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

get_metadata_routing()#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:: routing – A MetadataRequest encapsulating routing information.
Return type:: MetadataRequest

get_params(deep=True)#

Get parameters for this estimator.

Parameters:: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:: params – Parameter names mapped to their values.
Return type:: dict

membership(X)#

Compute soft membership matrix U for samples in X.

Memberships are computed from distances using the fitted alpha_ via a row-wise normalization of exp(-alpha * d^2_shift) where d^2_shift = d^2 - min(d^2) per row for numerical stability. The prior matrix is not used at prediction time for this quantity.

Parameters:: X (array-like of shape (n_samples, n_features)) – Input samples.
Returns:: U – Row-stochastic membership matrix (rows sum to 1) reflecting the soft responsibility of each sample to each cluster under the current centers and alpha_.
Return type:: ndarray of shape (n_samples, n_clusters)
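The shifted exp-normalization described above is the usual stabilised softmax over negative scaled distances. A minimal sketch, assuming the squared distances d^2 are already available as an array (the function name `soft_membership` is hypothetical and not part of the library):

```python
import numpy as np


def soft_membership(d2, alpha):
    """Row-wise normalization of exp(-alpha * (d^2 - min d^2 per row))."""
    d2 = np.asarray(d2, dtype=float)
    shifted = d2 - d2.min(axis=1, keepdims=True)  # smallest entry per row becomes 0
    E = np.exp(-alpha * shifted)                  # largest exponent is exp(0) = 1
    return E / E.sum(axis=1, keepdims=True)


# Very large distances would underflow to 0/0 without the shift.
U = soft_membership([[1e6, 1e6 + 1.0], [0.0, 0.0]], alpha=1.0)
```

Subtracting the row minimum leaves the normalized result unchanged, because the common factor cancels in the ratio, but it guarantees that at least one term in each row equals 1 so the denominator never vanishes.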

predict(X)#

Predict the closest cluster index for each sample in X.

Parameters:: X (array-like of shape (n_samples, n_features)) – New samples to assign.
Returns:: labels – Index of the closest learned cluster center per sample using the configured distance metric. Ties are broken by the first minimum.
Return type:: ndarray of shape (n_samples,)
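The nearest-center rule, including the first-minimum tie-break, can be sketched with `np.argmin`, which returns the first index among equal minima. The helper `nearest_center` and the center values below are illustrative assumptions, and the sketch covers only the Euclidean case.

```python
import numpy as np


def nearest_center(X, centers):
    """Index of the closest center per sample (Euclidean); first minimum wins ties."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]   # shape (n_samples, n_clusters, n_features)
    d2 = np.sum(diff ** 2, axis=2)
    return np.argmin(d2, axis=1)


centers = np.array([[10.0, 2.0], [1.0, 2.0]])    # illustrative centers
labels = nearest_center([[0, 0], [12, 3], [5.5, 2]], centers)
```

The third sample is equidistant from both centers, so it is assigned to index 0.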

set_fit_request(*, F: bool | None | str = '$UNCHANGED$', prior_matrix: bool | None | str = '$UNCHANGED$') → SSEKM#

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

F (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for F parameter in fit.
prior_matrix (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for prior_matrix parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:: **params (dict) – Estimator parameters.
Returns:: self – Estimator instance.
Return type:: estimator instance

transform(X)#

Compute distances from samples to each cluster center.

Parameters:: X (array-like of shape (n_samples, n_features)) – Samples to transform.
Returns:: distances – Pairwise distances to cluster_centers_ computed with the estimator’s metric (‘euclidean’ or ‘manhattan’). For Euclidean, this returns non-squared distances consistent with scikit-learn’s transform convention.
Return type:: ndarray of shape (n_samples, n_clusters)
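For reference, the two distance computations mentioned above can be written directly in NumPy. This is a sketch of the documented behaviour (non-squared Euclidean, or Manhattan), not the library's code; `center_distances` is a hypothetical name.

```python
import numpy as np


def center_distances(X, centers, metric="euclidean"):
    """Distances from each sample to each center, shape (n_samples, n_clusters)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        return np.sqrt(np.sum(diff ** 2, axis=2))  # non-squared distances
    if metric == "manhattan":
        return np.sum(np.abs(diff), axis=2)
    raise ValueError(f"unsupported metric: {metric!r}")


X = np.array([[0.0, 0.0]])
centers = np.array([[3.0, 4.0], [0.0, 0.0]])
D_euclidean = center_distances(X, centers)
D_manhattan = center_distances(X, centers, metric="manhattan")
```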
