sklekmeans.SSEKM
- class sklekmeans.SSEKM(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, theta='auto', max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)
Semi-Supervised Equilibrium K-Means (batch).
- Parameters:
n_clusters (int, default=8)
metric ({'euclidean', 'manhattan'}, default='euclidean')
alpha (float or {'dvariance'}, default='dvariance') – Equilibrium weighting parameter (same as EKMeans). If 'dvariance', a heuristic based on data variance is used.
scale (float, default=2.0) – Multiplicative factor for the 'dvariance' heuristic.
theta (float or {'auto'}, default='auto') –
Supervision strength for labeled samples.
- If a float is provided, it is used directly both in the supervised
objective term and in the weight update for labeled rows:
W = W_ekm + theta * b * (F_norm - W_ekm).
- If 'auto', set theta = |N| / |S| where |N| is the total number of samples and |S| is the number of labeled samples (rows of the prior with positive sum). When |S| = 0 (no supervision), theta = 0 and the estimator reduces to EKMeans.
max_iter (int, default=300)
tol (float, default=1e-4)
n_init (int, default=1)
init ({'k-means', 'k-means++', 'random'} or ndarray, default='k-means++')
random_state (int or None, default=None)
use_numba (bool, default=False)
numba_threads (int or None, default=None)
verbose (int, default=0)
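The labeled-row update quoted under theta, W = W_ekm + theta * b * (F_norm - W_ekm), can be illustrated with a small dependency-free sketch. The helper name blend_weights and the use of plain Python lists are assumptions made for illustration; this is not the library's internal code.

```python
def blend_weights(w_ekm, f_norm, labeled, theta):
    """Apply W = W_ekm + theta * b * (F_norm - W_ekm) row by row.

    `labeled` holds b: 1 for labeled rows, 0 for unlabeled rows.
    Hypothetical helper for illustration; not part of sklekmeans.
    """
    blended = []
    for w_row, f_row, b in zip(w_ekm, f_norm, labeled):
        blended.append([w + theta * b * (f - w) for w, f in zip(w_row, f_row)])
    return blended


# Two samples, two clusters; only the first row is labeled (as cluster 1).
W_ekm = [[0.5, 0.5], [0.25, 0.75]]
F_norm = [[0.0, 1.0], [0.0, 0.0]]
b = [1, 0]

W = blend_weights(W_ekm, F_norm, b, theta=1.0)
print(W)
```

With theta=1.0 the labeled row is replaced by its normalized prior, while the unlabeled row keeps its equilibrium weights because b is 0 there.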
- cluster_centers_
Final cluster centers.
- Type:
ndarray of shape (n_clusters, n_features)
- labels_
Hard assignment labels for training data.
- Type:
ndarray of shape (n_samples,)
- W_
Final equilibrium weights; for labeled rows, incorporates the prior via the mixing with theta.
- Type:
ndarray of shape (n_samples, n_clusters)
- U_
Membership matrix based on exp-normalized distances before equilibrium correction. Each row sums to 1.
- Type:
ndarray of shape (n_samples, n_clusters)
Notes
- Supervision is provided via a prior matrix F of shape (n_samples, n_clusters). Pass this as prior_matrix=F to fit(). Rows with all zeros indicate unlabeled samples; otherwise values are class probabilities (labeled rows are row-normalized internally when positive).
- Objective used for selection across initialisations is: sum(U * d^2) + theta * sum(b * (F - U) * d^2) where b is the labeled mask and U are exp-normalized memberships.
- The average complexity is roughly \(O(k^2 n T)\) due to the weight update per iteration, where k is the number of clusters, n the number of samples and T the number of iterations. The algorithm can fall into local minima; using n_init>1 is recommended.
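To make the selection objective in the notes concrete, the following sketch evaluates sum(U * d^2) + theta * sum(b * (F - U) * d^2) on a toy input. The function name ssekm_objective and the list-based inputs are assumptions for illustration; the estimator may compute the quantity differently internally.

```python
def ssekm_objective(U, F, d2, labeled, theta):
    """Evaluate sum(U * d^2) + theta * sum(b * (F - U) * d^2).

    U, F and d2 are (n_samples, n_clusters) nested lists; `labeled` is the
    0/1 mask b with one entry per sample. Hypothetical helper.
    """
    unsupervised = 0.0
    supervised = 0.0
    for u_row, f_row, d_row, b in zip(U, F, d2, labeled):
        for u, f, d in zip(u_row, f_row, d_row):
            unsupervised += u * d
            supervised += b * (f - u) * d
    return unsupervised + theta * supervised


U = [[0.5, 0.5], [0.25, 0.75]]   # exp-normalized memberships
F = [[1.0, 0.0], [0.0, 0.0]]     # prior: first row labeled, second unlabeled
d2 = [[1.0, 4.0], [8.0, 2.0]]    # squared distances to the two centers
b = [1, 0]

value = ssekm_objective(U, F, d2, b, theta=2.0)
print(value)
```

Here the unsupervised term is 6.0 and the supervised term is -1.5, so the objective with theta=2.0 is 3.0.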
Examples
>>> from sklekmeans import SSEKM
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> F = np.array([[0, 1], [0, 0], [0, 0],
...               [1, 0], [0, 0], [0, 0]])
>>> ssekm = SSEKM(n_clusters=2, random_state=0, n_init=1).fit(X, prior_matrix=F)
>>> ssekm.labels_
array([1, 1, 1, 0, 0, 0])
>>> ssekm.predict([[0, 0], [12, 3]])
array([1, 0])
>>> ssekm.cluster_centers_
array([[10., 1.9],
       [ 1., 2.]])
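In practice, supervision often starts as a partial label vector rather than a ready-made prior matrix. The sketch below shows one way to build such a matrix. The helper make_prior_matrix and the convention that -1 marks an unlabeled sample are assumptions for illustration and are not part of sklekmeans.

```python
def make_prior_matrix(labels, n_clusters, unlabeled=-1):
    """Build a one-hot prior with all-zero rows for unlabeled samples.

    Hypothetical helper; assumes each known label is already a valid
    cluster column index in range(n_clusters).
    """
    prior = []
    for label in labels:
        row = [0.0] * n_clusters
        if label != unlabeled:
            row[label] = 1.0
        prior.append(row)
    return prior


# Same supervision pattern as the example above: samples 0 and 3 are labeled.
partial_labels = [1, -1, -1, 0, -1, -1]
F = make_prior_matrix(partial_labels, n_clusters=2)
print(F)
```

The resulting nested list has the same entries as the F array in the example and could be passed as prior_matrix after conversion with np.asarray.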
- __init__(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, theta='auto', max_iter=300, tol=0.0001, n_init=1, init='k-means++', random_state=None, use_numba=False, numba_threads=None, verbose=0)
Methods
__init__([n_clusters, metric, alpha, scale, ...])
fit(X[, y, prior_matrix, F]) – Fit the semi-supervised estimator on X.
fit_membership(X[, y, prior_matrix, F]) – Fit to X and return the final membership matrix for training data.
fit_predict(X[, y, prior_matrix, F]) – Fit the model and return hard labels for X.
fit_transform(X[, y]) – Fit to data, then transform it.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
membership(X) – Compute soft membership matrix U for samples in X.
predict(X) – Predict the closest cluster index for each sample in X.
set_fit_request(*[, F, prior_matrix]) – Configure whether metadata should be requested to be passed to the fit method.
set_output(*[, transform]) – Set output container.
set_params(**params) – Set the parameters of this estimator.
transform(X) – Compute distances from samples to each cluster center.
- fit(X, y=None, *, prior_matrix=None, F=None)
Fit the semi-supervised estimator on X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Present for API consistency.
prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Prior probability matrix providing supervision. Prefer prior_matrix; F is kept for backward compatibility. Rows with all zeros denote unlabeled samples; labeled rows are row-normalized internally when positive. Provide either prior_matrix or F (not both).
F (array-like of shape (n_samples, n_clusters), optional) – Prior probability matrix providing supervision. Prefer prior_matrix; F is kept for backward compatibility. Rows with all zeros denote unlabeled samples; labeled rows are row-normalized internally when positive. Provide either prior_matrix or F (not both).
- Returns:
self – Fitted estimator.
- Return type:
Notes
- Alpha resolution: if alpha='dvariance', a heuristic value is computed as scale / mean(d^2) where d^2 are squared distances to the global mean under the chosen metric.
- Theta policy: if theta='auto', we set theta = |N|/|S| where |N| is the number of samples and |S| is the count of labeled rows (rows with positive sum in the prior). This value is used directly in the supervised objective and in the labeled-row weight blending W = W_ekm + theta * b * (F_norm - W_ekm).
- On success the following attributes are populated: cluster_centers_, labels_, n_iter_, objective_, alpha_, theta_super_, W_ and U_.
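The alpha and theta resolution rules in the notes above can be sketched in a few lines of plain Python. The helper names resolve_alpha and resolve_theta are hypothetical, the sketch covers the Euclidean metric only, and it does not guard against zero variance; the library's own handling may differ.

```python
def resolve_alpha(X, scale=2.0):
    """Return scale / mean(d^2), d^2 being squared Euclidean distance to the global mean."""
    n_samples = len(X)
    n_features = len(X[0])
    mean = [sum(row[j] for row in X) / n_samples for j in range(n_features)]
    d2 = [sum((x - m) ** 2 for x, m in zip(row, mean)) for row in X]
    return scale / (sum(d2) / n_samples)


def resolve_theta(prior):
    """Return |N| / |S|, or 0.0 when no row of the prior has a positive sum."""
    n_labeled = sum(1 for row in prior if sum(row) > 0)
    return len(prior) / n_labeled if n_labeled else 0.0


# Data and prior taken from the class-level example.
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
F = [[0, 1], [0, 0], [0, 0], [1, 0], [0, 0], [0, 0]]

alpha = resolve_alpha(X, scale=2.0)
theta = resolve_theta(F)
print(alpha, theta)
```

For this data, six samples with two labeled rows give theta = 3.0, and the mean squared distance to the global mean (5.5, 2.0) is 137.5 / 6, so alpha is 12 / 137.5.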
- fit_membership(X, y=None, *, prior_matrix=None, F=None)
Fit to X and return the final membership matrix for training data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (Ignored) – Present for API consistency.
prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.
F (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.
- Returns:
U – Membership matrix U_ computed on the training data.
- Return type:
ndarray of shape (n_samples, n_clusters)
- fit_predict(X, y=None, *, prior_matrix=None, F=None)
Fit the model and return hard labels for X.
This is equivalent to calling fit(X, ...) followed by predict(X) but may be more efficient.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Present for API consistency.
prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.
F (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.
- Returns:
labels – Hard cluster assignments for the input samples.
- Return type:
ndarray of shape (n_samples,)
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray of shape (n_samples, n_features_new)
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- membership(X)
Compute soft membership matrix U for samples in X.
Memberships are computed from distances using the fitted alpha_ via a row-wise normalization of exp(-alpha * d^2_shift) where d^2_shift = d^2 - min(d^2) per row for numerical stability. The prior matrix is not used at prediction time for this quantity.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
- Returns:
U – Row-stochastic membership matrix (rows sum to 1) reflecting the soft responsibility of each sample to each cluster under the current centers and alpha_.
- Return type:
ndarray of shape (n_samples, n_clusters)
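The shifted exp-normalization described for membership can be sketched for a single row of squared distances as follows. The function name soft_membership is an assumption for illustration; it mirrors the formula in the description rather than the library's actual implementation.

```python
import math


def soft_membership(d2_row, alpha):
    """Normalize exp(-alpha * (d^2 - min(d^2))) over one row of squared distances."""
    shift = min(d2_row)  # subtracting the row minimum keeps the largest term at exp(0) = 1
    weights = [math.exp(-alpha * (d2 - shift)) for d2 in d2_row]
    total = sum(weights)
    return [w / total for w in weights]


u_small = soft_membership([0.0, 2.0], alpha=0.5)
# Without the shift, exp(-1e6) would underflow to 0 for every entry of this row.
u_large = soft_membership([1e6, 1e6 + 2.0], alpha=0.5)
print(u_small, u_large)
```

Both calls give the same memberships, since only the differences between squared distances within a row matter after the shift.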
- predict(X)
Predict the closest cluster index for each sample in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – New samples to assign.
- Returns:
labels – Index of the closest learned cluster center per sample using the configured distance metric. Ties are broken by the first minimum.
- Return type:
ndarray of shape (n_samples,)
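The tie-breaking rule stated for predict (the first minimum wins) matches the behaviour of an ordinary argmin. A minimal sketch over precomputed distance rows, using the hypothetical helper name closest_center:

```python
def closest_center(distances):
    """Return the index of the smallest distance; on ties the first minimum wins."""
    # min() returns the first minimal item it encounters, which gives the stated tie rule.
    return min(range(len(distances)), key=distances.__getitem__)


distance_rows = [
    [2.0, 1.0, 1.0],  # tie between clusters 1 and 2 resolves to cluster 1
    [0.5, 3.0, 4.0],
]
labels = [closest_center(row) for row in distance_rows]
print(labels)
```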
- set_fit_request(*, F: bool | None | str = '$UNCHANGED$', prior_matrix: bool | None | str = '$UNCHANGED$') → SSEKM
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
- Returns:
self – The updated object.
- Return type:
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of transform and fit_transform.
- "default": Default output format of a transformer
- "pandas": DataFrame output
- "polars": Polars output
- None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- transform(X)
Compute distances from samples to each cluster center.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to transform.
- Returns:
distances – Pairwise distances to cluster_centers_ computed with the estimator’s metric (‘euclidean’ or ‘manhattan’). For Euclidean, this returns non-squared distances consistent with scikit-learn’s transform convention.
- Return type:
ndarray of shape (n_samples, n_clusters)
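As a rough picture of what transform returns, the sketch below computes sample-to-center distances for the two supported metrics with the standard library. The function name distances_to_centers and the toy centers are assumptions for illustration; the estimator itself operates on arrays and uses its fitted cluster_centers_.

```python
import math


def distances_to_centers(X, centers, metric="euclidean"):
    """Return an (n_samples, n_clusters) nested list of distances."""
    if metric == "euclidean":
        # Non-squared Euclidean distance, as described for transform.
        return [[math.dist(row, center) for center in centers] for row in X]
    if metric == "manhattan":
        return [[sum(abs(x - c) for x, c in zip(row, center)) for center in centers]
                for row in X]
    raise ValueError(f"unsupported metric: {metric!r}")


X = [[0.0, 0.0], [3.0, 4.0]]
centers = [[0.0, 0.0], [3.0, 0.0]]  # hypothetical centers

euclid = distances_to_centers(X, centers)
manhattan = distances_to_centers(X, centers, metric="manhattan")
print(euclid, manhattan)
```

The sample (3, 4) is at Euclidean distance 5 and Manhattan distance 7 from the origin, which shows how the two metrics diverge for the same pair of points.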