sklekmeans.MiniBatchSSEKM

class sklekmeans.MiniBatchSSEKM(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, theta='auto', batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)

Mini-batch SSEKM.

Mini-batch optimisation of the semi-supervised equilibrium k-means objective. Supervision is provided via a prior matrix, using the prior_matrix keyword to fit() and prior_matrix_batch to partial_fit(). Labeled rows in the prior influence weights via the mixing factor theta.

Parameters:
  • n_clusters (int, default=8)

  • metric ({'euclidean', 'manhattan'}, default='euclidean')

  • alpha (float or {'dvariance'}, default='dvariance') – Equilibrium weighting parameter ('dvariance' uses a subsample to estimate a heuristic value scaled by scale).

  • scale (float, default=2.0) – Scaling factor for the heuristic alpha.

  • theta (float or {'auto'}, default='auto') – Supervision strength. 'auto' sets theta = |N| / |S|. Numeric values are used directly in both the objective and the labeled-row weight update.

  • batch_size (int, default=256)

  • max_epochs (int, default=10)

  • n_init (int, default=1)

  • init ({'k-means', 'k-means++', 'random'} or ndarray, default='k-means++')

  • init_size (int or None, default=None)

  • shuffle (bool, default=True)

  • learning_rate (float or None, default=None)

  • tol (float, default=1e-4)

  • reassignment_ratio (float, default=0.0)

  • reassign_patience (int, default=3)

  • verbose (int, default=0)

  • monitor_size (int or None, default=1024)

  • print_every (int, default=1)

  • use_numba (bool, default=False)

  • numba_threads (int or None, default=None)

  • random_state (int or None, default=None)

Attributes:
  • cluster_centers_ (ndarray of shape (n_clusters, n_features)) – Final centers after training.

  • labels_ (ndarray of shape (n_samples,)) – Hard assignment labels for the training data (available after fit()).

  • alpha_ (float) – Resolved alpha value.

  • theta_super_ (float) – Resolved supervision strength used ('auto' or numeric).

  • objective_approx_ (list of float) – Epoch-wise approximate objectives measured on a monitoring subset.

  • counts_ (ndarray of shape (n_clusters,)) – Accumulated batch weights per cluster (accumulation mode; present after fit()).

  • sums_ (ndarray of shape (n_clusters, n_features)) – Accumulated weighted sums per cluster (accumulation mode; present after fit()).

  • W_, U_ (ndarrays) – Final equilibrium weights and memberships for the full training data (set by fit()).

  • n_epochs_ (int) – Number of epochs run in the best initialisation.

  • n_features_in_ (int) – Number of features seen during the first call to fit() or partial_fit().
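
The counts_ and sums_ attributes suggest a running-average update of the centers. The following is a minimal NumPy sketch of that idea, assuming each center can be recovered as sums_ / counts_ and that per-sample cluster weights are already available. The helper name accumulate_batch and the one-hot toy weights are hypothetical and are not taken from the library.

```python
import numpy as np

def accumulate_batch(counts, sums, X_batch, W_batch):
    """Add one batch to the running totals (hypothetical helper)."""
    counts = counts + W_batch.sum(axis=0)   # total weight per cluster
    sums = sums + W_batch.T @ X_batch       # weighted feature sums per cluster
    return counts, sums

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
assign = np.arange(12) % 3                  # toy hard assignments to 3 clusters
W = np.eye(3)[assign]                       # one-hot weights, shape (12, 3)

counts = np.zeros(3)
sums = np.zeros((3, 2))
for start in range(0, len(X), 4):           # three batches of four rows
    stop = start + 4
    counts, sums = accumulate_batch(counts, sums, X[start:stop], W[start:stop])

centers = sums / counts[:, None]            # assumed relation to cluster_centers_
```

With one-hot weights the accumulated centers equal the per-cluster means of all rows seen so far, regardless of how the rows were split into batches.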

Notes

  • Provide the full-dataset prior using prior_matrix to fit(), or mini-batch priors using prior_matrix_batch to partial_fit().

  • Unlabeled rows are all zeros; labeled rows are row-normalized when positive.

  • The monitoring objective returned in objective_approx_ includes the supervised term scaled by theta when a prior is provided.
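
The prior layout described in these notes can be sketched with plain NumPy. The example assumes partial labels are encoded as integers with -1 for unlabeled samples, which is a convention chosen here for illustration. Both helper names are hypothetical, and the normalization only mirrors what the notes say happens internally.

```python
import numpy as np

def build_prior_matrix(labels, n_clusters):
    """One-hot prior from partial labels; -1 marks an unlabeled sample."""
    labels = np.asarray(labels)
    prior = np.zeros((labels.shape[0], n_clusters), dtype=float)
    labeled = labels >= 0
    prior[np.flatnonzero(labeled), labels[labeled]] = 1.0
    return prior

def row_normalize_positive(prior):
    """Scale rows with a positive sum to sum to 1; leave zero rows untouched."""
    prior = np.asarray(prior, dtype=float)
    sums = prior.sum(axis=1, keepdims=True)
    out = np.zeros_like(prior)
    np.divide(prior, sums, out=out, where=sums > 0)
    return out

labels = [0, -1, 1, -1]                  # two labeled, two unlabeled samples
prior = build_prior_matrix(labels, n_clusters=2)
soft = row_normalize_positive([[2.0, 2.0], [0.0, 0.0]])
```

An array shaped like prior, with one row per sample and one column per cluster, is what the prior_matrix keyword of fit() expects according to the signature documented on this page.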

>>> ssekm.cluster_centers_
array([[1.25126245, 0.55312346],
       [3.54580155, 3.51798824]])
>>> ssekm.predict([[0, 0], [4, 4]])
array([0, 1])
__init__(n_clusters=8, *, metric='euclidean', alpha='dvariance', scale=2.0, theta='auto', batch_size=256, max_epochs=10, n_init=1, init='k-means++', init_size=None, shuffle=True, learning_rate=None, tol=0.0001, reassignment_ratio=0.0, reassign_patience=3, verbose=0, monitor_size=1024, print_every=1, use_numba=False, numba_threads=None, random_state=None)

Methods

__init__([n_clusters, metric, alpha, scale, ...])

fit(X[, y, prior_matrix, F])

Train the mini-batch semi-supervised estimator on the full dataset.

fit_membership(X[, y, prior_matrix, F])

Fit to X and return the final membership matrix for training data.

fit_predict(X[, y, prior_matrix, F])

Fit the model and return hard labels for X.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

membership(X)

Soft membership (U) computed from distances using current alpha_.

partial_fit(X_batch[, y, ...])

predict(X)

Predict the closest cluster each sample in X belongs to.

set_fit_request(*[, F, prior_matrix])

Configure whether metadata should be requested to be passed to the fit method.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

set_partial_fit_request(*[, F_batch, ...])

Configure whether metadata should be requested to be passed to the partial_fit method.

transform(X)

Transform X to a cluster-distance space (pairwise distances).

fit(X, y=None, *, prior_matrix=None, F=None)

Train the mini-batch semi-supervised estimator on the full dataset.

Runs multiple epochs of mini-batch updates. Supervision can be provided via prior_matrix (preferred) or F; provide only one. Unlabeled rows are all zeros; labeled rows are row-normalized internally when positive.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (Ignored) – Present for API consistency.

  • prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Prior probability matrix for supervision.

  • F (array-like of shape (n_samples, n_clusters), optional) – Prior probability matrix for supervision. Kept for backward compatibility; prefer prior_matrix and provide only one of them.

Returns:

self – Fitted estimator.

Return type:

MiniBatchSSEKM
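
The epoch and batch structure described above can be pictured with a small generator. This is only a sketch of how batch_size, max_epochs, and shuffle might interact, assuming the prior rows must stay aligned with the data rows. The function name iter_minibatches and its seed argument are hypothetical, and no estimator is called.

```python
import numpy as np

def iter_minibatches(X, prior, batch_size, max_epochs, shuffle=True, seed=0):
    """Yield (epoch, X_batch, prior_batch) with rows kept aligned."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    for epoch in range(max_epochs):
        # A fresh permutation per epoch when shuffling, otherwise natural order.
        order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            yield epoch, X[idx], prior[idx]

X = np.arange(20, dtype=float).reshape(10, 2)
prior = np.zeros((10, 2))
prior[0, 0] = 1.0                       # a single labeled row
batches = list(iter_minibatches(X, prior, batch_size=4, max_epochs=2))
# Each pair could be handed on as partial_fit(X_batch, prior_matrix_batch=prior_batch).
```

Ten samples with batch_size=4 give batches of 4, 4, and 2 rows per epoch, so two epochs produce six batches in total.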

fit_membership(X, y=None, *, prior_matrix=None, F=None)

Fit to X and return the final membership matrix for training data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (Ignored) – Present for API consistency.

  • prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.

  • F (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.

Returns:

U – Membership matrix U_ computed on the training data.

Return type:

ndarray of shape (n_samples, n_clusters)

fit_predict(X, y=None, *, prior_matrix=None, F=None)

Fit the model and return hard labels for X.

Performs full mini-batch training (up to max_epochs) and returns the predicted cluster index for each sample in X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (Ignored) – Present for API consistency.

  • prior_matrix (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.

  • F (array-like of shape (n_samples, n_clusters), optional) – Supervision prior. Prefer prior_matrix; F is kept for backward compatibility. Provide only one of them.

Returns:

labels – Hard cluster assignments for the input samples.

Return type:

ndarray of shape (n_samples,)

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

membership(X)

Soft membership (U) computed from distances using current alpha_.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples for which to compute memberships.

Returns:

U – Row-stochastic membership matrix computed as normalized exp(-alpha * d^2_shift) per row.

Return type:

ndarray of shape (n_samples, n_clusters)
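
The membership formula above can be sketched in NumPy. The example assumes the Euclidean metric and reads d^2_shift as the squared distance minus its row-wise minimum, which is an interpretation rather than something this page states. The function name soft_membership is hypothetical.

```python
import numpy as np

def soft_membership(X, centers, alpha):
    """Row-normalized exp(-alpha * shifted squared distance)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]
    d2 = (diff ** 2).sum(axis=2)                     # squared Euclidean distances
    d2_shift = d2 - d2.min(axis=1, keepdims=True)    # assumed shift: row minimum
    unnorm = np.exp(-alpha * d2_shift)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

centers = np.array([[0.0, 0.0], [4.0, 4.0]])
U = soft_membership([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]], centers, alpha=0.5)
```

Subtracting a per-row constant before exponentiating leaves the normalized result unchanged and keeps the exponentials from underflowing for large distances. The midpoint [2, 2] receives equal membership in both clusters.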

predict(X)

Predict the closest cluster each sample in X belongs to.

Parameters:

X (array-like of shape (n_samples, n_features)) – New samples to assign.

Returns:

labels – Indices of the nearest centers under the configured metric.

Return type:

ndarray of shape (n_samples,)
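
Nearest-center assignment under the two documented metrics can be sketched as follows. The function name nearest_center is hypothetical, and the centers are copied from the example output shown earlier on this page.

```python
import numpy as np

def nearest_center(X, centers, metric="euclidean"):
    """Index of the closest center for each row of X."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=2))
    elif metric == "manhattan":
        dist = np.abs(diff).sum(axis=2)
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    return dist.argmin(axis=1)

centers = np.array([[1.25126245, 0.55312346],
                    [3.54580155, 3.51798824]])
labels = nearest_center([[0, 0], [4, 4]], centers)
labels_l1 = nearest_center([[0, 0], [4, 4]], centers, metric="manhattan")
```

For these two query points both metrics agree with the documented predict output, array([0, 1]).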

set_fit_request(*, F: bool | None | str = '$UNCHANGED$', prior_matrix: bool | None | str = '$UNCHANGED$') → MiniBatchSSEKM

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • F (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for F parameter in fit.

  • prior_matrix (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for prior_matrix parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
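
The <component>__<parameter> form can be illustrated with a scikit-learn Pipeline. The sketch below uses sklearn.cluster.KMeans as a stand-in estimator so that it does not depend on this package. The step names "scale" and "cluster" are arbitrary choices made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two named steps; the names become the prefixes of nested parameters.
pipe = Pipeline([("scale", StandardScaler()), ("cluster", KMeans(n_clusters=8))])

# Update parameters of individual steps through the pipeline itself.
pipe.set_params(cluster__n_clusters=3, scale__with_mean=False)

# deep=True also lists the parameters of the contained estimators.
params = pipe.get_params(deep=True)
```

Because the mechanism comes from the shared scikit-learn estimator base class, the same pattern should apply when MiniBatchSSEKM is the final step of a pipeline.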

set_partial_fit_request(*, F_batch: bool | None | str = '$UNCHANGED$', X_batch: bool | None | str = '$UNCHANGED$', prior_matrix_batch: bool | None | str = '$UNCHANGED$') → MiniBatchSSEKM

Configure whether metadata should be requested to be passed to the partial_fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to partial_fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • F_batch (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for F_batch parameter in partial_fit.

  • X_batch (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_batch parameter in partial_fit.

  • prior_matrix_batch (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for prior_matrix_batch parameter in partial_fit.

Returns:

self – The updated object.

Return type:

object

transform(X)

Transform X to a cluster-distance space (pairwise distances).

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to transform.

Returns:

distances – Pairwise distances to cluster_centers_ using the estimator’s metric.

Return type:

ndarray of shape (n_samples, n_clusters)
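
The cluster-distance space returned by transform can be pictured as one column of distances per center. The following NumPy sketch assumes plain Euclidean and Manhattan distances. The function name cluster_distances and the toy centers are hypothetical.

```python
import numpy as np

def cluster_distances(X, centers, metric="euclidean"):
    """Distance from every row of X to every center, shape (n_samples, n_clusters)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unsupported metric: {metric!r}")

centers = np.array([[0.0, 0.0], [3.0, 4.0]])
points = [[0.0, 0.0], [3.0, 0.0]]
D_euc = cluster_distances(points, centers)
D_man = cluster_distances(points, centers, metric="manhattan")
```

The first point lies on the first center, so its Euclidean row is [0, 5] and its Manhattan row is [0, 7]. The second point is 3 away from the first center and 4 away from the second under both metrics.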