Imbalanced Data Clustering

A simple K-means-type algorithm for clustering imbalanced data

Introduction


Centroid-based clustering algorithms, such as hard K-means (HKM) and fuzzy K-means (FKM), suffer from a learning bias towards large clusters. Their centroids tend to crowd into large clusters, which compromises performance when the true underlying data groups vary in size (i.e., imbalanced data). To address this, we propose a novel K-means-type clustering algorithm, called equilibrium K-means (EKM), whose new objective function mitigates this large-cluster learning bias. Besides being robust to imbalanced data, EKM has the following advantages:

  • Simplicity: it iterates between just two steps.
  • Resource-saving: its time complexity is proportional to the number of data instances.
  • Scalability: it scales to large datasets via batch learning.
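
For readers unfamiliar with the two-step structure of K-means-type algorithms, here is a minimal sketch of the hard K-means (HKM) baseline mentioned above. It is not EKM: EKM's objective function and update rule are not reproduced in this README, so the sketch only illustrates the generic alternation between an assignment step and a centroid update step. The function names (`assign_step`, `update_step`, `hard_kmeans`) and the random initialisation are assumptions made for illustration, and the code is pure Python rather than the repository's MATLAB.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_step(points, centroids):
    """Step 1: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update_step(points, labels, centroids):
    """Step 2: move each centroid to the mean of the points assigned to it."""
    dim = len(points[0])
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Illustrative choice: an empty cluster keeps its old centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centroids


def hard_kmeans(points, k, n_iter=100, seed=0):
    """Plain hard K-means (Lloyd's algorithm), shown as a baseline, not EKM."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # hypothetical initialisation scheme
    labels = None
    for _ in range(n_iter):
        new_labels = assign_step(points, centroids)
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        centroids = update_step(points, labels, centroids)
    return centroids, labels


# Small imbalanced toy set: four points in one group, two in the other.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11)]
toy_centroids, toy_labels = hard_kmeans(toy_points, k=2)
```

Each iteration of this baseline visits every point once per centroid, so its cost grows linearly with the number of instances. On strongly imbalanced data, centroids of such a baseline can end up crowded in the large cluster, which is the bias EKM's objective is designed to mitigate.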

The original paper is available on arXiv.

Example


Implementation


MATLAB code is available here.