Mini Batch K-Means Clustering Algorithm in Machine Learning


Introduction

Clustering is a technique for grouping data points into subgroups such that the points within each subgroup are similar to one another. It is an unsupervised learning technique, so there are no labels or ground truth. Mini Batch K-Means is a variant of the K-Means algorithm that trains on small batches of data drawn at random, rather than on the full dataset held in memory.

In this article, let us understand Mini Batch K-Means in detail. Before moving on to Mini Batch K-Means, let us have a look at K-Means in general.

The K-Means clustering approach

K-Means is an iterative approach that tries to group data points into K separate, non-overlapping subgroups. The points within a cluster are as similar as possible, and the points in different clusters are as dissimilar as possible. Equivalently, the algorithm makes the intra-cluster distances between the points and the centroid of a cluster as small as possible and the inter-cluster distances as large as possible. Each point belongs to exactly one cluster or subgroup.
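To make this concrete, below is a minimal sketch of plain K-Means using scikit-learn. The dataset and parameter values here are placeholders chosen just for illustration, not values from this article's later example.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a small synthetic dataset (sizes and K are illustrative)
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

# Fit K-Means with K = 3; every point ends up in exactly one cluster
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)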

Mini Batch K-Means clustering

The idea behind Mini Batch K-Means is to process small, fixed-size batches of data points that fit in memory. In each iteration, a mini-batch is drawn at random from the dataset, and only the points in that mini-batch are used to update the cluster centroids. This avoids loading the entire dataset at once, as standard K-Means does, which resolves memory issues and lets the algorithm converge faster. Each centroid update is a convex combination of the old centroid and the new data, with a learning rate that is inversely proportional to the number of points already assigned to that centroid, so the rate decreases over iterations. As iterations accumulate, the effect of adding new data shrinks, and convergence is detected when the centroids do not change between two consecutive iterations.
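The update rule can be sketched in a few lines of NumPy. This is an illustrative toy (the function and variable names are ours, not from any library): the centroid is moved as a convex combination of its old position and the incoming point, with a learning rate of 1/count that decays as the centroid accumulates points.

import numpy as np

# A single centroid update: a convex combination of the old centroid and
# the incoming point, with a learning rate of 1/count that decays as the
# centroid accumulates assigned points.
def update_centroid(centroid, point, count):
    lr = 1.0 / count
    return (1 - lr) * centroid + lr * point

c = np.array([0.0, 0.0])
for n, x in enumerate([np.array([2.0, 2.0]), np.array([4.0, 0.0])], start=1):
    c = update_centroid(c, x, n)
print(c)  # the centroid moves less with every additional point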

Working of Mini Batch K-Means clustering

  1. Centroids of the clusters are randomly initialized.

  2. A mini-batch of data is randomly selected from the original dataset.

  3. Each data point in the mini-batch is assigned to the centroid closest to it.

  4. The cluster centroids are updated using the points assigned from the mini-batch.

  5. Steps 2 to 4 are repeated until the centroid positions no longer change.

  6. The final clusters are obtained. (A from-scratch sketch of these steps follows below.)
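The steps above can be sketched from scratch in NumPy. This is an illustrative toy implementation under our own assumptions (all names and parameters here are ours, and for simplicity it runs a fixed number of iterations instead of testing step 5's convergence condition); it is not the scikit-learn code used later in this article.

import numpy as np

def mini_batch_kmeans(X, k, batch_size=40, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iters):
        # Step 2: draw a random mini-batch from the dataset
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Step 3: assign each batch point to its nearest centroid
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        nearest = np.argmin(dists, axis=1)
        # Step 4: move each centroid toward its assigned points with a
        # learning rate that decays as the centroid's count grows
        for x, j in zip(batch, nearest):
            counts[j] += 1
            lr = 1.0 / counts[j]
            centroids[j] = (1 - lr) * centroids[j] + lr * x
    return centroids

print(mini_batch_kmeans(np.random.rand(500, 2), k=3))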

Python Implementation of Mini Batch K-Means

In the example below, we apply K-Means clustering with mini-batches to 2000 data points. The true cluster centers are defined up front to generate the data, and the model is then trained on the data to find the final cluster centers, which are plotted.

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs as blobs
import matplotlib.pyplot as plt
import timeit as t

# Generate 2000 points around four predefined centers
c = [[50, 50], [1900, 0], [1900, 900], [0, 1900]]
data, data_labels = blobs(n_samples=2000, centers=c, cluster_std=200)

# Plot the generated data, colored by the true labels
color = ['pink', 'violet', 'green', 'blue']
for i in range(len(data)):
    plt.scatter(data[i][0], data[i][1], color=color[data_labels[i]], alpha=0.4)
plt.show()

# Fit Mini Batch K-Means with 4 clusters and a batch size of 40, timing the fit
k_means = MiniBatchKMeans(n_clusters=4, batch_size=40)
st = t.default_timer()
k_means.fit(data)
e = t.default_timer()
label_a = k_means.labels_
cnt = k_means.cluster_centers_
print("Time taken : ", e - st)

# Plot the data colored by the predicted labels, with the learned centroids in black
for i in range(len(data)):
    plt.scatter(data[i][0], data[i][1], color=color[label_a[i]], alpha=0.4)
for i in range(len(cnt)):
    plt.scatter(cnt[i][0], cnt[i][1], color='black')
plt.show()

Output

Time taken :  0.01283279599999787

Advantages of Mini Batch K-Means

  • It can handle larger datasets than the standard K-Means algorithm, since only a mini-batch is held in memory at a time.

  • It is computationally less expensive.

  • It converges faster (a rough timing comparison is sketched below).
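To check the speed claim on your own machine, a rough comparison like the following can be run. The dataset size, batch size, and cluster count here are arbitrary choices for this sketch, and exact timings will vary by hardware and scikit-learn version.

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
import timeit

# A larger synthetic dataset so the timing difference is visible
X, _ = make_blobs(n_samples=100000, centers=4, cluster_std=1.0, random_state=0)

# Fit both models on the same data and report the wall-clock fit time
for model in (KMeans(n_clusters=4, n_init=10),
              MiniBatchKMeans(n_clusters=4, batch_size=1024, n_init=10)):
    start = timeit.default_timer()
    model.fit(X)
    print(type(model).__name__, ":", timeit.default_timer() - start, "seconds")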

Conclusion

Mini Batch K-Means is a newer take on traditional K-Means that addresses some of its shortcomings: it uses less memory, handles datasets too large to fit in memory at once, and takes less time to converge.
