K-Means Clustering on Handwritten Digits Data using Scikit-Learn in Python


Introduction

Clustering, which groups similar data points based on shared characteristics, is a prominent technique in unsupervised machine learning, and K-Means is one of its most popular algorithms. K-Means iteratively divides the data into K clusters, where K is a predetermined number, by minimizing the sum of squared distances between the data points and their cluster centroids. In this post, we will look at how to use the Scikit-Learn package in Python to perform K-Means clustering on handwritten digits data.

Definition

K-Means clustering is a straightforward and efficient unsupervised learning algorithm that seeks to partition a dataset into K distinct, non-overlapping clusters. It works by assigning each data point to its nearest centroid, where a centroid is the arithmetic mean of all the points assigned to that cluster. The algorithm then iteratively updates the centroids to reduce the sum of squared distances between the data points and their corresponding centroids. This process is repeated until convergence or until a predetermined number of iterations is reached.
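To make the definition concrete, the sketch below (toy 2-D data chosen purely for illustration) fits Scikit-Learn's KMeans on two obvious groups of points and checks that each final centroid is indeed the arithmetic mean of the points assigned to it:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points (toy data for illustration)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# After convergence, each centroid equals the mean of its assigned points
for k in range(2):
    members = X[kmeans.labels_ == k]
    print(np.allclose(kmeans.cluster_centers_[k], members.mean(axis=0)))  # True
```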

Syntax

from sklearn.cluster import KMeans

# Load the digits dataset
from sklearn.datasets import load_digits
digits = load_digits()

# Create a K-Means clustering model
kmeans = KMeans(n_clusters=K)

# Fit the model to the data
kmeans.fit(digits.data)

# Predict the cluster labels for the data
labels = kmeans.predict(digits.data)
  • Import the required libraries. From the sklearn.cluster package we import the KMeans class, and from the sklearn.datasets package we import the load_digits function, which provides the handwritten digits dataset.

  • Use the load_digits function to load the handwritten digits dataset. This collection contains images of handwritten digits, each an 8x8 pixel image.

  • Create a K-Means clustering model by initialising an instance of the KMeans class. The n_clusters parameter specifies the number of clusters (K) we wish to generate; we can choose any value of K depending on the dataset and problem.

  • Call the fit method, passing the dataset, to fit the K-Means model to the data. This step determines the cluster centroids and assigns each data point to its closest centroid.

  • Use the predict method to obtain the cluster labels for the data points. Each data point receives a label corresponding to the cluster to which it belongs.

Algorithm

  • Step 1 − Initialise K cluster centroids, either randomly or using a specified initialisation method.

  • Step 2 − Assign each data point to its nearest centroid, based on the Euclidean distance.

  • Step 3 − Update the centroids by calculating the mean of all the data points assigned to each cluster.

  • Step 4 − Repeat steps 2 and 3 until convergence or until the allotted number of iterations is reached.

  • Step 5 − Return the final cluster assignments.
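The five steps above can be sketched as a plain-NumPy loop. This is a minimal illustration for clarity, not Scikit-Learn's optimised implementation; the function name kmeans and the seeding scheme are our own:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialise the centroids by picking K distinct data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid with no assigned points is left where it is)
        new_centroids = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                                  else centroids[k] for k in range(K)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 5: return the final cluster assignments
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
labels, centroids = kmeans(X, 2)
print(labels)  # e.g. [0 0 1 1] (cluster numbering depends on the random seed)
```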

Approach

  • Approach 1 − Clustering Handwritten Digits Data

  • Approach 2 − Evaluating Clustering Performance

Approach 1: Clustering Handwritten Digits Data

Example

# Import the necessary libraries
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()

# Create a K-Means clustering model
kmeans = KMeans(n_clusters=10)

# Fit the model to the data
kmeans.fit(digits.data)

# Predict the cluster labels for the data
labels = kmeans.predict(digits.data)

# Visualize the cluster centroids
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks=[], yticks=[])
   axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

# Show the plot
plt.show()

Output

Since there are 10 different digits (0-9) in the dataset, we create a K-Means clustering model by initialising an instance of the KMeans class with n_clusters=10.

We then fit the model to the data using the fit method, which determines the cluster centroids and assigns each data point to its closest centroid.

To visualize the cluster centroids, we reshape them into 8x8 images and plot them using matplotlib. The resulting figure displays a representative image for each cluster.

Finally, we call plt.show() to display the plot at the end of the program.

The code's output is a plot with 10 subplots arranged in a 2x5 grid. Each subplot shows one cluster centroid as a grayscale 8x8 image depicting the average digit image for that cluster. Because the initial positions of the cluster centroids are chosen at random, the clustering outcome may vary slightly between runs; as a result, the centroids that are produced and their arrangement in the plot may differ each time the program is executed.
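If reproducible results are needed, the random_state parameter fixes the random initialisation. A small sketch (n_init is set explicitly to 10, its long-standing default, to keep behaviour consistent across Scikit-Learn versions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# Fixing random_state makes the centroid initialisation deterministic,
# so two identical fits produce identical clusterings
kmeans_a = KMeans(n_clusters=10, random_state=42, n_init=10).fit(digits.data)
kmeans_b = KMeans(n_clusters=10, random_state=42, n_init=10).fit(digits.data)

print(kmeans_a.inertia_ == kmeans_b.inertia_)  # True
```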

Approach 2: Evaluating Clustering Performance

Example

# Import the necessary libraries
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

# Load the digits dataset
digits = load_digits()

# Create a K-Means clustering model
kmeans = KMeans(n_clusters=10)

# Fit the model to the data
kmeans.fit(digits.data)

# Predict the cluster labels for the data
labels = kmeans.predict(digits.data)

# Evaluate the clustering performance
score = silhouette_score(digits.data, labels)
print("Silhouette Score:", score)

Output

Silhouette Score: 0.18185624794421412

After that, we fit a K-Means clustering model to the data with n_clusters set to 10.

The predicted cluster labels for the data points are then stored in the labels variable.

We use the silhouette_score function from sklearn.metrics to assess the clustering performance. This metric measures how well each data point fits its assigned cluster relative to other clusters; its values range from -1 to 1, with higher values indicating better clustering performance.

Finally, we print the silhouette score to assess the quality of the clustering results.

When you run the code, the silhouette score obtained from the K-Means clustering of the digits dataset is printed to the console after the colon; because the initialisation is random, the exact value may vary slightly between runs. The silhouette score ranges from -1 to 1 and evaluates how well each sample in the dataset fits its assigned cluster relative to other clusters: the closer the score is to 1, the more distinct and well-separated the clusters.
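Since the best value of K is not always known in advance, the same metric can be used to compare several candidate values. A small sketch (the candidate values 5, 10, and 15 are our own choice for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

digits = load_digits()

# Compute the silhouette score for a few candidate numbers of clusters
for k in (5, 10, 15):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(digits.data)
    print(k, round(silhouette_score(digits.data, labels), 3))
```

By this criterion, the candidate K with the highest silhouette score gives the best-separated clustering of those tried.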

Conclusion

K-Means clustering is a flexible approach that can uncover hidden patterns and group similar data points together across many kinds of data. By understanding and applying K-Means clustering, you can extract valuable insights from your data and make informed decisions.

Updated on: 13-Oct-2023
