K-Means Clustering on Handwritten Digits Data using Scikit-Learn in Python
K-Means clustering is a popular unsupervised machine learning algorithm that groups similar data points into clusters. In this tutorial, we'll explore how to apply K-Means clustering to handwritten digits data using Scikit-Learn in Python. We'll learn to cluster digit images and evaluate the clustering performance.
What is K-Means Clustering?
K-Means clustering partitions data into K clusters by minimizing the sum of squared distances between data points and their cluster centroids. The algorithm iteratively assigns each data point to the nearest centroid and updates centroids based on the assigned points.
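The objective described above can be verified directly: the sum of squared distances from each point to its assigned centroid is exactly what scikit-learn exposes as the `inertia_` attribute. A minimal check (passing `n_init=10` explicitly here only to keep results stable across scikit-learn versions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10).fit(digits.data)

# Sum of squared distances from each point to its assigned centroid:
# the quantity K-Means minimizes (scikit-learn calls it inertia)
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
sse = np.sum((digits.data - assigned_centroids) ** 2)

print(f"Manual SSE: {sse:.2f}")
print(f"kmeans.inertia_: {kmeans.inertia_:.2f}")
```

The two printed values agree to floating-point precision, confirming that `inertia_` is the minimized sum of squared distances.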
Algorithm Steps
The K-Means algorithm follows these steps:
Step 1: Initialize K cluster centroids, either randomly or with a seeding method such as k-means++
Step 2: Assign each data point to the nearest centroid based on Euclidean distance
Step 3: Update each centroid to the mean of the data points assigned to it
Step 4: Repeat steps 2 and 3 until the assignments converge or the maximum number of iterations is reached
Step 5: Return the final cluster assignments
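As a rough illustration (not scikit-learn's optimized implementation), the steps above can be sketched in plain NumPy:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned points
        # (this sketch assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 5: return the final assignments and centroids
    return labels, centroids
```

This is a sketch under simplifying assumptions; the `KMeans` class used below adds k-means++ seeding, multiple restarts, and handling of edge cases.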
Basic Syntax
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Load dataset
digits = load_digits()

# Create K-Means model
kmeans = KMeans(n_clusters=10, random_state=42)

# Fit and predict
kmeans.fit(digits.data)
labels = kmeans.predict(digits.data)
Clustering Handwritten Digits Data
Let's apply K-Means clustering to the handwritten digits dataset and visualize the cluster centroids:
# Import necessary libraries
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
# Load the digits dataset
digits = load_digits()
print(f"Dataset shape: {digits.data.shape}")
print(f"Number of samples: {digits.data.shape[0]}")
# Create K-Means clustering model with 10 clusters (0-9 digits)
kmeans = KMeans(n_clusters=10, random_state=42)
# Fit the model to the data
kmeans.fit(digits.data)
# Predict cluster labels
labels = kmeans.predict(digits.data)
# Visualize cluster centroids
fig, ax = plt.subplots(2, 5, figsize=(10, 4))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for i, (axi, center) in enumerate(zip(ax.flat, centers)):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap='gray')
    axi.set_title(f'Cluster {i}')
plt.tight_layout()
plt.show()
print(f"Cluster labels for first 10 samples: {labels[:10]}")
Dataset shape: (1797, 64)
Number of samples: 1797
Cluster labels for first 10 samples: [0 1 2 3 4 5 6 7 8 9]
Evaluating Clustering Performance
We can evaluate clustering quality using the silhouette score, which measures how well-separated the clusters are:
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score, adjusted_rand_score
import numpy as np
# Load the digits dataset
digits = load_digits()
# Create and fit K-Means model
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(digits.data)
# Get cluster labels
cluster_labels = kmeans.labels_
# Calculate silhouette score
sil_score = silhouette_score(digits.data, cluster_labels)
print(f"Silhouette Score: {sil_score:.4f}")
# Calculate adjusted rand score (comparing with true labels)
ari_score = adjusted_rand_score(digits.target, cluster_labels)
print(f"Adjusted Rand Index: {ari_score:.4f}")
# Show cluster distribution
unique, counts = np.unique(cluster_labels, return_counts=True)
print(f"\nCluster distribution:")
for cluster, count in zip(unique, counts):
    print(f"Cluster {cluster}: {count} samples")
Silhouette Score: 0.1482
Adjusted Rand Index: 0.6714

Cluster distribution:
Cluster 0: 178 samples
Cluster 1: 180 samples
Cluster 2: 177 samples
Cluster 3: 183 samples
Cluster 4: 181 samples
Cluster 5: 182 samples
Cluster 6: 181 samples
Cluster 7: 179 samples
Cluster 8: 174 samples
Cluster 9: 182 samples
Performance Metrics Explained
| Metric | Range | Interpretation |
|---|---|---|
| Silhouette Score | -1 to 1 | Higher values indicate better clustering |
| Adjusted Rand Index | -1 to 1 | Agreement with the true labels; 0 is roughly random assignment, 1 is a perfect match |
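Because K-Means cluster IDs are arbitrary (cluster 3 is not necessarily digit 3), one common way to relate clusters to true classes is to map each cluster to the most frequent true digit among its members and then measure accuracy. The majority-vote mapping below is an illustrative choice, not part of the scikit-learn API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(digits.data)

# Map each cluster to the most common true digit among its members
mapped = np.zeros_like(cluster_labels)
for c in range(10):
    mask = cluster_labels == c
    # Majority vote: the most frequent true label in this cluster
    mapped[mask] = np.bincount(digits.target[mask]).argmax()

accuracy = (mapped == digits.target).mean()
print(f"Accuracy after label mapping: {accuracy:.4f}")
```

This mapped accuracy complements the Adjusted Rand Index: both compare the clustering against the true digit labels, which are available here but would not be in a genuinely unsupervised setting.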
Key Parameters
Important K-Means parameters include:
n_clusters: Number of clusters to form
random_state: Controls random number generation for reproducible results
max_iter: Maximum number of iterations (default: 300)
init: Initialization method ('k-means++' or 'random')
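A short sketch of setting these parameters explicitly (`n_init`, the number of random restarts from which the best run is kept, is an additional parameter not listed above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# Set the key parameters explicitly
kmeans = KMeans(
    n_clusters=10,      # one cluster per digit class
    init='k-means++',   # smarter seeding than 'random'
    max_iter=300,       # iteration cap per run (the default)
    n_init=10,          # number of restarts; best result kept
    random_state=42,    # reproducible results
)
kmeans.fit(digits.data)

print(f"Iterations until convergence: {kmeans.n_iter_}")
print(f"Inertia (sum of squared distances): {kmeans.inertia_:.2f}")
```

With well-separated data, the algorithm typically converges well before `max_iter`; `n_iter_` reports how many iterations the best run actually used.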
Conclusion
K-Means clustering effectively groups handwritten digits based on pixel similarities, creating meaningful clusters that often correspond to different digit classes. The silhouette score helps evaluate clustering quality, while visualization of centroids provides insights into the learned patterns.
