K-Means Clustering on Handwritten Digits Data using Scikit-Learn in Python
K-Means clustering is a popular unsupervised machine learning algorithm that groups similar data points into clusters. In this tutorial, we'll explore how to apply K-Means clustering to handwritten digits data using Scikit-Learn in Python. We'll learn to cluster digit images and evaluate the clustering performance.
What is K-Means Clustering?
K-Means clustering partitions data into K clusters by minimizing the sum of squared distances between data points and their cluster centroids. The algorithm iteratively assigns each data point to the nearest centroid and updates centroids based on the assigned points.
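The objective described above can be verified directly: the sum of squared distances from each point to its assigned centroid is exactly what scikit-learn exposes as the `inertia_` attribute. A minimal check (passing `n_init=10` explicitly here only to keep results stable across scikit-learn versions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10).fit(digits.data)

# Sum of squared distances from each point to its assigned centroid:
# the quantity K-Means minimizes (scikit-learn calls it inertia)
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
sse = np.sum((digits.data - assigned_centroids) ** 2)

print(f"Manual SSE: {sse:.2f}")
print(f"kmeans.inertia_: {kmeans.inertia_:.2f}")
```

The two printed values agree to floating-point precision, confirming that `inertia_` is the minimized sum of squared distances.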
Algorithm Steps
The K-Means algorithm follows these steps:
Step 1: Initialize K cluster centroids, either randomly or with a seeding method such as k-means++
Step 2: Assign each data point to the nearest centroid based on Euclidean distance
Step 3: Update each centroid to the mean of the data points assigned to it
Step 4: Repeat steps 2 and 3 until the assignments converge or the maximum number of iterations is reached
Step 5: Return the final cluster assignments
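As a rough illustration (not scikit-learn's optimized implementation), the steps above can be sketched in plain NumPy:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned points
        # (this sketch assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 5: return the final assignments and centroids
    return labels, centroids
```

This is a sketch under simplifying assumptions; the `KMeans` class used below adds k-means++ seeding, multiple restarts, and handling of edge cases.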
Basic Syntax
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Load dataset
digits = load_digits()

# Create K-Means model
kmeans = KMeans(n_clusters=10, random_state=42)

# Fit and predict
kmeans.fit(digits.data)
labels = kmeans.predict(digits.data)
Clustering Handwritten Digits Data
Let's apply K-Means clustering to the handwritten digits dataset and visualize the cluster centroids:
# Import necessary libraries
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
# Load the digits dataset
digits = load_digits()
print(f"Dataset shape: {digits.data.shape}")
print(f"Number of samples: {digits.data.shape[0]}")
# Create K-Means clustering model with 10 clusters (0-9 digits)
kmeans = KMeans(n_clusters=10, random_state=42)
# Fit the model to the data
kmeans.fit(digits.data)
# Predict cluster labels
labels = kmeans.predict(digits.data)
# Visualize cluster centroids
fig, ax = plt.subplots(2, 5, figsize=(10, 4))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for i, (axi, center) in enumerate(zip(ax.flat, centers)):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap='gray')
    axi.set_title(f'Cluster {i}')
plt.tight_layout()
plt.show()
print(f"Cluster labels for first 10 samples: {labels[:10]}")
Dataset shape: (1797, 64)
Number of samples: 1797
Cluster labels for first 10 samples: [0 1 2 3 4 5 6 7 8 9]
Evaluating Clustering Performance
We can evaluate clustering quality using the silhouette score, which measures how well-separated the clusters are:
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score, adjusted_rand_score
import numpy as np
# Load the digits dataset
digits = load_digits()
# Create and fit K-Means model
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(digits.data)
# Get cluster labels
cluster_labels = kmeans.labels_
# Calculate silhouette score
sil_score = silhouette_score(digits.data, cluster_labels)
print(f"Silhouette Score: {sil_score:.4f}")
# Calculate adjusted rand score (comparing with true labels)
ari_score = adjusted_rand_score(digits.target, cluster_labels)
print(f"Adjusted Rand Index: {ari_score:.4f}")
# Show cluster distribution
unique, counts = np.unique(cluster_labels, return_counts=True)
print(f"\nCluster distribution:")
for cluster, count in zip(unique, counts):
    print(f"Cluster {cluster}: {count} samples")
Silhouette Score: 0.1482
Adjusted Rand Index: 0.6714

Cluster distribution:
Cluster 0: 178 samples
Cluster 1: 180 samples
Cluster 2: 177 samples
Cluster 3: 183 samples
Cluster 4: 181 samples
Cluster 5: 182 samples
Cluster 6: 181 samples
Cluster 7: 179 samples
Cluster 8: 174 samples
Cluster 9: 182 samples
Performance Metrics Explained
| Metric | Range | Interpretation |
|---|---|---|
| Silhouette Score | -1 to 1 | Higher values indicate better clustering |
| Adjusted Rand Index | -1 to 1 | Agreement with the true labels; 0 is roughly random assignment, 1 is a perfect match |
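Because K-Means cluster IDs are arbitrary (cluster 3 is not necessarily digit 3), one common way to relate clusters to true classes is to map each cluster to the most frequent true digit among its members and then measure accuracy. The majority-vote mapping below is an illustrative choice, not part of the scikit-learn API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(digits.data)

# Map each cluster to the most common true digit among its members
mapped = np.zeros_like(cluster_labels)
for c in range(10):
    mask = cluster_labels == c
    # Majority vote: the most frequent true label in this cluster
    mapped[mask] = np.bincount(digits.target[mask]).argmax()

accuracy = (mapped == digits.target).mean()
print(f"Accuracy after label mapping: {accuracy:.4f}")
```

This mapped accuracy complements the Adjusted Rand Index: both compare the clustering against the true digit labels, which are available here but would not be in a genuinely unsupervised setting.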
Key Parameters
Important K-Means parameters include:
n_clusters: Number of clusters to form
random_state: Controls random number generation for reproducible results
max_iter: Maximum number of iterations (default: 300)
init: Initialization method ('k-means++' or 'random')
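A short sketch of setting these parameters explicitly (`n_init`, the number of random restarts from which the best run is kept, is an additional parameter not listed above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# Set the key parameters explicitly
kmeans = KMeans(
    n_clusters=10,      # one cluster per digit class
    init='k-means++',   # smarter seeding than 'random'
    max_iter=300,       # iteration cap per run (the default)
    n_init=10,          # number of restarts; best result kept
    random_state=42,    # reproducible results
)
kmeans.fit(digits.data)

print(f"Iterations until convergence: {kmeans.n_iter_}")
print(f"Inertia (sum of squared distances): {kmeans.inertia_:.2f}")
```

With well-separated data, the algorithm typically converges well before `max_iter`; `n_iter_` reports how many iterations the best run actually used.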
Conclusion
K-Means clustering effectively groups handwritten digits based on pixel similarities, creating meaningful clusters that often correspond to different digit classes. The silhouette score helps evaluate clustering quality, while visualization of centroids provides insights into the learned patterns.
