Clustering Performance Evaluation in Scikit-Learn


Clustering is a fundamental unsupervised learning technique that aims to discover patterns or groupings in unlabeled data. It plays a crucial role in various domains such as data mining, pattern recognition, and customer segmentation. However, once clustering algorithms are applied, it becomes essential to evaluate their performance and assess the quality of the resulting clusters.

Clustering performance evaluation is a critical step in understanding the effectiveness and reliability of clustering algorithms. It involves quantifying the quality of the obtained clusters and providing insights into their consistency and separability. By evaluating clustering results, practitioners can make informed decisions about algorithm selection, parameter tuning, and interpretability of the discovered clusters.

In this article, we will explore the concept of clustering performance evaluation using the Scikit-Learn library in Python.

To set the stage, we first generate a synthetic dataset with make_blobs() and plot it before any clustering is applied.

Consider the code shown below.

Example

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate random points
features, targets = make_blobs(n_samples=500, centers=5, random_state=42, shuffle=False)

# Create the scatter plot
plt.scatter(features[:, 0], features[:, 1])

# Customize plot appearance
plt.title("Random Points Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# Display the plot
plt.show()

Output

The code displays a scatter plot of the 500 generated points.

K-Means

In the example below, we cluster a synthetic dataset with the k-means algorithm and compute three evaluation scores for the result.

Consider the code shown below.

Example

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Generate sample data
X, y_true = make_blobs(n_samples=500, centers=4, random_state=42)

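# Perform clustering using the k-means algorithm
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)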

# Evaluate clustering performance using metrics
silhouette = silhouette_score(X, y_pred)
calinski_harabasz = calinski_harabasz_score(X, y_pred)
davies_bouldin = davies_bouldin_score(X, y_pred)

# Plot the clustering results
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red', label='Centroids')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

# Print the evaluation scores
print(f"Silhouette Score: {silhouette:.3f}")
print(f"Calinski-Harabasz Index: {calinski_harabasz:.3f}")
print(f"Davies-Bouldin Index: {davies_bouldin:.3f}")

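Output

The code displays a scatter plot of the points colored by assigned cluster, with the four centroids marked by red crosses, and then prints the three evaluation scores.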

Performance Evaluation Indices

Silhouette Score

The Silhouette Score is a widely used metric for evaluating the quality of clustering results. For each data point, it compares the mean distance to the other points in its own cluster (a) with the mean distance to the points in the nearest neighboring cluster (b), giving (b - a) / max(a, b); the overall score is the average over all points. The score ranges from -1 to 1, where a higher value indicates better clustering performance. A value close to 1 suggests that data points are well-clustered and properly separated, a value near 0 indicates overlapping clusters, and a value close to -1 indicates that data points may have been assigned to the wrong clusters. In the code, the Silhouette Score is calculated using the silhouette_score() function.

Consider the code shown below.

Example

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

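# Perform clustering using the k-means algorithm
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)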

# Calculate the Silhouette Score
silhouette = silhouette_score(X, y_pred)

# Print the Silhouette Score
print("Silhouette Score:", silhouette)

Output

Silhouette Score: 0.7911042588289479
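To make the definition concrete, the following is a minimal sketch that recomputes the score by hand and compares it with scikit-learn's result. For each sample it takes a as the mean distance to the other members of its own cluster and b as the smallest mean distance to the members of any other cluster, then averages (b - a) / max(a, b). The helper name manual_silhouette is hypothetical and exists only for this illustration, and Euclidean distance is assumed (the default metric of silhouette_score()).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Same synthetic data and clustering setup as in the example above
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)


def manual_silhouette(X, labels):
    """Per-sample silhouette values (hypothetical helper, for illustration only)."""
    dist = pairwise_distances(X)  # Euclidean distances, zero on the diagonal
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a sample alone in its cluster gets a score of 0
        # a: mean distance to the other members of the same cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


per_sample = manual_silhouette(X, labels)
print("Manual mean silhouette:", per_sample.mean())
print("silhouette_score():    ", silhouette_score(X, labels))
```

The two printed values should agree up to floating-point rounding. The per-sample values are also available directly from silhouette_samples(), which is useful for spotting individual points that sit between clusters.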

Calinski-Harabasz Index

The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, is another performance evaluation metric for clustering. It measures the ratio of between-cluster dispersion to within-cluster dispersion. A higher Calinski-Harabasz Index value indicates better clustering performance, with greater separation between clusters and lower variance within clusters. Unlike the Silhouette Score, the index has no fixed upper bound, so it is most useful for comparing different clusterings of the same dataset. In the code, the Calinski-Harabasz Index is calculated using the calinski_harabasz_score() function.

Consider the code shown below.

Example

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

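# Perform clustering using the k-means algorithm
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)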

# Calculate the Calinski-Harabasz Index
calinski_harabasz = calinski_harabasz_score(X, y_pred)

# Print the Calinski-Harabasz Index
print("Calinski-Harabasz Index:", calinski_harabasz)

Output

Calinski-Harabasz Index: 5742.035759058726
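The dispersion ratio can also be worked out by hand. The sketch below assumes the usual definition of the index: the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k), where n is the number of samples and k the number of clusters. The helper name manual_calinski_harabasz is hypothetical and used only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Same synthetic data and clustering setup as in the example above
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)


def manual_calinski_harabasz(X, labels):
    """Variance ratio criterion computed directly (hypothetical helper)."""
    n_samples = X.shape[0]
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = X.mean(axis=0)

    between = 0.0  # dispersion of cluster centroids around the overall mean
    within = 0.0   # dispersion of samples around their own centroid
    for c in cluster_ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)

    return (between / (k - 1)) / (within / (n_samples - k))


manual_value = manual_calinski_harabasz(X, labels)
library_value = calinski_harabasz_score(X, labels)
print("Manual Calinski-Harabasz:  ", manual_value)
print("calinski_harabasz_score(): ", library_value)
```

Both values should match up to floating-point rounding. Because the numerator grows when centroids move apart and the denominator shrinks when clusters tighten, the ratio rewards compact, well-separated clusters.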

Conclusion

In conclusion, evaluating the performance of clustering algorithms is crucial to assess their effectiveness in grouping data points. In this article, we explored two commonly used performance evaluation metrics: the Silhouette Score and the Calinski-Harabasz Index.

The Silhouette Score measures the cohesion and separation of clusters by comparing each sample's average distance to the other samples in its own cluster with its average distance to the samples in the nearest other cluster. A higher Silhouette Score indicates better clustering performance, with well-separated and distinct clusters.

The Calinski-Harabasz Index evaluates the clustering performance by considering the ratio of between-cluster dispersion to within-cluster dispersion. A higher Calinski-Harabasz Index suggests better clustering performance, with greater separation between clusters and lower variance within clusters.

By utilising these evaluation metrics, we can quantitatively assess the quality of clustering results and make informed decisions about the choice of clustering algorithms and parameter settings.
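As one illustration of using the metrics for parameter tuning, the sketch below fits k-means for several candidate values of n_clusters on the same synthetic data and tabulates the scores. The range of candidates (2 to 7) is an arbitrary choice for this example. It also includes the Davies-Bouldin Index computed in the k-means example earlier; for that index, lower values indicate better separation, whereas for the other two higher is better.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Same synthetic data as in the earlier examples
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Score one k-means clustering per candidate number of clusters
results = {}
for k in range(2, 8):  # candidate values chosen arbitrarily for illustration
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    results[k] = {
        "silhouette": silhouette_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
    }

# Print a small comparison table
print(f"{'k':>2} {'silhouette':>11} {'calinski-harabasz':>18} {'davies-bouldin':>15}")
for k, scores in results.items():
    print(
        f"{k:>2} {scores['silhouette']:>11.3f} "
        f"{scores['calinski_harabasz']:>18.3f} {scores['davies_bouldin']:>15.3f}"
    )

# Candidate with the highest silhouette score
best_k = max(results, key=lambda k: results[k]["silhouette"])
print("Highest silhouette at k =", best_k)
```

The metrics do not always agree on a single best value, so it is usually worth reading them together with a plot of the clusters rather than relying on one number.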

Updated on: 07-Aug-2023
