Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
homogeneity_score using sklearn in Python
While working with clustering algorithms in Python, it is important to be able to evaluate the performance of the models, and one of the popular metrics for evaluating the performance of the clustering model is the homogeneity score using sklearn. It measures how well the labels assigned by a clustering algorithm match the true labels of a dataset. The higher the homogeneity score, the better the clustering algorithm performed.
In this article, we'll take a closer look at the homogeneity score and how to compute it using Scikit-learn in Python.
What is the Homogeneity Score?
The homogeneity score is a metric used to evaluate the performance of clustering models. It measures how well the labels of a given clustering model match with the true labels of the dataset. A clustering result is considered homogeneous if all clusters contain only data points from a single class.
To understand the working of the homogeneity score, consider a clustering algorithm like K-means that partitions a dataset into multiple clusters. If the algorithm does a good job of separating the data points into distinct groups, then the homogeneity score will be high. On the other hand, if the algorithm assigns data points to the wrong clusters, then the homogeneity score will be low.
Syntax
sklearn.metrics.homogeneity_score(labels_true, labels_pred)
Parameters
| Parameter | Type | Description |
|---|---|---|
labels_true |
array-like, shape (n_samples,) | Ground truth class labels to be used as reference |
labels_pred |
array-like, shape (n_samples,) | Cluster labels to evaluate |
Return Value
This function returns a float value between 0.0 and 1.0, where 1.0 stands for perfectly homogeneous labeling.
Computing Homogeneity Score with Random Data
To compute the homogeneity score in Python using sklearn, we use the homogeneity_score function. Here's an example using randomly generated data ?
from sklearn.metrics.cluster import homogeneity_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate a random dataset
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
# Perform clustering using KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)
# Compute the homogeneity score
homo_score = homogeneity_score(y_true, y_pred)
print("Homogeneity score:", homo_score)
Homogeneity score: 1.0
Computing Homogeneity Score with Iris Dataset
Let's use the famous Iris dataset to demonstrate homogeneity score calculation with real-world data ?
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score
# Load the iris dataset
iris = load_iris()
X = iris.data
y_true = iris.target
# Perform clustering using KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
y_pred = kmeans.fit_predict(X)
# Compute the homogeneity score
homo_score = homogeneity_score(y_true, y_pred)
print("Homogeneity score:", homo_score)
print("Number of samples:", len(y_true))
print("Number of clusters:", len(set(y_pred)))
Homogeneity score: 0.7514854021988338 Number of samples: 150 Number of clusters: 3
Interpreting Homogeneity Score
The homogeneity score ranges from 0 to 1:
- Score = 1.0: Perfect homogeneity - each cluster contains only members of a single class
- Score = 0.0: Poor homogeneity - clusters contain mixed classes
- Score between 0 and 1: Partial homogeneity - some clusters are pure, others are mixed
Comparing Different Clustering Results
Let's compare homogeneity scores for different numbers of clusters ?
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score
# Load the iris dataset
iris = load_iris()
X = iris.data
y_true = iris.target
# Test different numbers of clusters
for n_clusters in [2, 3, 4, 5]:
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
y_pred = kmeans.fit_predict(X)
score = homogeneity_score(y_true, y_pred)
print(f"Clusters: {n_clusters}, Homogeneity Score: {score:.4f}")
Clusters: 2, Homogeneity Score: 0.6896 Clusters: 3, Homogeneity Score: 0.7515 Clusters: 4, Homogeneity Score: 0.7792 Clusters: 5, Homogeneity Score: 0.8059
Conclusion
The homogeneity score is a valuable metric for evaluating clustering performance by measuring how well clusters contain only data points from a single class. Using sklearn's homogeneity_score function, we can easily assess whether our clustering algorithm successfully separates different classes into distinct clusters.
