homogeneity_score using sklearn in Python

When you work with clustering algorithms in Python, you need a way to evaluate the result. One popular metric is the homogeneity score, available in scikit-learn (sklearn). Given the ground-truth class labels of a dataset, it measures the extent to which each cluster contains only members of a single class. The score ranges from 0.0 to 1.0, and higher values mean purer clusters.

In this article, we'll take a closer look at the homogeneity score and how to compute it using Scikit-learn in Python.

What is the Homogeneity Score?

The homogeneity score is a metric for evaluating clustering models when the true class labels are available. It compares the cluster assignments with those labels and checks how pure each cluster is. A clustering result is perfectly homogeneous if every cluster contains only data points from a single class. The score does not depend on the label values themselves, so renaming or permuting the cluster labels leaves it unchanged.

To see how the score behaves, consider a clustering algorithm like K-means that partitions a dataset into multiple clusters. If each resulting cluster is made up almost entirely of points from one class, the homogeneity score will be high. If the clusters mix points from several classes, the homogeneity score will be low.
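
The idea of cluster purity can be made concrete with entropy. The following is a minimal sketch, assuming the standard entropy-based definition h = 1 - H(C|K) / H(C), where C are the true classes and K the predicted clusters. The helper names `entropy`, `conditional_entropy` and `homogeneity_sketch` are hypothetical and written only for illustration; they are not part of scikit-learn.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(classes, clusters):
    """H(C | K): class uncertainty that remains once the cluster is known."""
    n = len(classes)
    pair_counts = Counter(zip(classes, clusters))
    cluster_counts = Counter(clusters)
    return -sum(
        (n_ck / n) * log(n_ck / cluster_counts[k])
        for (_, k), n_ck in pair_counts.items()
    )


def homogeneity_sketch(classes, clusters):
    """Hypothetical helper: 1 - H(C | K) / H(C), taken as 1.0 when H(C) is 0."""
    h_c = entropy(classes)
    if h_c == 0.0:
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_c


# Two pure clusters, then a single cluster that mixes both classes.
print(homogeneity_sketch([0, 0, 1, 1], [1, 1, 0, 0]))
print(homogeneity_sketch([0, 0, 1, 1], [0, 0, 0, 0]))
```

When every cluster is pure, knowing the cluster removes all uncertainty about the class, so H(C|K) is zero and the score reaches 1. When the clusters reveal nothing about the class, H(C|K) equals H(C) and the score drops to 0.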

Syntax

sklearn.metrics.homogeneity_score(labels_true, labels_pred)

Parameters

Parameter	Type	Description
`labels_true`	array-like, shape (n_samples,)	Ground truth class labels to be used as reference
`labels_pred`	array-like, shape (n_samples,)	Cluster labels to evaluate

Return Value

This function returns a float value between 0.0 and 1.0, where 1.0 stands for perfectly homogeneous labeling.
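
Before moving to real clusterings, it can help to call the function on a few hand-made label lists. The sketch below uses small toy labels, chosen only for illustration, to show the two ends of the range and a value in between.

```python
from sklearn.metrics import homogeneity_score

labels_true = [0, 0, 1, 1]

# Same grouping with different cluster ids: the label names do not matter.
relabelled = homogeneity_score(labels_true, [1, 1, 0, 0])

# Every point in one cluster: the clustering says nothing about the classes.
single_cluster = homogeneity_score(labels_true, [0, 0, 0, 0])

# One pure cluster and one mixed cluster: the score lands in between.
partly_mixed = homogeneity_score(labels_true, [0, 1, 1, 1])

print(f"relabelled:     {relabelled:.3f}")
print(f"single cluster: {single_cluster:.3f}")
print(f"partly mixed:   {partly_mixed:.3f}")
```

Note the argument order: the ground-truth labels come first and the predicted cluster labels second.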

Computing Homogeneity Score with Random Data

To compute the homogeneity score in Python using sklearn, we use the homogeneity_score function. Here's an example using randomly generated data:

from sklearn.metrics import homogeneity_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a random dataset
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)

# Perform clustering using KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)

# Compute the homogeneity score
homo_score = homogeneity_score(y_true, y_pred)

print("Homogeneity score:", homo_score)

Homogeneity score: 1.0

Computing Homogeneity Score with Iris Dataset

Let's use the famous Iris dataset to demonstrate homogeneity score calculation with real-world data:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y_true = iris.target

# Perform clustering using KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
y_pred = kmeans.fit_predict(X)

# Compute the homogeneity score
homo_score = homogeneity_score(y_true, y_pred)

print("Homogeneity score:", homo_score)
print("Number of samples:", len(y_true))
print("Number of clusters:", len(set(y_pred)))

Homogeneity score: 0.7514854021988338
Number of samples: 150
Number of clusters: 3

Interpreting Homogeneity Score

The homogeneity score ranges from 0 to 1:

Score = 1.0: Perfect homogeneity - each cluster contains only members of a single class
Score = 0.0: No homogeneity - the cluster assignments carry no information about the classes (for example, all points placed in a single cluster)
Score between 0 and 1: Partial homogeneity - some clusters are pure, others are mixed
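
A single score does not say which clusters are pure and which are mixed. One way to look inside a partial score is to count the true classes within each cluster. The following is a minimal sketch using only the standard library; `cluster_composition` is a hypothetical helper and the labels are toy values chosen for illustration.

```python
from collections import Counter, defaultdict


def cluster_composition(labels_true, labels_pred):
    """Hypothetical helper: map each cluster id to a Counter of true classes."""
    composition = defaultdict(Counter)
    for true_label, cluster_id in zip(labels_true, labels_pred):
        composition[cluster_id][true_label] += 1
    return dict(composition)


# Toy labels, chosen only for illustration.
labels_true = ["a", "a", "a", "b", "b", "c"]
labels_pred = [0, 0, 1, 1, 1, 2]

composition = cluster_composition(labels_true, labels_pred)
for cluster_id, counts in sorted(composition.items()):
    # Share of the cluster taken up by its most common class.
    _, majority_count = counts.most_common(1)[0]
    majority_share = majority_count / sum(counts.values())
    status = "pure" if len(counts) == 1 else "mixed"
    print(f"cluster {cluster_id}: {dict(counts)} -> {status}, majority share {majority_share:.2f}")
```

Clusters reported as mixed are the ones that pull the homogeneity score below 1.0.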

Comparing Different Clustering Results

Let's compare homogeneity scores for different numbers of clusters:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y_true = iris.target

# Test different numbers of clusters
for n_clusters in [2, 3, 4, 5]:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    y_pred = kmeans.fit_predict(X)
    score = homogeneity_score(y_true, y_pred)
    print(f"Clusters: {n_clusters}, Homogeneity Score: {score:.4f}")

Clusters: 2, Homogeneity Score: 0.5223
Clusters: 3, Homogeneity Score: 0.7515
Clusters: 4, Homogeneity Score: 0.7792
Clusters: 5, Homogeneity Score: 0.8059
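
Homogeneity tends to rise as the number of clusters grows, because smaller clusters are more likely to be pure; in the extreme case where every point is its own cluster, the score is 1.0. For that reason it is often read together with completeness and their harmonic mean, the V-measure. The sketch below repeats the comparison with scikit-learn's `homogeneity_completeness_v_measure`, which returns all three values at once. Setting `n_init=10` is an assumption added here so that the run does not depend on the library's version-specific default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import homogeneity_completeness_v_measure

iris = load_iris()
X, y_true = iris.data, iris.target

results = {}
for n_clusters in [2, 3, 4, 5]:
    # n_init is fixed explicitly (an assumption, not in the original example).
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    y_pred = kmeans.fit_predict(X)
    # Returns a tuple: (homogeneity, completeness, v_measure).
    results[n_clusters] = homogeneity_completeness_v_measure(y_true, y_pred)

for n_clusters, (h, c, v) in results.items():
    print(
        f"Clusters: {n_clusters}, homogeneity: {h:.4f}, "
        f"completeness: {c:.4f}, V-measure: {v:.4f}"
    )
```

A cluster count that raises homogeneity while lowering completeness is splitting classes across clusters rather than separating them better.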

Conclusion

The homogeneity score is a valuable metric for evaluating clustering performance when ground-truth labels are available. It measures the extent to which each cluster contains only data points from a single class. Using sklearn's homogeneity_score function, we can easily assess whether our clustering algorithm keeps different classes in separate clusters.

Priya Mishra

Updated on: 2026-03-27T07:50:21+05:30