homogeneity_score using sklearn in Python


When working with clustering algorithms in Python, you need a way to evaluate the results. One popular metric for this is the homogeneity score, available in Scikit-learn (sklearn). It compares the cluster labels assigned by an algorithm with the true class labels of a dataset and measures whether each cluster contains only members of a single class. The higher the homogeneity score, the purer the clusters.

In this article, we'll take a closer look at the homogeneity score and how to compute it using Scikit-learn in Python.

What is the Homogeneity score?

The homogeneity score is a metric for evaluating the output of a clustering model, that is, the set of cluster labels it assigns. It requires the true class labels of the dataset and measures the extent to which each cluster contains only data points from a single class.

To understand how the homogeneity score works, consider a clustering algorithm such as K-means that partitions a dataset into multiple clusters. If every cluster contains data points from only one true class, the homogeneity score will be high. If, on the other hand, the algorithm mixes data points from different classes in the same cluster, the homogeneity score will be low.
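As a small illustration of this behaviour, the sketch below scores four hand-written labelings of the same four points. The label lists are made up for the example and are not taken from any dataset.

```python
from sklearn.metrics import homogeneity_score

# Hypothetical ground truth: two classes with two points each
labels_true = [0, 0, 1, 1]

# Same grouping as the truth, only the cluster ids are swapped
perfect = homogeneity_score(labels_true, [1, 1, 0, 0])

# Every point in its own cluster: each cluster still holds a single class
over_split = homogeneity_score(labels_true, [0, 1, 2, 3])

# Each cluster holds one point from each class
mixed = homogeneity_score(labels_true, [0, 1, 0, 1])

# All points in one cluster, so both classes are merged together
merged = homogeneity_score(labels_true, [0, 0, 0, 0])

for name, score in [("perfect", perfect), ("over_split", over_split),
                    ("mixed", mixed), ("merged", merged)]:
    print(f"{name:10s} {score:.3f}")
```

The first two labelings score 1.0 and the last two score 0.0 (up to floating-point rounding). Note that the score does not depend on which id a cluster receives, and that splitting a class across several clusters does not lower it.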

Syntax

sklearn.metrics.homogeneity_score(labels_true, labels_pred)

Parameters

S.no | Parameter | Definition
1 | labels_true (int array, shape = [n_samples]) | Ground truth class labels, used as the reference.
2 | labels_pred (array-like of shape (n_samples,)) | Cluster labels to evaluate.

This function returns homogeneity (float), a score between 0.0 and 1.0, where 1.0 stands for perfectly homogeneous labeling.
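To make the returned value less of a black box, here is a minimal sketch that recomputes the score by hand, assuming the usual definition h = 1 - H(C|K) / H(C), where H(C) is the entropy of the true classes and H(C|K) is the entropy of the classes within each cluster. The helper name manual_homogeneity and the sample labels are hypothetical and only serve the illustration; this is not Scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.metrics import homogeneity_score


def manual_homogeneity(labels_true, labels_pred):
    """Recompute homogeneity as 1 - H(C|K) / H(C) (illustrative helper)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size

    # H(C): entropy of the true class distribution
    _, class_counts = np.unique(labels_true, return_counts=True)
    p_c = class_counts / n
    h_c = -np.sum(p_c * np.log(p_c))
    if h_c == 0.0:
        # A single class cannot be mixed, so the labeling is homogeneous
        return 1.0

    # H(C|K): entropy of the classes inside each cluster, weighted by size
    h_c_given_k = 0.0
    for k in np.unique(labels_pred):
        in_cluster = labels_true[labels_pred == k]
        _, counts = np.unique(in_cluster, return_counts=True)
        h_c_given_k -= np.sum((counts / n) * np.log(counts / in_cluster.size))

    return 1.0 - h_c_given_k / h_c


# Made-up labels: the third cluster mixes classes 1 and 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

manual = manual_homogeneity(y_true, y_pred)
reference = homogeneity_score(y_true, y_pred)
print(f"manual={manual:.6f}  sklearn={reference:.6f}")
```

The two printed values should agree, which suggests the score can be read as the fraction of class uncertainty that is removed once the cluster of a point is known.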

How to compute the homogeneity score in Python?

To compute the homogeneity score in Python, we can use the homogeneity_score function from the sklearn.metrics.cluster module (it can also be imported directly from sklearn.metrics). Below is an example that computes the homogeneity score on a random dataset generated with make_blobs −

Program to calculate homogeneity score using random data

from sklearn.metrics.cluster import homogeneity_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a random dataset
X1, y1 = make_blobs(n_samples=2000, centers=6, random_state=50)

# Perform clustering using KMeans
kmeans1 = KMeans(n_clusters=6, random_state=50)
labels1 = kmeans1.fit_predict(X1)

# Compute the homogeneity score
homo_score = homogeneity_score(y1, labels1)

print("Homogeneity score:", homo_score)

Output

Homogeneity score: 0.8845679179458327

In the above example, we first generate a random dataset using the make_blobs function from Scikit-learn. Then, we perform clustering using the KMeans algorithm with 6 clusters, matching the 6 centers used to generate the data. Finally, we compute the homogeneity score using the homogeneity_score function, passing the true labels y1 and the predicted labels labels1 as the arguments.

Program to calculate homogeneity score using the built-in iris dataset

For this example, we'll use the iris dataset that comes with Scikit-learn. We will cluster the samples based on their features and evaluate the result using the homogeneity score.

Follow the steps below to calculate the homogeneity score using the built-in iris dataset −

  • Load the iris dataset using the load_iris function from Scikit-learn.

  • Extract the data and the true labels from the dataset.

  • Perform clustering using the KMeans algorithm with three clusters (since there are three classes in the iris dataset).

  • Compute the homogeneity score using the homogeneity_score function from Scikit-learn, passing the true labels y1_true and the predicted labels y1_pred as arguments.

Below is the code to load the data and then compute the homogeneity score using KMeans clustering −

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score

# Load the iris dataset
iris_df = load_iris()
X1 = iris_df.data
y1_true = iris_df.target

# Perform clustering using KMeans
kmeans = KMeans(n_clusters=3, random_state=50)
y1_pred = kmeans.fit_predict(X1)

# Compute the homogeneity score
homo_score = homogeneity_score(y1_true, y1_pred)

print("Homogeneity score:", homo_score)

Output

Homogeneity score: 0.7514854021988338

When you run this code, the homogeneity score is printed to the console. The score is a value between 0 and 1, with higher values indicating that the clusters are purer with respect to the true classes.
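One caveat when reading the score: it depends on how many clusters are requested. The sketch below, offered as an illustration and not as part of the original example, reruns KMeans on the iris data for several values of k and records the homogeneity score for each. The chosen values of k and the explicit n_init=10 are assumptions made for this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import homogeneity_score

iris = load_iris()
X, y_true = iris.data, iris.target

scores = {}
for k in (2, 3, 6, 10):
    # n_init is set explicitly so the run does not depend on the
    # library default, which has changed between releases
    y_pred = KMeans(n_clusters=k, n_init=10, random_state=50).fit_predict(X)
    scores[k] = homogeneity_score(y_true, y_pred)

for k, score in scores.items():
    print(f"k={k:2d}  homogeneity={score:.3f}")
```

With only two clusters, at least two of the three iris classes must share a cluster, so the score stays well below 1. Since splitting the data into more, smaller clusters generally does not hurt homogeneity, a high score on its own does not show that the number of clusters was chosen well.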

Conclusion

In conclusion, the homogeneity score is a useful metric for evaluating clustering algorithms such as KMeans when the true class labels are available. By computing it, we can determine how well a clustering keeps data points from different classes in separate clusters. In Python, we can use the homogeneity_score function from Scikit-learn to compute the homogeneity score for a given set of true and predicted labels.

Updated on: 24-Jul-2023
