# Scikit Learn - Clustering Performance Evaluation

Scikit-learn offers several functions for evaluating the performance of clustering algorithms.

The most important and commonly used ones are described below.

## Adjusted Rand Index

The Rand Index is a similarity measure between two clusterings. It considers all pairs of samples and counts the pairs that are assigned to the same cluster or to different clusters in both the predicted and the true clustering. The raw Rand Index score is then "adjusted for chance" into the Adjusted Rand Index score using the following formula:

$$Adjusted\:RI=\left(RI-Expected\_RI\right)/\left(max\left(RI\right)-Expected\_RI\right)$$

The scoring function takes two parameters: **labels_true**, the ground-truth class labels, and **labels_pred**, the cluster labels to evaluate.
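To make the chance adjustment concrete, here is a minimal sketch that evaluates the formula by hand using only the standard library. It uses the common pair-counting form, in which the index is the number of sample pairs grouped together in both clusterings. The function name `adjusted_rand_index` is a hypothetical helper for illustration, not part of Scikit-learn.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting form of the Adjusted Rand Index (illustrative sketch)."""
    n = len(labels_true)
    # Pairs grouped together in BOTH clusterings (the raw index).
    joint = Counter(zip(labels_true, labels_pred))
    index = sum(comb(c, 2) for c in joint.values())
    # Pairs grouped together in each clustering on its own.
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Index expected under random labeling, and its maximum value.
    expected = pairs_true * pairs_pred / comb(n, 2)
    max_index = (pairs_true + pairs_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both clusterings put everything in one cluster.
        return 1.0
    return (index - expected) / (max_index - expected)


score = adjusted_rand_index([0, 0, 1, 1, 1, 1], [0, 0, 2, 2, 3, 3])
print(score)
```

For these labels the index is 3, the expected index is 1.4 and the maximum is 5, so the score is (3 - 1.4) / (5 - 1.4), roughly 0.444.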

### Example

    from sklearn.metrics.cluster import adjusted_rand_score

    labels_true = [0, 0, 1, 1, 1, 1]
    labels_pred = [0, 0, 2, 2, 3, 3]

    # Compare the predicted clustering against the ground truth
    adjusted_rand_score(labels_true, labels_pred)

### Output

0.4444444444444445

A perfect labeling scores 1. Poor or independent labelings score close to 0 and can be negative.

## Mutual Information Based Score

Mutual Information measures the agreement between two assignments, ignoring permutations of the labels. The following versions are available:

### Normalized Mutual Information (NMI)

Scikit-learn provides the **sklearn.metrics.normalized_mutual_info_score** function.

### Example

    from sklearn.metrics.cluster import normalized_mutual_info_score

    labels_true = [0, 0, 1, 1, 1, 1]
    labels_pred = [0, 0, 2, 2, 3, 3]

    # Mutual information normalized to the range [0, 1]
    normalized_mutual_info_score(labels_true, labels_pred)

### Output

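0.7611702597222881

This value corresponds to geometric-mean normalization, which was the default before Scikit-learn 0.22. Newer versions default to **average_method='arithmetic'** and return approximately 0.734 for the same labels.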
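As a rough illustration of what the normalization does, the sketch below computes the mutual information and the two label entropies by hand, then divides by either the arithmetic or the geometric mean of the entropies. The helper names `entropy` and `mutual_information` are hypothetical and written only for this example. Natural logarithms are assumed.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy of a labeling, in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_true, labels_pred):
    """Mutual information between two labelings, in nats."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)
    return sum(
        (c / n) * log(n * c / (true_sizes[t] * pred_sizes[p]))
        for (t, p), c in joint.items()
    )


labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]

mi = mutual_information(labels_true, labels_pred)
h_true = entropy(labels_true)
h_pred = entropy(labels_pred)

# Two common ways to normalize MI by the entropies.
nmi_arithmetic = mi / ((h_true + h_pred) / 2)
nmi_geometric = mi / sqrt(h_true * h_pred)
print(nmi_arithmetic, nmi_geometric)
```

The two normalizations give different scores for the same labels, which is why the averaging method matters when comparing results across library versions.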

### Adjusted Mutual Information (AMI)

Adjusted Mutual Information corrects the mutual information for chance agreement. Scikit-learn provides the **sklearn.metrics.adjusted_mutual_info_score** function.

### Example

    from sklearn.metrics.cluster import adjusted_mutual_info_score

    labels_true = [0, 0, 1, 1, 1, 1]
    labels_pred = [0, 0, 2, 2, 3, 3]

    # Mutual information adjusted for chance agreement
    adjusted_mutual_info_score(labels_true, labels_pred)

### Output

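0.4444444444444448

This value corresponds to **average_method='max'**, which was the default before Scikit-learn 0.22. Newer versions default to **average_method='arithmetic'** and return approximately 0.62 for the same labels.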

## Fowlkes-Mallows Score

The Fowlkes-Mallows score measures the similarity of two clusterings of a set of points. It is defined as the geometric mean of the pairwise precision and recall.

Mathematically,

$$FMS=\frac{TP}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)}}$$

Here, **TP = True Positive** is the number of pairs of points that belong to the same cluster in both the true and the predicted labels.

**FP = False Positive** is the number of pairs of points that belong to the same cluster in the predicted labels but not in the true labels.

**FN = False Negative** is the number of pairs of points that belong to the same cluster in the true labels but not in the predicted labels.
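The pair counts behind this formula can be obtained by brute force over all pairs of samples. The following is a minimal sketch using only the standard library. The function `fowlkes_mallows` and its counter names are hypothetical, chosen for illustration. Because the formula is symmetric in the two mismatch counts, the score does not depend on which of them is labeled FP and which FN.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """Brute-force Fowlkes-Mallows score from pair counts (illustrative)."""
    tp = same_true_only = same_pred_only = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1              # together in both clusterings
        elif same_true:
            same_true_only += 1  # together only in the true labels
        elif same_pred:
            same_pred_only += 1  # together only in the predicted labels
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + same_true_only) * (tp + same_pred_only))


score = fowlkes_mallows([0, 0, 1, 1, 1, 1], [0, 0, 2, 2, 3, 3])
print(score)
```

For these labels, 3 pairs are together in both clusterings, 4 are together only in the true labels and none only in the predicted labels, giving 3 / sqrt(7 * 3).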

Scikit-learn provides the **sklearn.metrics.fowlkes_mallows_score** function:

### Example

    from sklearn.metrics.cluster import fowlkes_mallows_score

    labels_true = [0, 0, 1, 1, 1, 1]
    labels_pred = [0, 0, 2, 2, 3, 3]

    # Geometric mean of pairwise precision and recall
    fowlkes_mallows_score(labels_true, labels_pred)

### Output

0.6546536707079771

## Silhouette Coefficient

The Silhouette function computes the mean Silhouette Coefficient over all samples, using the mean intra-cluster distance and the mean nearest-cluster distance of each sample.

Mathematically,

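$$S=\left(b-a\right)/max\left(a,b\right)$$

Here, a is the mean intra-cluster distance and b is the mean nearest-cluster distance, that is, the mean distance from a sample to the points of the nearest cluster it does not belong to.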
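A small hand computation can help show how a and b combine for each sample. The sketch below is a plain-Python illustration with Euclidean distance. The function `silhouette_samples_sketch` and the toy points are assumptions made for this example. It assumes at least two clusters and no cluster made only of duplicate points, and it scores single-member clusters as 0.

```python
from math import dist
from statistics import mean


def silhouette_samples_sketch(points, labels):
    """Per-sample silhouette values with Euclidean distance (illustrative)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean(silhouette_samples_sketch(points, [0, 0, 1, 1]))
bad = mean(silhouette_samples_sketch(points, [0, 1, 0, 1]))
print(good, bad)
```

The labeling that groups nearby points together scores close to 1, while the labeling that mixes the two groups scores below 0.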

Scikit-learn provides the **sklearn.metrics.silhouette_score** function:

### Example

    from sklearn import datasets
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Load the Iris data; the target labels are not needed for this metric
    dataset = datasets.load_iris()
    X = dataset.data

    # Cluster into three groups (n_init is set explicitly to avoid
    # version-dependent defaults)
    kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
    labels = kmeans_model.labels_

    # Mean Silhouette Coefficient over all samples
    silhouette_score(X, labels, metric='euclidean')

### Output

0.5528190123564091

## Contingency Matrix

The contingency matrix reports the intersection cardinality for every (true, predicted) cluster pair. The confusion matrix used in classification problems is a square contingency matrix.

Scikit-learn provides the **sklearn.metrics.cluster.contingency_matrix** function.

### Example

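    from sklearn.metrics.cluster import contingency_matrix

    x = ["a", "a", "a", "b", "b", "b"]  # true clusters
    y = [1, 1, 2, 0, 1, 2]              # predicted clusters

    # Rows follow the true clusters, columns the predicted clusters
    contingency_matrix(x, y)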

### Output

    array([[0, 2, 1],
           [1, 1, 1]])

The first row of the output shows that among the three samples whose true cluster is "a", none is in predicted cluster 0, two are in 1 and one is in 2. The second row shows that among the three samples whose true cluster is "b", one is in 0, one is in 1 and one is in 2.
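The same counts can be reproduced without Scikit-learn, which may help clarify what each cell means. This is a minimal sketch in which the function `contingency` is a hypothetical helper. It assumes that rows and columns follow the sorted unique labels.

```python
from collections import Counter


def contingency(labels_true, labels_pred):
    """Count samples for every (true, predicted) label pair (illustrative)."""
    rows = sorted(set(labels_true))  # one row per true cluster
    cols = sorted(set(labels_pred))  # one column per predicted cluster
    counts = Counter(zip(labels_true, labels_pred))
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix


x = ["a", "a", "a", "b", "b", "b"]
y = [1, 1, 2, 0, 1, 2]
rows, cols, matrix = contingency(x, y)
print(rows, cols, matrix)
```

Each row sums to the size of the corresponding true cluster, and each column sums to the size of the corresponding predicted cluster.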