
ML - Analysis of Silhouette Score
The range of Silhouette score is [-1, 1]. Its analysis is as follows −
- +1 Score − A Silhouette score near +1 indicates that the sample is far away from its neighboring cluster.
- 0 Score − A Silhouette score of 0 indicates that the sample is on or very close to the decision boundary separating two neighboring clusters.
- -1 Score − A Silhouette score near -1 indicates that the sample has been assigned to the wrong cluster.
The Silhouette score can be calculated by using the following formula −
$$Silhouette\:score\:=\:\frac{p-q}{max(p,q)}$$
Here, p = mean distance to the points in the nearest cluster
And, q = mean intra-cluster distance to all the points.
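As an illustration, the following minimal sketch computes the Silhouette score for a K-means clustering with scikit-learn. The synthetic 3-cluster dataset and the K-means settings are assumptions made for this example only.

```python
# A minimal sketch: Silhouette score of a K-means clustering on toy data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate a synthetic dataset with 3 well-separated clusters (assumed setup)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-means and obtain cluster labels for every sample
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# silhouette_score averages (p - q) / max(p, q) over all samples
print("Silhouette score:", silhouette_score(X, labels))
```

A score close to +1 here would confirm that the samples sit well inside their own clusters, far from the neighboring ones.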
Davies-Bouldin Index
The Davies-Bouldin (DB) index is another good metric for analyzing clustering algorithms. With the help of the DB index, we can understand the following points about a clustering model −
- Whether the clusters are well-spaced from each other or not.
- How dense the clusters are.
We can calculate the DB index with the help of the following formula −
$$DB\:=\:\frac{1}{n}\displaystyle\sum\limits_{i=1}^n max_{j\neq\:i}(\frac{\sigma_{i}+\sigma_{j}}{d(c_{i},c_{j})})$$
Here, n = number of clusters
$\sigma_{i}$ = average distance of all points in cluster $i$ from the cluster centroid $c_{i}$.
And, $d(c_{i},c_{j})$ = distance between the centroids $c_{i}$ and $c_{j}$.
The lower the DB index, the better the clustering model is.
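scikit-learn provides davies_bouldin_score, so the DB index can be computed in the same way as the Silhouette score above; the data and K-means settings below are again assumptions made for the example.

```python
# A minimal sketch: DB index of a K-means clustering on toy data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Lower values indicate denser, better-separated clusters
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))
```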
Dunn Index
It works in a similar way to the DB index, but the two differ in the following points −
- The Dunn index considers only the worst case, i.e. the clusters that are close together, while the DB index considers the dispersion and separation of all the clusters in the clustering model.
- The Dunn index increases as the performance increases, while the DB index gets better (lower) when clusters are well-spaced and dense.
We can calculate the Dunn index with the help of the following formula −
$$D\:=\:\frac{min_{1\leq\:i<j\leq\:n}\:p(i,j)}{max_{1\leq\:k\leq\:n}\:q(k)}$$
Here, i, j, k = indices of the clusters
p = inter-cluster distance
q = intra-cluster distance
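scikit-learn does not provide a Dunn index function, so the following is a rough sketch of how it could be computed with NumPy and SciPy. The dunn_index helper is hypothetical and written only for this example; it uses the minimum pairwise distance between points of two different clusters as p(i, j) and the cluster diameter as q(k), which is one common choice among several.

```python
# A minimal sketch of the Dunn index (hypothetical helper, not part of scikit-learn).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # q(k): intra-cluster distance, taken here as each cluster's diameter
    max_intra = max(pdist(c).max() for c in clusters if len(c) > 1)
    # p(i, j): inter-cluster distance, taken here as the minimum pairwise
    # distance between points of two different clusters
    min_inter = min(cdist(ci, cj).min()
                    for i, ci in enumerate(clusters)
                    for cj in clusters[i + 1:])
    return min_inter / max_intra

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("Dunn index:", dunn_index(X, labels))
```

A higher Dunn index here would indicate compact clusters that are far apart in the worst case.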