K-Medoids clustering with solved example in Machine Learning


Introduction

K-Medoids is an unsupervised clustering algorithm that uses the partitioning approach to clustering. It is an improved variant of the K-Means algorithm, designed especially to handle outlier data. Like K-Means, it works on unlabeled data.

In this article, let us understand the K-Medoids algorithm with a solved example.

K-Medoids Algorithm

In the K-Medoids algorithm, each cluster is represented by one of its own data points, called a medoid, which serves as the cluster center. The medoid is the point whose sum of distances to all other points in the same cluster is minimum. Any suitable distance metric, such as Euclidean distance or Manhattan distance, can be used.
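As an illustration, the medoid of a small cluster can be found by brute force with a short helper (a sketch; `find_medoid` is our own name, not a library function, and we use Manhattan distance):

```python
import numpy as np

def find_medoid(points):
    # Pairwise Manhattan distances; the medoid minimises the row sum,
    # i.e. its total distance to every other point in the cluster.
    pts = np.asarray(points)
    costs = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2).sum(axis=1)
    return pts[np.argmin(costs)]

cluster = [(9, 6), (10, 4), (8, 5), (8, 4), (9, 3)]
print(find_medoid(cluster))  # → [8 4]
```

Note that, unlike a K-Means centroid, the medoid is always an actual point of the cluster.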

After the algorithm is applied, the complete data is divided into K clusters.

K-Medoids comes in three variants – PAM, CLARA, and CLARANS. PAM is the most popular method, but it has the disadvantage of being computationally expensive.

K-Medoids is applied subject to the following constraints

  • A single point can belong to only one cluster

  • Each cluster has at least one point

Let us see the working process of K-Medoids with an example.

Working

Initially, we have K as the number of clusters and D as our unlabelled data.

  1. First, we choose K points from the dataset and assign one to each of the K clusters. These K points act as the initial medoids.

  2. Next, the distance between each medoid and every non-medoid point is calculated using a distance metric such as Euclidean distance or Manhattan distance.

  3. Each non-medoid point is assigned to the cluster whose medoid is nearest to it.

  4. The total cost is then calculated: the sum of the distances from the points to the medoid within each cluster.

  5. Next, a random non-medoid object s is selected and swapped with a current medoid object r, and the cost is recalculated.

  6. If the new cost (with s as medoid) is less than the old cost (with r), the swap becomes permanent; otherwise it is reverted.

  7. Finally, steps 2 to 6 are repeated until there is no change in cost.
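The steps above can be sketched as a minimal PAM-style loop. This is an illustrative simplification, not a library implementation: we initialise with the first k points instead of a random choice (for determinism) and use Manhattan distance.

```python
def manhattan(p, q):
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

def total_cost(points, medoids):
    # Sum of each point's distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k=2):
    idx = list(range(k))  # initial medoids: first k points, for determinism
    cost = total_cost(points, [points[i] for i in idx])
    improved = True
    while improved:  # repeat until no swap lowers the cost
        improved = False
        for r in range(k):                # each current medoid r ...
            for s in range(len(points)):  # ... against each non-medoid s
                if s in idx:
                    continue
                trial = idx.copy()
                trial[r] = s              # tentative swap
                new_cost = total_cost(points, [points[i] for i in trial])
                if new_cost < cost:       # keep the swap only if cost drops
                    idx, cost, improved = trial, new_cost, True
    return [points[i] for i in idx], cost

data = [(9, 6), (10, 4), (4, 4), (5, 8), (3, 8),
        (2, 5), (8, 5), (4, 6), (8, 4), (9, 3)]
print(pam(data))  # → ([(4, 6), (8, 4)], 19)
```

On the dataset used in the worked example below, this sketch converges to the same medoids and cost as the manual calculation.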

Example

Let us consider the following set of data. We will take k = 2 and use the Manhattan distance formula

$$\mathrm{D=\mid x_{2}-x_{1}\mid+\mid y_{2}-y_{1}\mid}$$
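In code, this formula is a one-liner. For example, the distance between the points (8, 4) and (9, 6) from the dataset below works out to 1 + 2 = 3:

```python
def manhattan(p, q):
    # D = |x2 - x1| + |y2 - y1|
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

print(manhattan((8, 4), (9, 6)))  # → 3
```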

Sl. no    x     y
1         9     6
2        10     4
3         4     4
4         5     8
5         3     8
6         2     5
7         8     5
8         4     6
9         8     4
10        9     3

On plotting, the data looks like the figure below.

For k = 2, let us take two random points P1(8,4) and P2(4,6) as initial medoids and calculate their distances from the other points.

Sl. no    x     y    Dist from P1(8,4)    Dist from P2(4,6)
1         9     6            3                    5
2        10     4            2                    8
3         4     4            4                    2
4         5     8            7                    3
5         3     8            9                    3
6         2     5            7                    3
7         8     5            1                    5
8         4     6            -                    -
9         8     4            -                    -
10        9     3            2                    8

Points 1, 2, 7, 10 - assigned to P1(8,4)

Points 3, 4, 5, 6 - assigned to P2(4,6)

Total cost involved C1 = (3+2+1+2) + (2+3+3+3) = 19

Now let us swap the medoid P1(8,4) with a randomly selected non-medoid point (8,5), so the trial medoids are P1(8,5) and P2(4,6), and recalculate the distances.

Sl. no    x     y    Dist from P1(8,5)    Dist from P2(4,6)
1         9     6            2                    5
2        10     4            3                    8
3         4     4            5                    2
4         5     8            6                    3
5         3     8            8                    3
6         2     5            6                    3
7         8     5            -                    -
8         4     6            -                    -
9         8     4            1                    6
10        9     3            3                    8

Points 1, 2, 9, 10 - assigned to P1(8,5)

Points 3, 4, 5, 6 - assigned to P2(4,6)

$$\mathrm{New\:cost\:involved\:C_{2}=[2+3+1+3]+[2+3+3+3]=20}$$

$$\mathrm{Total\:cost\:involved \:in\:swapping\:C=C_{2}-C_{1}=20-19=1}$$

Since the total cost involved in swapping, C, is greater than 0, we revert the swap.

Points P1(8,4) and P2(4,6) are therefore retained as the final medoids, and the two clusters are formed around these points.
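The swap test above can be verified with a short script (a sketch; `manhattan` and `config_cost` are our own helpers, not part of any library):

```python
def manhattan(p, q):
    # Manhattan distance: |x2 - x1| + |y2 - y1|
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

points = [(9, 6), (10, 4), (4, 4), (5, 8), (3, 8),
          (2, 5), (8, 5), (4, 6), (8, 4), (9, 3)]

def config_cost(m1, m2):
    # Total distance of every non-medoid point to its nearest medoid
    return sum(min(manhattan(p, m1), manhattan(p, m2))
               for p in points if p not in (m1, m2))

c1 = config_cost((8, 4), (4, 6))  # cost with the original medoids
c2 = config_cost((8, 5), (4, 6))  # cost after the trial swap
print(c1, c2, c2 - c1)  # → 19 20 1; the difference is positive, so revert
```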

Code Implementation

import numpy as np
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

data = {'x': [9, 10, 4, 5, 3, 2, 8, 4, 8, 9],
        'y': [6, 4, 4, 8, 8, 5, 5, 6, 4, 3]}

# Pack the coordinates into an (n_samples, 2) array
data_x = np.asarray(list(zip(data['x'], data['y'])))

model_km = KMedoids(n_clusters=2)
km = model_km.fit(data_x)
print("Labels :", km.labels_)
print("Cluster centers :", km.cluster_centers_)

Output

Labels : [1 1 0 0 0 0 1 0 1 1]
Cluster centers  : [[4. 6.]
 [8. 4.]]

Conclusion

K-Medoids is an improvement over the K-Means algorithm. It is unsupervised and requires unlabelled data. Being a distance-based method, K-Medoids depends on the within-cluster distances, where the medoids act as cluster centers and are taken as reference points for the distance calculations. It is highly useful because it can handle outliers quite effectively.

Updated on: 09-Aug-2023
