Clustering Methods with SciPy


Clustering is a technique in machine learning and data science that groups similar data points or objects into clusters or subsets. The goal of clustering is to reveal patterns and structure in data that may not be immediately apparent, and to group related data points together for further analysis. In this article, we are going to see how to implement clustering with the help of the SciPy library.

SciPy provides us with various scientific computing tools to perform tasks like numerical integration, optimization, linear algebra, signal processing, etc. It's used by researchers, scientists, engineers, and data analysts to perform complex calculations and analysis in their work. It is built on top of NumPy and includes a dedicated submodule for clustering, scipy.cluster.

Some clustering algorithms which can be implemented using SciPy are:

  • K-Means − Here the aim is to divide a dataset into k clusters, where k is a fixed number, and each data point belongs to the cluster whose mean (or centroid) is closest to it.

  • Hierarchical − Here we create a hierarchy of clusters that can be represented as a dendrogram. These methods are further divided into two types: agglomerative clustering (bottom-up) and divisive clustering (top-down).

Each of these methods has its own strengths and weaknesses, and the choice of which one to use depends on the characteristics of the data and the goals of the clustering. The scikit-learn library also provides clustering algorithms, with more advanced options such as the Gaussian Mixture Model and the Bayesian Gaussian Mixture Model.

K-Means Clustering with SciPy

The K-Means algorithm works by first randomly assigning k centroids to the dataset, and then iteratively reassigning data points to the closest centroid and updating the centroid based on the new cluster. This process is repeated until the clusters converge or a maximum number of iterations is reached. The SciPy library provides an implementation of the k-means algorithm in the scipy.cluster.vq module.
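Because kmeans measures plain Euclidean distance, SciPy's documentation recommends first scaling each feature to unit variance with whiten from the same module, so that no single feature dominates the distance; the file-based example below skips this step for brevity. Here is a minimal sketch on made-up data:

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# synthetic observations: two features on very different scales
obs = np.array([[1.0, 100.0], [1.5, 300.0], [9.0, 120.0], [9.5, 310.0]])

# scale each column to unit variance so both features contribute equally
scaled = whiten(obs)

centroids, _ = kmeans(scaled, 2)   # cluster into k=2 groups
labels, _ = vq(scaled, centroids)  # assign each observation to a centroid
print(labels)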

The dataset used (kmeans_dataset.csv) is available here.

Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq

# load the dataset as a NumPy array of observations
df = pd.read_csv("kmeans_dataset.csv")
X = df.values

# number of clusters
k = 4

# compute k-means clustering: returns the centroids and the distortion
centroids, _ = kmeans(X, k)

# assign each data point to its nearest centroid
clusters, _ = vq(X, centroids)

# plot the data points coloured by cluster assignment
colors = ['r', 'g', 'b', 'y']
for i in range(k):
    # select only data observations with cluster label == i
    ds = X[np.where(clusters == i)]
    # plot the data observations
    plt.scatter(ds[:, 0], ds[:, 1], c=colors[i])
    # plot the centroid with an 'x' marker
    plt.scatter(centroids[i, 0], centroids[i, 1], marker='x', s=200, c='black')
plt.show()

Output

The above code groups the data points into 4 clusters and plots them in different colours according to their cluster assignment. The cluster centroids are represented by 'x' markers.

You can adjust the number of clusters to suit your data and problem.
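If a good value of k is not obvious, one common heuristic (sketched here on synthetic data, not on the article's dataset) is the elbow method: since kmeans also returns the mean distortion, you can run it for several values of k and look for the point where the distortion stops dropping sharply.

import numpy as np
from scipy.cluster.vq import kmeans

# synthetic observations; substitute your own data array here
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))

# kmeans returns (centroids, distortion); print distortion for each k
for k in range(1, 8):
    _, distortion = kmeans(X, k)
    print(k, round(distortion, 3))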

In this example, we load the dataset linked above, cluster it with the k-means algorithm, and visualize the results.

Keep in mind that the k-means algorithm is sensitive to initial conditions, so the results may vary if you run it multiple times with different initial centroids.
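One way to make a run repeatable (assuming a recent SciPy release; older versions drew from NumPy's global random state instead) is to pass a fixed seed:

import pandas as pd
from scipy.cluster.vq import kmeans

df = pd.read_csv("kmeans_dataset.csv")
X = df.values

# a fixed seed pins down the random initial centroids, so repeated runs
# give the same result (the seed keyword assumes a recent SciPy version)
centroids, _ = kmeans(X, 4, seed=42)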

Hierarchical Clustering with SciPy

Hierarchical clustering is a method of clustering that builds a hierarchy of nested clusters, where each cluster at one level is contained within a cluster at the level above. The hierarchy is represented as a tree-like structure called a dendrogram. It is a powerful method for exploring and visualizing the structure of datasets, but it can be computationally expensive on large datasets and is sensitive to the linkage method used.

Example

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree

# sample data points
data = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]

# create the linkage matrix using Ward's method
Z = linkage(data, method='ward')

# cut the dendrogram at a height threshold to obtain flat clusters
clusters = cut_tree(Z, height=2)
print(clusters.ravel())

# plot the dendrogram with a label for each data point
plt.figure()
dendrogram(Z, labels=["data1", "data2", "data3", "data4", "data5", "data6"])
plt.show()

Output

The above code groups the data points into clusters using the linkage method 'ward', which minimizes the variance of the distances between the clusters being linked. The dendrogram function plots the dendrogram, a visualization of the hierarchical clustering solution. The cut_tree function extracts the clusters from the dendrogram at a given threshold; its output is an array with one cluster label per data point. It is also possible to customize the appearance of the dendrogram via matplotlib, such as the colour and size of the lines, labels, etc.
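A related helper worth knowing is fcluster from the same scipy.cluster.hierarchy module, which extracts flat clusters directly from the linkage matrix, either by a distance threshold or by a target number of clusters. A minimal sketch reusing the data from the example above:

from scipy.cluster.hierarchy import linkage, fcluster

data = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
Z = linkage(data, method='ward')

# cut where merge distances exceed t=2 (comparable to cut_tree above)
by_height = fcluster(Z, t=2, criterion='distance')

# or request exactly 2 flat clusters regardless of height
by_count = fcluster(Z, t=2, criterion='maxclust')

print(by_height)
print(by_count)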

Conclusion

SciPy does not cover every type of clustering, but it performs k-means as well as hierarchical clustering efficiently. SciPy's k-means algorithm is a simple and efficient method for partitioning a dataset into a fixed number of clusters. Hierarchical clustering is a method that builds a hierarchy of nested clusters. Widely used algorithms like DBSCAN, however, are not provided by SciPy.

Hence, if you are looking for a wide range of clustering algorithms, with built-in support for pre-processing and evaluation, and more flexibility, scikit-learn is the better choice.
