Clustering Methods with SciPy
Clustering is a technique in machine learning and data science that groups similar data points or objects into clusters or subsets. The goal of clustering is to uncover patterns and structure in data that may not be immediately apparent, and to group related data points together for further analysis. In this article, we are going to see how to implement clustering with the help of the SciPy library.
SciPy provides various scientific computing tools for tasks such as numerical integration, optimization, linear algebra, and signal processing. It is used by researchers, scientists, engineers, and data analysts to perform complex calculations and analyses in their work. It is built on top of NumPy and includes a dedicated submodule, scipy.cluster, for clustering.
Some clustering algorithms which can be implemented using SciPy are:
K-Means − Here the aim is to divide a dataset into k clusters, where k is a fixed number, such that each data point belongs to the cluster whose mean (or centroid) is closest to it.
Hierarchical − Here we build a hierarchy of clusters that can be represented as a dendrogram. Hierarchical methods are further divided into two types: agglomerative (bottom-up) clustering and divisive (top-down) clustering.
Each of these methods has its own strengths and weaknesses, and the choice of which one to use will depend on the characteristics of the data and the goals of the clustering. The scikit-learn library also provides clustering algorithms, with more advanced features like Gaussian Mixture Model, Bayesian Gaussian Mixture Model, etc.
K-Means Clustering with SciPy
The K-Means algorithm works by first randomly assigning k centroids to the dataset, and then iteratively reassigning data points to the closest centroid and updating the centroid based on the new cluster. This process is repeated until the clusters converge or a maximum number of iterations is reached. The SciPy library provides an implementation of the k-means algorithm in the scipy.cluster.vq module.
This example uses a dataset named kmeans_dataset.csv.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq

df = pd.read_csv("kmeans_dataset.csv")
X = df.values

# number of clusters
k = 4

# compute k-means clustering
centroids, _ = kmeans(X, k)

# assign each data point to a cluster
clusters, _ = vq(X, centroids)

# plot the data points coloured by cluster assignment
colors = ['r', 'g', 'b', 'y']
for i in range(k):
    # select only the observations with cluster label == i
    ds = X[np.where(clusters == i)]
    plt.scatter(ds[:, 0], ds[:, 1], c=colors[i])
    # plot the centroid of cluster i
    plt.scatter(centroids[i, 0], centroids[i, 1], marker='x', s=200, c='black')
plt.show()
Output
The above code will group the data points into 4 clusters and will plot the data points with different colours according to their cluster assignment. The cluster centroids are represented by 'x' markers.
You can adjust the number of clusters k to suit your data and problem.
This example reads the dataset linked above, clusters it with the k-means algorithm, and visualizes the results.
Keep in mind that the k-means algorithm is sensitive to initial conditions, so the results may vary if you run it multiple times with different initial centroids.
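To make runs reproducible, the kmeans function accepts a seed argument (SciPy ≥ 1.7), and scipy.cluster.vq.whiten is commonly used to rescale features beforehand. The following is a minimal sketch on synthetic data; the two-blob dataset is invented purely for illustration:

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# synthetic data: two well-separated 2-D blobs (illustrative only)
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (50, 2)),
                 rng.normal(6, 1, (50, 2))])

# whiten rescales each feature to unit variance, so no single
# feature dominates the Euclidean distances used by k-means
features = whiten(pts)

# fixing the seed makes the random centroid initialization repeatable
centroids, distortion = kmeans(features, 2, seed=42)
labels, _ = vq(features, centroids)
```

Running kmeans with a fixed seed returns the same centroids every time, so the cluster assignments no longer change from run to run.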
Hierarchical Clustering with SciPy
Hierarchical clustering is a method of clustering that creates a hierarchy of clusters, where each cluster is a subset of the previous one. The hierarchy is represented as a tree-like structure called a dendrogram. It is a powerful method for exploring and visualizing the structure of large datasets. Still, it can be computationally expensive for large datasets and sensitive to the linkage method used.
Example
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree

# sample data points
data = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]

# create the linkage matrix
Z = linkage(data, method='ward')

# cut the dendrogram at a height threshold to obtain flat clusters
clusters = cut_tree(Z, height=2)

# plot the dendrogram with a label for each data point
plt.figure()
dendrogram(Z, labels=["data1", "data2", "data3", "data4", "data5", "data6"])
plt.show()
Output
The above code groups the data points using the linkage method 'ward', which at each step merges the pair of clusters that gives the smallest increase in total within-cluster variance. The dendrogram function plots the dendrogram, a visualization of the hierarchical clustering solution. The cut_tree function extracts flat clusters from the linkage matrix at a given height; its output is an array of cluster labels, one per data point. The dendrogram's appearance, such as the colour and size of the lines and the labels, can be customized using the matplotlib library.
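Besides cut_tree, scipy.cluster.hierarchy.fcluster is another common way to extract flat clusters from a linkage matrix. A small sketch reusing the six sample points above, where criterion='maxclust' requests a fixed number of clusters:

```python
from scipy.cluster.hierarchy import linkage, fcluster

# the same six sample points as above
data = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]

Z = linkage(data, method='ward')

# request exactly two flat clusters from the hierarchy
labels = fcluster(Z, t=2, criterion='maxclust')
# labels holds one cluster id per data point; here the three x=1
# points fall in one cluster and the three x=4 points in the other
```

fcluster also supports criterion='distance', which cuts the tree at a height threshold much like cut_tree.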
Conclusion
SciPy does not cover every type of clustering, but it is efficient for both k-means and hierarchical clustering. SciPy's k-means implementation is a simple and efficient way to partition a dataset into a fixed number of clusters, while its hierarchical clustering builds a hierarchy of clusters, where each cluster is a subset of the one above it. Widely used algorithms like DBSCAN, however, are not available in SciPy.
Hence, if you are looking for a wider range of clustering algorithms, with built-in support for preprocessing, evaluation, and more flexibility, scikit-learn is the better choice.
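For example, density-based clustering is available directly in scikit-learn. A hedged sketch on invented two-blob data (requires scikit-learn to be installed):

```python
# requires scikit-learn; the synthetic blobs here are purely illustrative
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)),   # dense blob around (0, 0)
               rng.normal(5, 0.3, (40, 2))])  # dense blob around (5, 5)

# eps: neighbourhood radius; min_samples: points needed for a core point
labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(X)
# DBSCAN labels noise points -1; here each dense blob becomes a cluster
```

Unlike k-means, DBSCAN needs no cluster count up front and can mark outliers as noise, which is one reason to reach for scikit-learn when SciPy's two methods are not enough.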