How to use K-Means clustering algorithm in Python Scikit-learn?


The K-Means clustering algorithm computes centroids and iterates until it finds the optimal centroids. It requires the number of clusters to be specified in advance, i.e. it assumes that the number of clusters is already known. The main logic of the algorithm is to separate the samples into K groups of equal variance by minimizing a criterion known as the inertia (the within-cluster sum of squared distances to the nearest centroid). The number of clusters identified by the algorithm is represented by ‘K’.

Scikit-learn provides the sklearn.cluster.KMeans module to perform K-Means clustering in Python.
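
Before the full example, the short sketch below shows the basic KMeans workflow on a small make_blobs dataset (the dataset and parameter values here are illustrative assumptions only, not part of the example that follows). It also prints the cluster_centers_ and inertia_ attributes, the latter being the criterion mentioned above.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with three blobs (values chosen only for illustration)
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# K (n_clusters) must be chosen in advance
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)   # cluster index assigned to each sample

print(model.cluster_centers_)   # coordinates of the learned centroids
print(model.inertia_)           # within-cluster sum of squared distances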

Example

For the example below, we will create a test binary classification dataset using the make_classification() function. The dataset will consist of 10,000 samples with two input features and one cluster per class.

# Import required libraries
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
#%matplotlib inline

# Set the figure size
pyplot.rcParams["figure.figsize"] = [7.16, 3.50]
pyplot.rcParams["figure.autolayout"] = True

# Define binary classification dataset having 10000 samples with two input features and one cluster per class
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)

# Create scatter plot for all samples from each class
for value in range(2):
   # Get row indexes for samples of this class
   row = where(y == value)
   # Create scatter plot of all the samples
   pyplot.scatter(X[row, 0], X[row, 1])

# Plot the figure
pyplot.title('Classification Dataset', size=18)
pyplot.show()

# Define the KMeans clustering model
KMeans_model = KMeans(n_clusters=2)

# Fit the model
KMeans_model.fit(X)

# Assign a cluster to each sample
yc = KMeans_model.predict(X)

# Retrieve the unique clusters from all clusters
clusters_AC = unique(yc)

# Create scatter plot for all samples from each cluster
for cluster in clusters_AC:
   # Get row indexes for all samples within this cluster
   row = where(yc == cluster)
   # Create scatter plot of all the samples
   pyplot.scatter(X[row, 0], X[row, 1])

# Plot the figure
pyplot.title('Cluster Prediction for Each Example in Dataset', size=18)
pyplot.show()

Output

It will produce the following output −



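Since the number of clusters K has to be supplied up front, a common way to choose it is to fit the model for several values of K and look at how the inertia drops (the so-called elbow method). The snippet below is only an optional sketch, assuming it is run after the example above so that X, KMeans and pyplot are already available; the range of K values is an arbitrary choice for illustration.

# Optional: compute inertia for several values of K (elbow method)
inertias = []
for k in range(1, 9):
   model = KMeans(n_clusters=k, n_init=10, random_state=4)
   model.fit(X)
   inertias.append(model.inertia_)

# Plot inertia against K; the "elbow" in the curve suggests a reasonable K
pyplot.plot(range(1, 9), inertias, marker='o')
pyplot.xlabel('Number of clusters (K)')
pyplot.ylabel('Inertia')
pyplot.title('Elbow Method', size=18)
pyplot.show()
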
Mini-Batch K-Means Clustering Algorithm

The Mini-Batch K-Means clustering algorithm is a modified version of K-Means. As its name suggests, rather than using the entire dataset at every iteration, it updates the cluster centroids using small random batches (mini-batches) of samples. Because of this, Mini-Batch K-Means converges faster than standard K-Means, usually at the cost of a slight reduction in cluster quality.

Scikit-learn provides the sklearn.cluster.MiniBatchKMeans module to perform Mini-Batch K-Means clustering in Python.
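
MiniBatchKMeans is used in the same way as KMeans; the extra batch_size parameter controls how many samples are drawn for each centroid update. The short sketch below is only illustrative, and the dataset and batch_size value are arbitrary assumptions, not recommendations.

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Toy dataset (values chosen only for illustration)
X, _ = make_blobs(n_samples=10000, centers=2, random_state=0)

# batch_size sets how many samples are used per centroid update
model = MiniBatchKMeans(n_clusters=2, batch_size=256, random_state=0)
labels = model.fit_predict(X)

print(model.cluster_centers_)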

Example

For the example below, we will again create a test binary classification dataset using the make_classification() function. The dataset will consist of 10,000 samples with two input features and one cluster per class.

# Import required libraries
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot
#%matplotlib inline

# Set the figure size
pyplot.rcParams["figure.figsize"] = [7.16, 3.50]
pyplot.rcParams["figure.autolayout"] = True

# Define binary classification dataset having 10000 samples with two input features and one cluster per class
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)

# Create scatter plot for all samples from each class
for value in range(2):
   # Get row indexes for samples of this class
   row = where(y == value)
   # Create scatter plot of all the samples
   pyplot.scatter(X[row, 0], X[row, 1])

# Plot the figure
pyplot.title('Classification Dataset', size=18)
pyplot.show()

# Define the Mini-Batch KMeans clustering model
MBKMeans_model = MiniBatchKMeans(n_clusters=2)

# Fit the model
MBKMeans_model.fit(X)

# Assign a cluster to each sample
yc = MBKMeans_model.predict(X)

# Retrieve the unique clusters from all clusters
clusters_AC = unique(yc)

# Create scatter plot for all samples from each cluster
for cluster in clusters_AC:
   # Get row indexes for all samples within this cluster
   row = where(yc == cluster)
   # Create scatter plot of all the samples
   pyplot.scatter(X[row, 0], X[row, 1])

# Plot the figure
pyplot.title('Cluster Prediction for Each Example in Dataset', size=18)
pyplot.show()

Output

It will produce the following output −



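To see the speed difference between the two algorithms on your own machine, a rough timing comparison like the one below can be appended to the examples above (it assumes the X array from the example is already defined; the measured times will vary from run to run, so treat this only as an illustrative sketch).

import time
from sklearn.cluster import KMeans, MiniBatchKMeans

# Rough comparison of fit time and final inertia on the same data
for Model in (KMeans, MiniBatchKMeans):
   model = Model(n_clusters=2, random_state=4)
   start = time.time()
   model.fit(X)
   print(Model.__name__, 'fit time:', round(time.time() - start, 4),
      'inertia:', round(model.inertia_, 2))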