What is the scipy.cluster.vq.kmeans() method?

The scipy.cluster.vq.kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True) method forms k clusters by running the k-means algorithm on a set of observation vectors. To decide when the centroids are stable, it compares the change in the average Euclidean distance between the observations and their nearest centroids against a threshold value. The output of this method is a codebook that maps each code to a centroid and vice versa.
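The stopping rule described above can be illustrated with a small NumPy sketch. This is not SciPy's actual implementation; the function name lloyd_sketch, the max_steps parameter and the toy data are assumptions made only for illustration.

```python
import numpy as np

def lloyd_sketch(obs, centroids, thresh=1e-5, max_steps=100):
    """Illustrative k-means loop (hypothetical helper, not SciPy's code)."""
    prev_avg = np.inf
    for _ in range(max_steps):
        # Distance from every observation to every centroid, shape (M, k).
        dists = np.linalg.norm(obs[:, None, :] - centroids[None, :, :], axis=2)
        codes = dists.argmin(axis=1)  # index of the nearest centroid
        avg = dists[np.arange(len(obs)), codes].mean()
        # Stop once the average distance changes by at most thresh.
        if abs(prev_avg - avg) <= thresh:
            break
        prev_avg = avg
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = np.array([
            obs[codes == j].mean(axis=0) if np.any(codes == j) else centroids[j]
            for j in range(len(centroids))
        ])
    return centroids, avg

# Toy data: two obvious groups of two points each.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, avg_dist = lloyd_sketch(toy, toy[[0, 2]])
print(centers)
print(avg_dist)
```

The loop stops on the change in average distance rather than on a fixed number of steps, which is the role the thresh parameter plays in kmeans().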

Its parameters are explained in detail below−


  • obs− ndarray

    It is an ‘M’ by ‘N’ array where each row is an observation and each column is a feature seen during that observation. Before clustering, these features must be whitened with the whiten() function.

  • k_or_guess− int or ndarray

    It is the number of centroids to be generated. Each centroid is given a code, which is also the row index of the centroid in the code_book matrix. If an int is given, the k initial centroids are selected randomly from the observation matrix. If a k by N ndarray is given instead, its rows are used as the initial k centroids.

  • iter− int, optional

    This parameter represents the number of times to run k-means, so that the codebook with the lowest distortion is returned. It does not represent the number of iterations of the k-means algorithm itself. If initial centroids are specified with an array for the k_or_guess parameter, this parameter is ignored.

  • thresh− float, optional

    This parameter represents the threshold value. The algorithm terminates when the change in distortion since the last iteration is less than or equal to this threshold. The default value of this parameter is 1e-05.

  • check_finite− bool, optional

    This parameter is used to check whether the input matrices contain only finite numbers. Disabling it may give a performance gain, but it may also cause problems such as crashes or non-termination if the observations do contain infinities or NaNs. The default value of this parameter is True.
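The parameters above can be tried out with a short example. It is a minimal sketch on synthetic data: the two blobs, the random seed and the variable names are assumptions chosen for illustration, while whiten() and kmeans() are the SciPy functions described in this section.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Synthetic data (assumption): two well-separated 2-D blobs, 50 rows each.
rng = np.random.default_rng(0)
obs = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Scale every feature to unit variance before clustering.
whitened = whiten(obs)

# k_or_guess as an int: k-means is run iter times from random starts
# and the codebook with the lowest distortion is kept.
codebook, distortion = kmeans(whitened, 2, iter=20, thresh=1e-05)

# k_or_guess as an ndarray: the rows are the initial centroids,
# so iter is ignored. Here one observation is taken from each blob.
guess = whitened[[0, 50]]
codebook_g, distortion_g = kmeans(whitened, guess)

# check_finite=False skips input validation; only use it on data
# that is already known to be finite.
if np.isfinite(whitened).all():
    codebook_f, distortion_f = kmeans(whitened, guess, check_finite=False)

print(codebook.shape, codebook_g.shape)
```

Passing an explicit guess makes the result repeatable, whereas an int starts from randomly chosen observations.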


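The method returns the following values−

  • codebook− ndarray

    It is a k by N array of k centroids, where the ith centroid, codebook[i], is represented with the code i. This codebook gives the lowest distortion seen, which is not necessarily the globally minimal distortion. The number of centroids may be smaller than k_or_guess, because centroids assigned to no observations are removed during the iterations.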

  • distortion− float

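    This is the mean (non-squared) Euclidean distance between the observations passed and the centroids generated. Note that this differs from the standard definition of distortion in k-means, which is the sum of the squared distances.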
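The two return values can be inspected as follows. This is a minimal sketch on synthetic data (the blobs and variable names are assumptions). It uses scipy.cluster.vq.vq(), which assigns each observation to its nearest centroid, to show how codes relate to rows of the codebook and how the distortion relates to the per-observation distances.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Synthetic data (assumption): two well-separated 2-D blobs.
rng = np.random.default_rng(1)
obs = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
whitened = whiten(obs)

codebook, distortion = kmeans(whitened, 2)

# vq() returns, for every observation, the code of the nearest centroid
# and the Euclidean distance to that centroid.
codes, dists = vq(whitened, codebook)

# Code i refers to row i of the codebook.
nearest_centroids = codebook[codes]

print("codebook shape:", codebook.shape)
print("distortion:", distortion)
# The mean of the per-observation distances should be close to the distortion.
print("mean distance:", dists.mean())
```

Because the distortion is a mean of plain distances, it is expressed in the same units as the whitened features.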