How does the k-means algorithm work?


The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.

The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is then assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean.
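The assignment step described above can be sketched in a few lines of Python. This is only an illustration: the helper names `nearest_center` and `assign`, the sample points, and the use of Euclidean distance are assumptions made for the example, not part of the original description.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance).

    Ties go to the lowest index, which is an arbitrary choice for this sketch.
    """
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

def assign(points, centers):
    """Assign every point to the index of its nearest center."""
    return [nearest_center(p, centers) for p in points]

# Hypothetical data: two of the four objects are used as the initial centers.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.0), (8.0, 8.0)]
labels = assign(points, centers)
print(labels)
```

Each entry of `labels` is the index of the cluster that the corresponding object was assigned to.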

The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as −

$$E=\sum_{i=1}^{k}\sum_{p\in C_{i}}|p-m_{i}|^{2}$$

where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the squared distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.
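To make the criterion concrete, the following sketch computes E for a small, made-up clustering. The function name `square_error` and the sample clusters are assumptions for illustration, and the code assumes every cluster contains at least one object.

```python
def square_error(clusters):
    """Sum of squared distances from each point to the mean of its cluster.

    `clusters` is a list of non-empty lists of equal-length numeric tuples.
    """
    total = 0.0
    for cluster in clusters:
        dim = len(cluster[0])
        # m_i: the coordinate-wise mean of the objects in this cluster
        mean = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
        # add |p - m_i|^2 for every object p in the cluster
        total += sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in cluster)
    return total

# Hypothetical clustering with two clusters of two objects each.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(8.0, 8.0), (8.0, 10.0)]]
E = square_error(clusters)
print(E)  # 4.0: each object lies at distance 1 from its cluster mean
```

A lower value of E means the objects sit closer to their cluster means, that is, the clusters are more compact.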

Algorithm: k-means − The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.

Input −

k: the number of clusters,
D: a data set containing n objects.

Output −

A set of k clusters.

Method −

  • arbitrarily choose k objects from D as the initial cluster centers;

  • repeat

  • (re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster;

  • update the cluster means, i.e., compute the mean value of the objects for each cluster;

  • until no change;
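The method above could be written in Python roughly as follows. This is a minimal sketch rather than a reference implementation: the function names, the `seed` and `max_iter` parameters, the Euclidean distance, and the rule of keeping the old center when a cluster becomes empty are all assumptions added for the example.

```python
import math
import random

def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def k_means(data, k, seed=0, max_iter=100):
    """Partition `data` (a list of tuples) into k clusters.

    Returns (labels, centers). `seed` and `max_iter` are illustrative extras.
    """
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # arbitrarily choose k objects as initial centers
    labels = None
    for _ in range(max_iter):      # repeat
        # (re)assign each object to the cluster with the nearest mean
        new_labels = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                      for p in data]
        if new_labels == labels:   # until no change
            break
        labels = new_labels
        # update the cluster means
        for i in range(k):
            members = [p for p, lab in zip(data, labels) if lab == i]
            if members:            # assumption: an empty cluster keeps its old center
                centers[i] = mean(members)
    return labels, centers

# Hypothetical data set with two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(data, k=2)
print(labels, centers)
```

Which group receives label 0 depends on the random initial centers, so only the grouping itself is meaningful. On less well-separated data, different initial centers can lead to different final clusters.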

As an example, suppose k = 3: three objects are arbitrarily chosen as the three initial cluster centers, where cluster centers are marked by a “+”. Each object is assigned to a cluster based on the cluster center to which it is nearest.

Next, the cluster centers are updated. That is, the mean value of each cluster is recalculated based on the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is nearest. Such a redistribution forms new silhouettes encircled by dashed curves.

The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation. Eventually, no redistribution of the objects in any cluster occurs, and so the process terminates. The resulting clusters are returned by the clustering process.

Updated on 16-Feb-2022 12:23:12