What are the additional issues of the K-Means algorithm in data mining?


The basic K-means algorithm has several additional issues, which are as follows −

Handling Empty Clusters − One issue with the basic K-means algorithm is that empty clusters can arise if no points are assigned to a cluster during the assignment step. If this occurs, a strategy is needed to choose a replacement centroid, since otherwise the squared error will be larger than necessary.

One approach is to choose the point that is farthest from any current centroid. This eliminates the point that currently contributes the most to the total squared error. Another approach is to choose the replacement centroid from the cluster that has the largest SSE (sum of squared errors). This will typically split that cluster and reduce the overall SSE of the clustering. If there are several empty clusters, this process can be repeated several times.
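
The first replacement strategy can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not a reference implementation: the helper names (`sq_dist`, `assign`, `replace_empty_centroids`) and the sample data are made up for this example, "farthest" is read as the largest squared distance between a point and its own assigned centroid, and points are only taken from clusters with more than one member so that the fix does not create a new empty cluster.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def replace_empty_centroids(points, centroids, labels):
    """Give each empty cluster a new centroid: the point that currently
    contributes the most to the total squared error (illustrative helper)."""
    centroids, labels = list(centroids), list(labels)
    sizes = Counter(labels)
    for j in range(len(centroids)):
        if sizes[j] > 0:
            continue  # cluster j already has at least one point
        # Assumption: only move points out of clusters that keep >= 1 member.
        candidates = [i for i in range(len(points)) if sizes[labels[i]] > 1]
        if not candidates:
            break  # not enough points to fill every cluster
        worst = max(candidates,
                    key=lambda i: sq_dist(points[i], centroids[labels[i]]))
        sizes[labels[worst]] -= 1
        centroids[j] = points[worst]
        labels[worst] = j  # its error is now zero, so it is not picked again
        sizes[j] = 1
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
centroids = [(2.0, 2.0), (100.0, 100.0)]  # the second centroid attracts no points
labels = assign(points, centroids)        # [0, 0, 0, 0]
new_centroids, new_labels = replace_empty_centroids(points, centroids, labels)
print(new_centroids)  # [(2.0, 2.0), (10.0, 10.0)]
print(new_labels)     # [0, 0, 0, 1]
```

After such a replacement, the usual assignment and centroid update steps would simply continue from the repaired centroids.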

Outliers − When the squared error criterion is used, outliers can unduly influence the clusters that are found. In particular, when outliers are present, the resulting cluster centroids (prototypes) may not be as representative as they otherwise would be, and thus the SSE will be higher as well.

Because of this, it is often useful to find outliers and remove them beforehand. It is important to appreciate, however, that there are clustering applications for which outliers should not be removed. When clustering is used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable users, can be the most interesting points.
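
The effect of a single outlier on a centroid and on the SSE can be illustrated with a small, made-up data set. The sketch below is only an illustration: the function names (`centroid`, `sse`) and the points are hypothetical, and the cluster is taken as given rather than produced by running K-means.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
outlier = (20.0, 20.0)

print(centroid(cluster), sse(cluster))  # (1.5, 1.5) 2.0
# With the outlier the centroid moves to about (5.2, 5.2)
# and the SSE grows to about 549.6.
print(centroid(cluster + [outlier]), sse(cluster + [outlier]))
```

The centroid computed with the outlier lies outside the tight group of four points, which is what "less representative" means here.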

Reducing the SSE with Postprocessing − An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K. In many cases, however, the goal is to improve the SSE without increasing the number of clusters. This is often possible because K-means typically converges to a local minimum.

Various techniques are used to "fix up" the resulting clusters in order to produce a clustering that has lower SSE. The strategy is to focus on individual clusters, since the total SSE is simply the sum of the SSE contributed by each cluster. The total SSE can be changed by performing various operations on the clusters, such as splitting or merging clusters.
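
Because the total SSE decomposes by cluster, it is easy to see which cluster to target. The following sketch computes the per-cluster contributions; the function name `per_cluster_sse` and the sample points, labels, and centroids are assumptions made for this illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def per_cluster_sse(points, labels, centroids):
    """Return a list whose j-th entry is the SSE contributed by cluster j."""
    totals = [0.0] * len(centroids)
    for p, j in zip(points, labels):
        totals[j] += sq_dist(p, centroids[j])
    return totals


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 1.0), (10.0, 5.0)]
labels = [0, 0, 1, 1, 1]
centroids = [(0.0, 1.0), (10.0, 2.0)]

contrib = per_cluster_sse(points, labels, centroids)
print(contrib)        # [2.0, 14.0]
print(sum(contrib))   # 16.0, the total SSE
# Index of the cluster that contributes the most SSE.
print(max(range(len(contrib)), key=contrib.__getitem__))  # 1
```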

One commonly used approach is to alternate cluster splitting and merging phases. During a splitting phase, clusters are divided, while during a merging phase, clusters are combined. In this way, it is often possible to escape local SSE minima and still produce a clustering solution with the desired number of clusters. The splitting phase increases the number of clusters, and the merging phase reduces it again.
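
One split followed by one merge can be sketched as below. The specific choices are assumptions for illustration rather than the only options: the cluster with the largest SSE is split around two far-apart seed points, and the two clusters with the closest centroids are merged. The function names (`split_worst`, `merge_closest`, `total_sse`) and the data are hypothetical.

```python
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def sse(points):
    """SSE of one cluster with respect to its own centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def total_sse(clusters):
    return sum(sse(c) for c in clusters)


def split_worst(clusters):
    """Split the cluster with the largest SSE into two parts (K becomes K + 1)."""
    worst = max(range(len(clusters)), key=lambda k: sse(clusters[k]))
    pts = clusters[worst]
    c = centroid(pts)
    # Assumed heuristic: two seeds that lie far apart inside the cluster.
    seed_a = max(pts, key=lambda p: sq_dist(p, c))
    seed_b = max(pts, key=lambda p: sq_dist(p, seed_a))
    part_a = [p for p in pts if sq_dist(p, seed_a) <= sq_dist(p, seed_b)]
    part_b = [p for p in pts if sq_dist(p, seed_a) > sq_dist(p, seed_b)]
    if not part_b:
        return list(clusters)  # all points coincide, nothing to split
    return clusters[:worst] + [part_a, part_b] + clusters[worst + 1:]


def merge_closest(clusters):
    """Merge the two clusters with the closest centroids (needs >= 2 clusters)."""
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: sq_dist(centroid(clusters[ij[0]]),
                                      centroid(clusters[ij[1]])))
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


clusters = [
    [(0.0, 0.0), (0.0, 1.0), (8.0, 0.0), (8.0, 1.0)],  # two groups lumped together
    [(20.0, 0.0), (20.0, 1.0)],                        # one group cut in two ...
    [(21.0, 0.0), (21.0, 1.0)],                        # ... by a poor local minimum
]
print(total_sse(clusters))                  # 66.0
after = merge_closest(split_worst(clusters))
print(len(after), total_sse(after))         # 3 3.0
```

The number of clusters is the same before and after, but the SSE is lower. In practice the resulting clusters would typically be refined further by running the K-means iterations again.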

Updated on: 14-Feb-2022
