What are the clustering methods for spatial data mining?

Cluster analysis is a branch of statistics that has been studied widely for many years. The benefit of using this technique is that interesting structures or clusters can be discovered directly from the data without relying on any background knowledge, such as a concept hierarchy.

Clustering algorithms used in statistics, like PAM or CLARA, are reported to be inefficient from the computational complexity point of view. To address this efficiency concern, a newer algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) was developed for cluster analysis.

PAM (Partitioning Around Medoids) − Given n objects, PAM finds k clusters by first finding a representative object for each cluster. Such a representative, which is the most centrally located point in a cluster, is known as a medoid.

After choosing k initial medoids, the algorithm repeatedly tries to find a better choice of medoids by analyzing all feasible pairs of objects in which one object is a medoid and the other is not. The measure of clustering quality is calculated for each such combination.

The best choice of points in one iteration is selected as the medoids for the following iteration. The cost of a single iteration is O(k(n−k)²). PAM is therefore computationally quite inefficient for large values of n and k.
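The swap search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `total_cost` and `pam` are hypothetical, the initial medoids are simply the first k points, Euclidean distance is assumed, and the cost of every candidate is recomputed from scratch, so the sketch is slower than the O(k(n−k)²) bound quoted above.

```python
import itertools
import math

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Naive PAM-style search: apply the best (medoid, non-medoid) swap until none helps."""
    medoids = list(range(k))  # assumption: arbitrary start, first k points
    best = total_cost(points, medoids)
    while True:
        best_swap = None
        # Examine every pair (medoid position i, non-medoid index h).
        for i, h in itertools.product(range(k), range(len(points))):
            if h in medoids:
                continue
            candidate = medoids[:i] + [h] + medoids[i + 1:]
            cost = total_cost(points, candidate)
            if cost < best:
                best, best_swap = cost, candidate
        if best_swap is None:  # no swap improves the clustering: local optimum
            return sorted(medoids), best
        medoids = best_swap  # best choice becomes the medoid set for the next iteration
```

For the six points (0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10) and k = 2, this sketch returns the medoid indices [0, 3] with a total cost of 4.0, that is, the corner point of each group.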

CLARA (Clustering Large Applications) − The difference between the PAM and CLARA algorithms is that the latter is based upon sampling. Only a small portion of the actual data is chosen as a representative of the data, and medoids are chosen from this sample using PAM.

The idea is that if the sample is selected in a fairly random manner, then it closely represents the whole dataset, and therefore the representative objects (medoids) chosen will be similar to those that would have been chosen from the whole dataset.

CLARA draws several samples and outputs the best clustering found among these samples. CLARA can deal with larger datasets than PAM. The complexity of each iteration now becomes O(kS² + k(n−k)), where S is the size of the sample.
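The sampling idea can be illustrated with a short Python sketch. It is only an approximation of CLARA under several assumptions: the names `cost`, `pam_on_sample`, and `clara` are hypothetical, the parameters `sample_size`, `num_samples`, and `seed` are illustrative, distances are Euclidean, and the per-sample step is a simplified PAM-style swap search rather than a full PAM implementation.

```python
import math
import random

def cost(points, medoids):
    """Sum of distances from every point to its nearest medoid (medoids are points)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam_on_sample(sample, k):
    """Simplified PAM-style swap search restricted to the sample."""
    medoids = sample[:k]  # assumption: arbitrary start, first k sampled points
    best = cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in sample:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                c = cost(sample, candidate)
                if c < best:  # accept any swap that lowers the cost on the sample
                    medoids, best, improved = candidate, c, True
    return medoids

def clara(points, k, sample_size, num_samples=5, seed=0):
    """Draw several samples, cluster each, keep the medoids that score best on all points."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids = pam_on_sample(sample, k)
        c = cost(points, medoids)  # quality is measured on the full dataset
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost
```

Note that the medoids are searched for only within each sample, while the final comparison between samples uses the whole dataset. This matches the two terms in the complexity above: work proportional to the sample size for the search, plus one pass over the remaining objects for the evaluation.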

CLARANS (Clustering Large Applications based upon RANdomized Search) − The CLARANS algorithm combines ideas from both PAM and CLARA: it searches only a subset of the dataset, but it does not confine itself to any one sample at any given time. While CLARA uses a fixed sample at each phase of the search, CLARANS draws a sample with some randomness in every phase of the search.

The clustering process can be presented as searching a graph where each node is a possible solution, i.e., a set of k medoids. The clustering obtained after replacing a single medoid is called a neighbor of the current clustering.
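The graph search can be sketched as a randomized local search in Python. This is a hedged illustration, not a definitive implementation: the function names are hypothetical, Euclidean distance is assumed, and the two parameters `num_local` (number of random restarts) and `max_neighbor` (number of random neighbors tried before a node is declared a local optimum) are not described in the text above; they follow the parameter names commonly used for CLARANS.

```python
import math
import random

def cost(points, medoids):
    """Sum of distances from every point to its nearest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, num_local=2, max_neighbor=20, seed=0):
    """Randomized search over the graph whose nodes are sets of k medoids."""
    rng = random.Random(seed)
    n = len(points)
    best_node, best_cost = None, math.inf
    for _ in range(num_local):
        current = rng.sample(range(n), k)  # random starting node
        current_cost = cost(points, current)
        tries = 0
        while tries < max_neighbor:
            # A neighbor differs from the current node in exactly one medoid.
            i = rng.randrange(k)
            h = rng.choice([x for x in range(n) if x not in current])
            neighbor = current[:i] + [h] + current[i + 1:]
            neighbor_cost = cost(points, neighbor)
            if neighbor_cost < current_cost:
                # Move to the better neighbor and restart the counter.
                current, current_cost, tries = neighbor, neighbor_cost, 0
            else:
                tries += 1
        if current_cost < best_cost:  # keep the best local optimum seen so far
            best_node, best_cost = sorted(current), current_cost
    return best_node, best_cost
```

Unlike the PAM sketch, only a random subset of neighbors is examined at each node, and unlike CLARA, the candidates are not restricted to one fixed sample: a fresh random neighbor is drawn from the whole dataset at every step.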