How efficient is the k-medoids algorithm on large data sets?

A classic k-medoids partitioning algorithm such as PAM works efficiently for small data sets but does not scale well to huge data sets. To deal with larger data sets, a sampling-based method known as CLARA (Clustering Large Applications) can be used.

The approach behind CLARA is as follows: if a sample is drawn in a sufficiently random manner, it should closely represent the original data set, and the representative objects (medoids) chosen from it will be similar to those that would have been selected from the entire data set. CLARA draws several samples of the data set, applies PAM to each sample, and returns the best clustering as the output.
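The sampling idea above can be sketched in Python. This is a minimal illustration, not a library implementation: the names `total_cost`, `pam`, and `clara` are assumptions, PAM is reduced to a greedy exhaustive-swap loop, points are tuples, and the distance function is supplied by the caller.

```python
import random

def total_cost(medoids, data, dist):
    """Total dissimilarity: each object to its nearest medoid."""
    return sum(min(dist(x, m) for m in medoids) for x in data)

def pam(data, k, dist, max_iter=100):
    """A minimal PAM: repeatedly apply the best medoid/non-medoid swap
    until no swap lowers the total cost."""
    medoids = random.sample(data, k)
    for _ in range(max_iter):
        best = total_cost(medoids, data, dist)
        best_swap = None
        for i in range(k):                      # examine every neighbor node
            for x in data:
                if x in medoids:
                    continue
                cand = medoids[:i] + [x] + medoids[i + 1:]
                c = total_cost(cand, data, dist)
                if c < best:
                    best, best_swap = c, cand
        if best_swap is None:                   # local minimum reached
            break
        medoids = best_swap
    return medoids

def clara(data, k, dist, n_samples=5, sample_size=40):
    """CLARA: run PAM on several random samples and keep the medoid set
    whose cost over the FULL data set is lowest."""
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(data, min(sample_size, len(data)))
        medoids = pam(sample, k, dist)
        cost = total_cost(medoids, data, dist)  # evaluate on all objects
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids
```

Note that each candidate medoid set is scored against the whole data set, not just the sample it was built from; that is what lets CLARA pick the sample whose medoids generalize best.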

The effectiveness of CLARA depends on the sample size. Note that PAM searches for the best k medoids among all objects in a given data set, whereas CLARA searches for the best k medoids only among the selected samples of the data set. A k-medoids-type algorithm known as CLARANS (Clustering Large Applications based upon RANdomized Search) was later proposed; it combines the sampling technique with PAM. Whereas CLARA works with a fixed sample at every stage of the search, CLARANS draws a sample with some randomness in each step of the search.
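The randomized search can be sketched as follows, again as an illustrative simplification rather than the published algorithm: the function name `clarans` and the parameter names `numlocal` (number of restarts) and `maxneighbor` (random neighbors examined before declaring a local minimum) are assumptions, and points are tuples with a caller-supplied distance.

```python
import random

def clarans(data, k, dist, numlocal=2, maxneighbor=50):
    """CLARANS: randomized search over the graph of medoid sets.
    A neighbor differs from the current node in exactly one medoid."""
    def cost(medoids):
        return sum(min(dist(x, m) for m in medoids) for x in data)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = random.sample(data, k)      # random start node
        current_cost = cost(current)
        tries = 0
        while tries < maxneighbor:
            i = random.randrange(k)           # medoid to swap out
            x = random.choice(data)           # random replacement object
            if x in current:
                tries += 1
                continue
            neighbor = current[:i] + [x] + current[i + 1:]
            c = cost(neighbor)
            if c < current_cost:              # move to the better neighbor
                current, current_cost = neighbor, c
                tries = 0                     # restart the neighbor count
            else:
                tries += 1
        if current_cost < best_cost:          # keep the best local minimum
            best, best_cost = current, current_cost
    return best, best_cost
```

Unlike PAM, which examines every neighbor of the current node, this search samples neighbors at random and moves as soon as any improvement is found.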

The clustering process can be viewed as a search through a graph, where each node is a potential solution (a set of k medoids). Two nodes are neighbors (that is, connected by an arc in the graph) if their sets differ by only one object. Each node can be assigned a cost, defined as the total dissimilarity between every object and the medoid of its cluster.
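These two definitions are small enough to state directly in code; the helper names `node_cost` and `are_neighbors` are illustrative, not from any library.

```python
def node_cost(medoids, data, dist):
    """Cost of a node: total dissimilarity of each object to the
    medoid of its cluster (i.e., its nearest medoid)."""
    return sum(min(dist(x, m) for m in medoids) for x in data)

def are_neighbors(node_a, node_b):
    """Two nodes are neighbors iff their medoid sets differ by one object,
    i.e., the symmetric difference contains exactly two objects."""
    return len(set(node_a) ^ set(node_b)) == 2
```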

At each step, PAM examines all of the neighbors of the current node in its search for a minimum-cost solution. The current node is then replaced by the neighbor with the largest descent in cost. Because CLARA works on a sample of the entire data set, it examines fewer neighbors and restricts the search to subgraphs that are smaller than the original graph.

CLARANS has been experimentally shown to be more effective than both PAM and CLARA. It can also be used to find the most “natural” number of clusters using a silhouette coefficient, a property of an object that specifies how well the object belongs to its cluster. CLARANS also enables the detection of outliers.
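The silhouette coefficient of a single object can be sketched as below, assuming the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the object's own cluster and b is the smallest mean distance to any other cluster; the function name and the list-of-clusters calling convention are illustrative.

```python
def silhouette(point, own_cluster, other_clusters, dist):
    """Silhouette of one object: (b - a) / max(a, b). Values near 1 mean
    the object fits its cluster well; values near -1 suggest it belongs
    elsewhere (a potential outlier or misassignment)."""
    others = [p for p in own_cluster if p != point]
    if not others:                    # singleton cluster: defined as 0
        return 0.0
    a = sum(dist(point, p) for p in others) / len(others)
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)
```

Averaging this value over all objects for several candidate values of k, and picking the k with the highest average, is one way to choose a “natural” number of clusters.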

The computational complexity of CLARANS is O(n²), where n is the number of objects. Moreover, its clustering quality depends on the sampling method used. The ability of CLARANS to handle data objects that reside on disk can be further improved by exploiting spatial data structures, such as R*-trees.