# How efficient is the k-medoids algorithm on large data sets?

A classic k-medoids partitioning algorithm such as PAM (Partitioning Around Medoids) works efficiently for small data sets but does not scale well to large ones. To deal with larger data sets, a sampling-based method known as CLARA (Clustering LARge Applications) can be used.

The idea behind CLARA is as follows: if the sample is chosen in a fairly random manner, it should closely represent the original data set. The representative objects (medoids) chosen from the sample will then be similar to those that would have been selected from the entire data set. CLARA draws several samples of the data set, applies PAM to each sample, and returns the best clustering it finds as the output.
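The sample-then-cluster loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `total_cost`, `pam`, and `clara`, the parameters `n_samples` and `sample_size`, and the choice of Euclidean distance are all assumptions made for the example. The sketch scores each candidate set of medoids against the whole data set, so samples can be compared on equal terms.

```python
import math
import random

def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist, rng):
    """Small PAM-style search: apply the best improving swap until none is left."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids, dist)
    while True:
        best_swap = None
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                # Replace the i-th medoid with the non-medoid p.
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:
                    best, best_swap = cost, candidate
        if best_swap is None:
            return medoids
        medoids = best_swap

def clara(points, k, n_samples=5, sample_size=40, dist=math.dist, seed=0):
    """Run PAM on several random samples and keep the medoids that fit all points best."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, dist, rng)
        # Hypothetical design choice: score on the full data set, not the sample.
        cost = total_cost(points, medoids, dist)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```

Only the PAM calls touch the expensive swap search, and they see `sample_size` points rather than the whole data set, which is where the saving comes from.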

The effectiveness of CLARA depends on the sample size. Note that PAM searches for the best k medoids among the objects of the given data set, whereas CLARA searches for the best k medoids among the objects of the selected sample. If a good medoid is not included in the sample, CLARA cannot select it. A k-medoids type algorithm known as CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed to address this. It combines the sampling technique with PAM. While CLARA works with a fixed sample at every stage of the search, CLARANS draws a sample with some randomness in each step of the search.
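A rough sketch of the randomized search may help here. It assumes the commonly described form of CLARANS: keep a current set of k medoids, try randomly chosen single-medoid swaps, move whenever a swap lowers the cost, and give up on the current start after a fixed number of consecutive failures. The function names, the parameters `num_local` and `max_neighbor`, and the Euclidean distance are assumptions for illustration only.

```python
import math
import random

def total_cost(points, medoid_idx, dist):
    """Sum of distances from every point to its nearest medoid (given by index)."""
    return sum(min(dist(p, points[m]) for m in medoid_idx) for p in points)

def clarans(points, k, num_local=3, max_neighbor=50, dist=math.dist, seed=0):
    """Randomized swap search, restarted num_local times from random medoids."""
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < len(points)")
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(num_local):
        current = rng.sample(range(len(points)), k)
        cost = total_cost(points, current, dist)
        failures = 0
        while failures < max_neighbor:
            slot = rng.randrange(k)                  # which medoid to replace
            candidate = rng.randrange(len(points))   # random replacement object
            if candidate in current:
                continue
            neighbor = current[:slot] + [candidate] + current[slot + 1:]
            neighbor_cost = total_cost(points, neighbor, dist)
            if neighbor_cost < cost:
                # Move to the better set and reset the failure counter.
                current, cost, failures = neighbor, neighbor_cost, 0
            else:
                failures += 1
        if cost < best_cost:
            best, best_cost = [points[i] for i in current], cost
    return best, best_cost
```

Unlike the CLARA approach, no object is excluded up front: any object in the data set can be drawn as a replacement medoid at any step.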

The clustering process can be viewed as a search through a graph, where each node is a potential solution (a set of k medoids). Two nodes are neighbors (that is, connected by an arc in the graph) if their sets differ by only one object. Each node can be assigned a cost that is defined as the total dissimilarity between every object and the medoid of its cluster.

At each step, PAM examines all of the neighbors of the current node in its search for a minimum-cost solution. The current node is then replaced by the neighbor with the largest descent in cost. Because CLARA works on a sample of the whole data set, it examines fewer neighbors and restricts the search to subgraphs that are smaller than the original graph.
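The graph view can be made concrete with a tiny one-dimensional example. The sketch below is illustrative only: `node_cost`, `neighbors`, and `pam_step` are hypothetical helper names, and absolute difference stands in for the dissimilarity measure. With n objects and k medoids, each node has k(n - k) neighbors, all of which one PAM step inspects.

```python
def node_cost(points, medoids, dist):
    """Cost of a node: total dissimilarity between each object and its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def neighbors(points, medoids):
    """Yield every medoid set that differs from `medoids` in exactly one object."""
    current = frozenset(medoids)
    for m in current:
        for p in points:
            if p not in current:
                yield (current - {m}) | {p}

def pam_step(points, medoids, dist):
    """Return the cheapest neighbor if it improves on the current node, else None."""
    best = min(neighbors(points, medoids), key=lambda n: node_cost(points, n, dist))
    if node_cost(points, best, dist) < node_cost(points, medoids, dist):
        return best
    return None

points = [1, 2, 3, 10, 11, 12]
dist = lambda a, b: abs(a - b)

node = frozenset({1, 2})            # deliberately poor starting node, cost 28
while True:
    nxt = pam_step(points, node, dist)
    if nxt is None:                 # no neighbor is cheaper: local minimum
        break
    node = nxt

print(sorted(node), node_cost(points, node, dist))  # [2, 11] 4
```

Here the start node has 2 × (6 − 2) = 8 neighbors, and a single step to {2, 11} already reaches a node that none of its own neighbors can improve.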

CLARANS has been experimentally shown to be more efficient than both PAM and CLARA. It can be used to find the most “natural” number of clusters using a silhouette coefficient, a property of an object that specifies how much the object truly belongs to its cluster. CLARANS also enables the detection of outliers.
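For reference, the silhouette of an object is usually defined as (b − a) / max(a, b), where a is its mean dissimilarity to the other members of its own cluster and b is the smallest mean dissimilarity to the members of any other cluster. The sketch below computes the mean silhouette over all objects. The function name `mean_silhouette` and the convention of scoring singleton clusters as 0 are assumptions of this example, and it expects at least two clusters.

```python
def mean_silhouette(points, labels, dist):
    """Mean silhouette over all objects; requires at least two distinct labels."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)      # assumed convention for singleton clusters
            continue
        # dist(p, p) is 0, so summing over the whole cluster and dividing
        # by len(own) - 1 gives the mean distance to the *other* members.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [1, 2, 3, 10, 11, 12]
dist = lambda a, b: abs(a - b)

good = mean_silhouette(points, [0, 0, 0, 1, 1, 1], dist)
bad = mean_silhouette(points, [0, 0, 1, 1, 1, 1], dist)   # object 3 misplaced
print(good > bad)  # True
```

One way to use this, under the same assumptions, is to cluster the data for several values of k and keep the k whose clustering has the highest mean silhouette.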

The computational complexity of CLARANS is about O(n^2), where n is the number of objects. Furthermore, its clustering quality depends on the sampling method used. The ability of CLARANS to deal with data objects that reside on disk can be further improved by focusing techniques that explore spatial data structures, such as R*-trees.
