What is Agglomerative Hierarchical Clustering?

Agglomerative hierarchical clustering is a bottom-up clustering approach in which clusters have sub-clusters, which in turn have sub-clusters, and so on. It starts by placing every object in its own cluster and then merges these atomic clusters into larger and larger clusters until all objects are in a single cluster or until a given termination condition is satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of between-cluster similarity.

For example, a method called AGNES (Agglomerative Nesting) uses the single-link technique and works as follows. Suppose a set of objects is located in a rectangle. Initially, each object is placed in a cluster of its own. The clusters are then merged step by step according to some criterion, such as merging the two clusters with the minimum Euclidean distance between their closest objects.

Hierarchical clustering is shown graphically using a tree-like diagram known as a dendrogram, which shows both the cluster-subcluster associations and the order in which the clusters were combined (agglomerative view) or split (divisive view).
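As a rough illustration of what a dendrogram records, the sketch below stores a merge history as a nested tree and then cuts it at a height threshold. The function names `build_dendrogram` and `cut`, the `(id_a, id_b, height)` merge format, and the sample merge list are all assumptions made for this example, not part of any particular library.

```python
def build_dendrogram(n_leaves, merges):
    """Build a nested-tuple tree from a merge history (agglomerative view).

    merges is a list of (id_a, id_b, height). Leaves have ids 0..n_leaves-1,
    and the k-th merge creates a new cluster with id n_leaves + k.
    """
    nodes = {i: i for i in range(n_leaves)}  # each leaf is its own label
    next_id = n_leaves
    for a, b, height in merges:
        # An internal node keeps both children and the merge height.
        nodes[next_id] = (nodes.pop(a), nodes.pop(b), height)
        next_id += 1
    (root,) = nodes.values()  # exactly one cluster remains
    return root


def cut(tree, threshold):
    """Split the tree top-down (divisive view) at a height threshold."""
    def leaves(t):
        if not isinstance(t, tuple):
            return [t]
        return leaves(t[0]) + leaves(t[1])

    # A leaf, or a subtree merged at or below the threshold, is one cluster.
    if not isinstance(tree, tuple) or tree[2] <= threshold:
        return [leaves(tree)]
    return cut(tree[0], threshold) + cut(tree[1], threshold)


# Hypothetical merge history for four objects.
example_merges = [(0, 1, 1.0), (2, 3, 1.5), (4, 5, 5.0)]
example_tree = build_dendrogram(4, example_merges)
example_clusters = cut(example_tree, 2.0)
```

Reading the tree from the leaves upward gives the order in which clusters were combined. Cutting it from the root downward gives the flat clusters at any chosen level.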

Basic agglomerative hierarchical clustering algorithm.

  1. Compute the proximity matrix, if necessary.

  2. repeat

  3. Merge the closest two clusters.

  4. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.

  5. until only one cluster remains.
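The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes that proximity means Euclidean distance and that the single-link (MIN) rule is used for the update. The function name `agglomerative_single_link` and the shape of the returned merge history are choices made for this example.

```python
import math


def agglomerative_single_link(points):
    """Merge clusters until one remains and return the merge history.

    Each history entry is (id_a, id_b, distance, sorted member indices).
    """
    # Compute the proximity matrix (first step), keyed by cluster-id pairs
    # with the smaller id first.
    clusters = {i: [i] for i in range(len(points))}
    prox = {
        (i, j): math.dist(points[i], points[j])
        for i in clusters
        for j in clusters
        if i < j
    }
    merges = []
    next_id = len(points)

    # Repeat until only one cluster remains.
    while len(clusters) > 1:
        # Merge the closest two clusters (linear search of the matrix).
        (a, b), d = min(prox.items(), key=lambda kv: kv[1])
        members = clusters.pop(a) + clusters.pop(b)

        # Update the proximity matrix: under single link, the distance from
        # the new cluster to any other cluster c is min(d(a, c), d(b, c)).
        for c in clusters:
            d_ac = prox.pop((min(a, c), max(a, c)))
            d_bc = prox.pop((min(b, c), max(b, c)))
            prox[(c, next_id)] = min(d_ac, d_bc)
        del prox[(a, b)]

        clusters[next_id] = members
        merges.append((a, b, d, sorted(members)))
        next_id += 1
    return merges


history = agglomerative_single_link([(0, 0), (0, 1), (5, 0), (5, 1.5)])
```

For the four sample points, the two nearby pairs are merged first and the two resulting clusters are merged last. Replacing `min(d_ac, d_bc)` with `max(d_ac, d_bc)` would give the complete-link (MAX) variant.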

Cluster proximity is typically defined with a particular type of cluster in mind. For instance, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters.

MIN defines cluster proximity as the proximity between the closest two points that are in different clusters or, in graph terms, the shortest edge between two nodes in different subsets of nodes.

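Alternatively, MAX takes the proximity between the farthest two points in different clusters to be the cluster proximity or, in graph terms, the longest edge between two nodes in different subsets of nodes. Group Average defines cluster proximity as the average of the pairwise proximities of all pairs of points from different clusters.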
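To make the difference between these definitions concrete, here is a small sketch that computes the three cluster proximities for two clusters of points. It assumes proximity is measured as Euclidean distance, so smaller means closer, and it uses the usual definition of Group Average as the mean of all cross-cluster pairwise distances. The function names are illustrative only.

```python
import math
from itertools import product


def pairwise_distances(c1, c2):
    """Distances between every point in c1 and every point in c2."""
    return [math.dist(p, q) for p, q in product(c1, c2)]


def min_link(c1, c2):
    """MIN (single link): distance between the closest cross-cluster pair."""
    return min(pairwise_distances(c1, c2))


def max_link(c1, c2):
    """MAX (complete link): distance between the farthest cross-cluster pair."""
    return max(pairwise_distances(c1, c2))


def group_average(c1, c2):
    """Group Average: mean distance over all cross-cluster pairs."""
    d = pairwise_distances(c1, c2)
    return sum(d) / len(d)


cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (5, 0)]
proximities = (
    min_link(cluster_a, cluster_b),
    max_link(cluster_a, cluster_b),
    group_average(cluster_a, cluster_b),
)
```

For these two clusters the cross-cluster distances are 3, 5, 2, and 4, so the three measures give different values for the same pair of clusters.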

The basic agglomerative hierarchical clustering algorithm presented above requires a proximity matrix. This requires the storage of $\mathrm{\frac{1}{2}m^2}$ proximities (assuming the proximity matrix is symmetric), where m is the number of data points. The space needed to keep track of the clusters is proportional to the number of clusters, which is m - 1, excluding singleton clusters. Therefore, the total space complexity is $\mathrm{O(m^2)}$.
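One way to take advantage of the symmetry is to store only the entries above the diagonal in a flat list. The sketch below is a hedged illustration of that idea. The names `condensed_matrix` and `condensed_index` and the row-major layout are assumptions made for this example. The list holds exactly m(m - 1)/2 values, which is roughly half of m².

```python
import math


def condensed_matrix(points):
    """Store only the upper triangle of the symmetric distance matrix.

    Entries are ordered (0,1), (0,2), ..., (0,m-1), (1,2), ..., (m-2,m-1).
    """
    m = len(points)
    return [
        math.dist(points[i], points[j])
        for i in range(m)
        for j in range(i + 1, m)
    ]


def condensed_index(i, j, m):
    """Map a pair (i, j), i != j, to its position in the condensed list."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i  # symmetry: d(i, j) == d(j, i)
    # Row i starts after the i preceding rows, which hold
    # (m-1) + (m-2) + ... + (m-i) = i*m - i*(i+1)/2 entries in total.
    return i * m - i * (i + 1) // 2 + (j - i - 1)


sample_points = [(0, 0), (3, 4), (6, 8), (0, 1)]
distances = condensed_matrix(sample_points)
d_01 = distances[condensed_index(1, 0, len(sample_points))]
```

With four points the list holds six distances instead of sixteen, and either ordering of a pair resolves to the same stored entry.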

The analysis of the basic agglomerative hierarchical clustering algorithm is also straightforward with respect to computational complexity. $\mathrm{O(m^2)}$ time is required to compute the proximity matrix. After that step, there are m - 1 iterations involving steps 3 and 4, because there are m clusters at the start and two clusters are merged during each iteration.
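The total cost of these iterations can be estimated with a short calculation. It rests on an assumption that the text does not state: the closest pair in step 3 is found by a plain linear search of the current proximity matrix. At the start of iteration i there are m - i + 1 clusters, so step 3 examines on the order of $\mathrm{(m - i + 1)^2}$ entries, and step 4 updates on the order of m - i + 1 entries. Summing the dominant term over all iterations gives:

$$\mathrm{\sum_{i=1}^{m-1} (m - i + 1)^2 = \sum_{k=2}^{m} k^2 = \frac{m(m+1)(2m+1)}{6} - 1 = O(m^3)}$$

Under this assumption the basic algorithm takes $\mathrm{O(m^3)}$ time overall, which dominates the $\mathrm{O(m^2)}$ cost of building the proximity matrix.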

Updated on 14-Feb-2022 11:36:52