What is BIRCH?


BIRCH stands for Balanced Iterative Reducing and Clustering Using Hierarchies. It is designed for clustering very large sets of numerical records by integrating hierarchical clustering with other clustering methods, such as iterative partitioning.

BIRCH introduces two concepts, the clustering feature and the clustering feature tree (CF tree), which are used to summarize cluster descriptions. These structures help the clustering method achieve good speed and scalability in large databases, and they also make it effective for incremental and dynamic clustering of incoming objects.

Given n d-dimensional data objects or points in a cluster, the centroid x0, radius R, and diameter D of the cluster are defined as follows:

$$x_{0}=\frac{\sum_{i=1}^{n}x_{i}}{n}$$

$$R=\sqrt{\frac{\sum_{i=1}^{n}(x_{i}-x_{0})^{2}}{n}}$$

$$D=\sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}(x_{i}-x_{j})^{2}}{n(n-1)}}$$

where R is the average distance from member objects to the centroid, and D is the average pairwise distance within the cluster. Both R and D reflect the tightness of the cluster around the centroid.

A clustering feature (CF) is a three-dimensional vector summarizing information about a cluster of objects. Given n d-dimensional objects or points in a cluster, {xi}, the CF of the cluster is defined as

CF = (n, LS, SS)

where n is the number of points in the cluster, LS is the linear sum of the n points (i.e., $\sum_{i=1}^{n}x_{i}$), and SS is the square sum of the data points (i.e., $\sum_{i=1}^{n}x_{i}^{2}$).
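The following is a minimal Python sketch of how a CF can be built from raw points and how the centroid, radius, and diameter can then be recovered from the CF alone. It assumes SS is kept as a single scalar (the sum of squared norms of the points); some presentations keep one square sum per dimension instead. The function names `clustering_feature`, `centroid`, `radius`, and `diameter` are illustrative and not part of any library.

```python
import math

def clustering_feature(points):
    """Return (n, LS, SS) for a list of equal-length numeric tuples.

    Assumption: SS is a scalar, the sum of squared norms of all points.
    """
    n = len(points)
    dim = len(points[0])
    ls = tuple(sum(p[k] for p in points) for k in range(dim))
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def centroid(cf):
    # x0 = LS / n
    n, ls, _ = cf
    return tuple(v / n for v in ls)

def radius(cf):
    # R^2 = SS / n - ||LS / n||^2
    n, ls, ss = cf
    sq_norm_centroid = sum((v / n) ** 2 for v in ls)
    return math.sqrt(max(ss / n - sq_norm_centroid, 0.0))

def diameter(cf):
    # D^2 = (2 * n * SS - 2 * ||LS||^2) / (n * (n - 1))
    n, ls, ss = cf
    if n < 2:
        return 0.0
    sq_norm_ls = sum(v * v for v in ls)
    return math.sqrt(max((2 * n * ss - 2 * sq_norm_ls) / (n * (n - 1)), 0.0))

# Four corners of a 2 x 2 square
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cf = clustering_feature(points)
print(cf, centroid(cf), radius(cf), diameter(cf))
```

Once the CF has been computed, the original points are no longer needed to evaluate any of the three measures. The `max(..., 0.0)` guard only protects against small negative values caused by floating-point rounding.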

A clustering feature is a summary of the statistics for the given cluster: from a statistical point of view, it holds the zeroth, first, and second moments of the cluster. Clustering features are additive. For instance, assume that we have two disjoint clusters, C1 and C2, with clustering features CF1 and CF2, respectively. The clustering feature of the cluster formed by merging C1 and C2 is simply CF1 + CF2.
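The additivity property can be checked with a short Python sketch. It again assumes a scalar SS, and the helper names `clustering_feature` and `merge_cf` are hypothetical. Merging the CFs of two disjoint clusters gives the same result as computing the CF over the combined points.

```python
def clustering_feature(points):
    """Return (n, LS, SS) with SS as a scalar sum of squared norms (an assumption)."""
    n = len(points)
    dim = len(points[0])
    ls = tuple(sum(p[k] for p in points) for k in range(dim))
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CF1 + CF2: add the counts, the linear sums, and the square sums."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2

c1 = [(3, 4), (2, 6), (4, 5)]
c2 = [(4, 7), (3, 8)]

cf1 = clustering_feature(c1)   # (3, (9, 15), 106)
cf2 = clustering_feature(c2)   # (2, (7, 15), 138)
merged = merge_cf(cf1, cf2)    # (5, (16, 30), 244)

print(merged == clustering_feature(c1 + c2))  # True
```

Because the merge touches only three stored values, combining two sub-clusters costs time proportional to the dimension d, regardless of how many points each sub-cluster contains.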

Clustering features are sufficient for computing all of the measurements needed to make clustering decisions in BIRCH. BIRCH uses storage efficiently by employing clustering features to summarize information about clusters of objects, thereby avoiding the need to store all objects.

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. A non-leaf node in the tree has descendants, or “children.” Each non-leaf node stores the sums of the CFs of its children and therefore summarizes clustering information about them.

A CF tree has two parameters: the branching factor, B, and the threshold, T. The branching factor specifies the maximum number of children per non-leaf node. The threshold specifies the maximum diameter of the sub-clusters stored at the leaf nodes of the tree. Together, these two parameters control the size of the resulting tree.
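The role of the threshold T can be illustrated with a simplified Python sketch. It is not a CF tree: it keeps a flat list of leaf-level CF entries and leaves out non-leaf nodes, node splitting, and the branching factor B. A new point is absorbed by the entry with the nearest centroid only if the merged sub-cluster's diameter stays within T; otherwise a new entry is started. The names `add_point` and `insert`, and the scalar SS, are assumptions made for illustration.

```python
import math

def diameter(cf):
    # D^2 = (2 * n * SS - 2 * ||LS||^2) / (n * (n - 1)), with scalar SS
    n, ls, ss = cf
    if n < 2:
        return 0.0
    sq_norm_ls = sum(v * v for v in ls)
    return math.sqrt(max((2 * n * ss - 2 * sq_norm_ls) / (n * (n - 1)), 0.0))

def add_point(cf, point):
    """Return the CF obtained by adding a single point to an existing CF."""
    n, ls, ss = cf
    return n + 1, tuple(a + b for a, b in zip(ls, point)), ss + sum(x * x for x in point)

def insert(entries, point, threshold):
    """Absorb `point` into the nearest entry if the diameter stays <= threshold."""
    best, best_dist = None, math.inf
    for i, (n, ls, _) in enumerate(entries):
        dist = math.dist(tuple(v / n for v in ls), point)
        if dist < best_dist:
            best, best_dist = i, dist
    if best is not None:
        candidate = add_point(entries[best], point)
        if diameter(candidate) <= threshold:
            entries[best] = candidate
            return
    # No entry can absorb the point without violating T: start a new sub-cluster.
    entries.append((1, tuple(point), sum(x * x for x in point)))

entries = []
for p in [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5), (0.0, 0.5)]:
    insert(entries, p, threshold=1.5)

print(len(entries), [e[0] for e in entries])
```

With a larger threshold, more points are absorbed into fewer and looser sub-clusters; with a smaller threshold, the number of entries grows, which in a full CF tree leads to more nodes and a larger tree.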