What is CURE?

CURE stands for Clustering Using REpresentatives. It is a clustering algorithm that combines several techniques to handle large data sets, outliers, and clusters with non-spherical shapes and non-uniform sizes. CURE represents a cluster by using several representative points from the cluster.

These points capture the geometry and shape of the cluster. The first representative point is chosen to be the point farthest from the center of the cluster, while the remaining points are chosen so that each is farthest from all the previously chosen points. In this way, the representative points are relatively well distributed. The number of points chosen is a parameter, but a value of 10 or more was found to work well.
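The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `centroid` and `select_representatives` are hypothetical, the "center" is taken to be the coordinate-wise mean, and "farthest from all the previously chosen points" is read as maximizing the distance to the nearest already-chosen representative.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def select_representatives(points, c):
    """Pick up to c well-scattered points from a single cluster."""
    center = centroid(points)
    # First representative: the point farthest from the cluster center.
    chosen = [max(range(len(points)),
                  key=lambda i: math.dist(points[i], center))]
    while len(chosen) < min(c, len(points)):
        remaining = [i for i in range(len(points)) if i not in chosen]
        # Next representative: the point whose nearest already-chosen
        # representative is as far away as possible (an assumed reading
        # of "farthest from all previously chosen points").
        chosen.append(max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[j]) for j in chosen),
        ))
    return [points[i] for i in chosen]

cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
reps = select_representatives(cluster, 3)
print(reps)
```

In this toy cluster the center lies near (3.2, 0), so the isolated point (10, 0) is picked first, and the later picks spread out over the rest of the cluster rather than bunching together.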

Once the representative points are chosen, they are shrunk toward the center by a factor, 𝛼. This helps moderate the effect of outliers, which are usually farther from the center and are therefore shrunk more. For instance, with 𝛼 = 0.7, a representative point 10 units from the center moves 3 units closer (ending 7 units away), while a representative point 1 unit from the center moves only 0.3 units.
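The arithmetic in this example can be checked with a short sketch. It assumes the convention implied by the numbers in the text: a point ends up at 𝛼 times its original distance from the center, so it moves by (1 − 𝛼) times that distance. The function name `shrink` is hypothetical.

```python
import math

def shrink(point, center, alpha):
    """Move a point toward the center so that it ends at alpha times
    its original distance (assumed convention, matching the example)."""
    return tuple(c + alpha * (p - c) for p, c in zip(point, center))

center = (0.0, 0.0)
far_point = (10.0, 0.0)
near_point = (1.0, 0.0)

far_shrunk = shrink(far_point, center, 0.7)
near_shrunk = shrink(near_point, center, 0.7)

# Distance each representative point moved toward the center.
far_moved = math.dist(far_point, far_shrunk)
near_moved = math.dist(near_point, near_shrunk)
print(round(far_moved, 6), round(near_moved, 6))
```

Up to floating-point rounding, the two printed values should match the 3-unit and 0.3-unit moves described above: the same factor produces a larger absolute shift for the point that starts farther out.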

CURE takes advantage of certain characteristics of the hierarchical clustering process to eliminate outliers at two different points in the clustering process. First, if a cluster is growing slowly, this may mean that it consists mostly of outliers, since by definition outliers are far from other points and will not be merged with them very often.

In CURE, this first phase of outlier elimination typically occurs when the number of clusters is 1/3 the original number of points. The second phase of outlier elimination occurs when the number of clusters is on the order of K, the number of desired clusters. At this point, small clusters are removed.
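As a rough illustration of the first checkpoint, the sketch below drops clusters that are still very small once the cluster count has fallen to one third of the number of points. The helper names and the `min_size` threshold are assumptions made for this example; the text does not specify how small a cluster must be to count as an outlier.

```python
def reached_first_checkpoint(num_clusters, num_points):
    """True once merging has reduced the cluster count to 1/3 of the points."""
    return num_clusters <= num_points / 3

def drop_small_clusters(clusters, min_size):
    """Split clusters into those kept and those discarded as likely outliers.
    min_size is a hypothetical threshold, not a value given in the text."""
    kept = [c for c in clusters if len(c) >= min_size]
    dropped = [c for c in clusters if len(c) < min_size]
    return kept, dropped

clusters = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(8, 8), (8, 9), (9, 8), (9, 9)],
    [(50, 50)],  # has not merged with anything so far: a likely outlier
]
num_points = sum(len(c) for c in clusters)  # 9 points in 3 clusters

outliers = []
if reached_first_checkpoint(len(clusters), num_points):
    clusters, outliers = drop_small_clusters(clusters, min_size=2)

print(len(clusters), outliers)
```

The same filter could be applied a second time when the number of clusters approaches K, which corresponds to the second phase described above.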

Because the worst-case complexity of CURE is $O(m^2 \log m)$, where $m$ is the number of points, it cannot be applied directly to large data sets. For this reason, CURE uses two techniques to speed up the clustering process. The first technique takes a random sample and performs hierarchical clustering on the sampled data points. This is followed by a final pass that assigns each remaining point in the data set to one of the clusters by choosing the cluster with the nearest representative point.
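The final assignment pass is simple to sketch. The example below assumes the sampled clustering has already produced a set of representative points per cluster; the function name `assign_to_clusters`, the cluster labels, and the toy coordinates are all made up for illustration.

```python
import math

def assign_to_clusters(points, representatives):
    """Label each point with the cluster that owns its nearest representative.

    representatives maps a cluster label to that cluster's list of
    representative points (assumed to come from clustering the sample).
    """
    labels = []
    for p in points:
        nearest = min(
            representatives,
            key=lambda label: min(math.dist(p, r) for r in representatives[label]),
        )
        labels.append(nearest)
    return labels

representatives = {
    "A": [(0, 0), (1, 1)],
    "B": [(10, 10), (9, 8)],
}
remaining_points = [(2, 1), (8, 9), (5, 4)]
labels = assign_to_clusters(remaining_points, representatives)
print(labels)
```

Each point is compared only against the representative points, not against every sampled point, which is what keeps this pass cheap relative to the hierarchical clustering itself.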

In some cases, the sample required for clustering is still too large and a second technique is needed. In this situation, CURE partitions the sample data and then clusters the points in each partition. This preclustering step is followed by a clustering of the intermediate clusters and a final pass that assigns each point in the data set to one of the clusters.
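The partition-then-merge flow can be sketched as follows. This is a simplified illustration, not CURE itself: the merge step uses a naive centroid-distance agglomeration as a stand-in for merging based on representative points, the partitions are formed by simple striding, and the names `agglomerate`, `precluster`, `num_partitions`, and `clusters_per_partition` are hypothetical.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def agglomerate(clusters, k):
    """Repeatedly merge the two clusters with the closest centroids
    until k clusters remain (a stand-in for CURE's merge step)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

def precluster(sample, num_partitions, clusters_per_partition, k):
    """Cluster each partition separately, then cluster the results."""
    # Split the sample into roughly equal partitions by striding.
    partitions = [sample[i::num_partitions] for i in range(num_partitions)]
    intermediate = []
    for part in partitions:
        # Each partition starts from singleton clusters.
        intermediate.extend(
            agglomerate([[p] for p in part], clusters_per_partition))
    # Merge the intermediate clusters down to the final k clusters.
    return agglomerate(intermediate, k)

sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
final = precluster(sample, num_partitions=2, clusters_per_partition=2, k=2)
print([sorted(c) for c in final])
```

Because each partition is clustered on its own, the quadratic pairwise work is done on smaller inputs, and the second stage starts from a handful of intermediate clusters instead of individual points.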