What is ROCK?

ROCK stands for RObust Clustering using linKs. It is a hierarchical clustering algorithm that explores the concept of links (the number of common neighbors between two objects) for data with categorical attributes. Traditional clustering algorithms rely on distance functions, but such distance measures do not lead to high-quality clusters when clustering categorical data.

Moreover, most clustering algorithms assess only the similarity between points when clustering; that is, at each step, the points that are most similar are merged into a single cluster. This “localized” approach is prone to errors. For instance, two distinct clusters can have a few points or outliers that are close to each other; relying only on the similarity between points to make clustering decisions can therefore cause the two clusters to be merged.

ROCK takes a more global approach to clustering by considering the neighborhoods of individual pairs of points. If two similar points also have similar neighborhoods, then the two points likely belong to the same cluster and so can be merged.

Two points, pi and pj, are neighbors if sim(pi, pj) ≥ θ, where sim is a similarity function and θ is a user-specified threshold. We can choose sim to be a distance metric or even a nonmetric that is normalized so that its values fall between 0 and 1, with higher values indicating that the points are more similar.
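The neighbor definition can be sketched in a few lines of Python. This is only an illustration: the function name `neighbor_sets`, the toy similarity `closeness`, and the sample values are hypothetical, and the sketch assumes that a point is not counted as its own neighbor, which the text above does not specify.

```python
def neighbor_sets(points, sim, theta):
    """Map each point index to the set of indices of its neighbors.

    Two points are neighbors when sim(p_i, p_j) >= theta.
    Assumption: a point is not listed as its own neighbor.
    """
    n = len(points)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim(points[i], points[j]) >= theta:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def closeness(a, b):
    """Toy similarity for values in [0, 1]; 1.0 means identical (hypothetical)."""
    return 1.0 - abs(a - b)


points = [0.10, 0.15, 0.22, 0.90]
neighbors = neighbor_sets(points, closeness, theta=0.9)
print(neighbors)  # {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
```

Raising θ makes the neighbor graph sparser, while lowering it connects more points; the last point here has no neighbors at all and behaves like an outlier.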

The number of links between pi and pj is defined as the number of common neighbors of pi and pj. If the number of links between two points is large, then it is more likely that they belong to the same cluster. By considering neighboring data points in the relationship between individual pairs of points, ROCK is more robust than standard clustering methods that focus only on point similarity.
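Counting links amounts to intersecting neighbor sets. The following is a minimal sketch, assuming the neighbors are already available as a dictionary from point index to a set of neighbor indices; the function name `count_links` and the sample graph are made up for illustration.

```python
from itertools import combinations


def count_links(neighbors):
    """Return {(i, j): number of common neighbors} for every pair i < j.

    Pairs with no common neighbor are omitted to keep the result sparse.
    """
    links = {}
    for i, j in combinations(sorted(neighbors), 2):
        common = neighbors[i] & neighbors[j]
        if common:
            links[(i, j)] = len(common)
    return links


# Hypothetical neighbor graph: 0, 1, 2 are mutual neighbors; 3 touches only 2.
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
links = count_links(neighbors)
print(links)  # {(0, 1): 1, (0, 2): 1, (0, 3): 1, (1, 2): 1, (1, 3): 1}
```

Note that points 0 and 3 are not neighbors yet still share one link through point 2, whereas points 2 and 3 are neighbors but have no link. Links therefore carry information about the surrounding neighborhood, not just about the pair itself.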

A good example of data containing categorical attributes is market basket data. Such data consists of a database of transactions, where each transaction is a set of items. Transactions are treated as records with Boolean attributes, each corresponding to a single item, such as bread or cheese.

In the record for a transaction, the attribute corresponding to an item is true if the transaction contains the item; otherwise, it is false. Other data sets with categorical attributes can be handled in a similar manner. ROCK’s concepts of neighbors and links carry over directly to such data, where the similarity between two “points” or transactions, Ti and Tj, is defined with the Jaccard coefficient as

$$\mathrm{sim}(T_{i},T_{j})=\frac{|T_{i} \cap T_{j}|}{|T_{i} \cup T_{j}|}$$
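As a small illustration, the Jaccard coefficient can be computed directly on Python sets. The function name `jaccard` and the sample baskets are hypothetical, and returning 0.0 for two empty transactions is a convention chosen here, not something the definition prescribes.

```python
def jaccard(t_i, t_j):
    """Jaccard coefficient |Ti ∩ Tj| / |Ti ∪ Tj| for two item collections."""
    a, b = set(t_i), set(t_j)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty transactions
    return len(a & b) / len(union)


t1 = {"bread", "cheese", "milk"}
t2 = {"bread", "cheese", "eggs"}
print(jaccard(t1, t2))  # 2 shared items out of 4 distinct items: 0.5
```

With a threshold of θ = 0.5, these two transactions would count as neighbors; with any larger θ they would not.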

ROCK first constructs a sparse graph from a given data similarity matrix, using a similarity threshold and the concept of shared neighbors. It then performs agglomerative hierarchical clustering on the sparse graph. A goodness measure is used to evaluate the clustering. Random sampling is used for scaling up to large data sets.
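The overall procedure can be sketched as follows. This is a simplified illustration rather than a faithful implementation: it omits random sampling and the heap structures an efficient version would use, the function name `rock_sketch` is hypothetical, and a point is assumed not to be its own neighbor. The goodness measure is not spelled out in this article; the sketch assumes the normalized form from the original ROCK paper, link(Ci, Cj) divided by (ni + nj)^e − ni^e − nj^e with e = 1 + 2(1 − θ)/(1 + θ).

```python
from itertools import combinations


def rock_sketch(transactions, theta, k):
    """Naive ROCK-style clustering of item sets into at most k clusters."""
    assert 0 < theta < 1 and k >= 1
    sets = [set(t) for t in transactions]
    n = len(sets)

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # Step 1: sparse neighbor graph from the similarity threshold.
    nbrs = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if jaccard(sets[i], sets[j]) >= theta:
            nbrs[i].add(j)
            nbrs[j].add(i)

    # Assumed exponent from the original paper's goodness measure.
    e = 1 + 2 * (1 - theta) / (1 + theta)

    def goodness(c1, c2):
        # Cross-cluster links, normalized by the expected link count.
        cross = sum(len(nbrs[i] & nbrs[j]) for i in c1 for j in c2)
        n1, n2 = len(c1), len(c2)
        return cross / ((n1 + n2) ** e - n1 ** e - n2 ** e)

    # Step 2: agglomerative merging, best pair first.
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: goodness(clusters[ab[0]], clusters[ab[1]]),
        )
        if goodness(clusters[a], clusters[b]) == 0:
            break  # no remaining pair of clusters shares any link
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


baskets = [
    {"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"b", "c", "d"},
    {"x", "y", "z"}, {"x", "y", "w"}, {"x", "z", "w"}, {"y", "z", "w"},
]
result = rock_sketch(baskets, theta=0.5, k=2)
print(result)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In this toy data set, every pair of baskets within a group has a Jaccard similarity of 0.5 and no items are shared across groups, so the two groups are recovered exactly. The normalization in the goodness measure keeps large clusters from absorbing everything simply because they have more points to link to.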

The worst-case time complexity of ROCK is

$$O(n^{2} + n\,m_{m}m_{a} + n^{2}\log n)$$

where m_m and m_a are the maximum and average number of neighbors, respectively, and n is the number of objects.