What is CluStream?


CluStream is an algorithm for clustering evolving data streams based on user-specified, online clustering queries. It divides the clustering process into online and offline components.

The online component computes and stores summary statistics about the data stream using microclusters, and performs incremental online computation and maintenance of the microclusters. The offline component does macro-clustering and answers user queries using the stored summary statistics, which are based on the tilted time frame model.

To cluster evolving data streams based on both historical and current stream data, a tilted time frame model (such as a progressive logarithmic model) is adopted. It stores snapshots of the set of microclusters at different levels of granularity depending on recency.

The intuition here is that more information will be needed for more recent events as opposed to older events. The stored information can be used for processing history-related, user-specific clustering queries. A microcluster in CluStream is defined as a clustering feature.
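The effect of a tilted time frame can be illustrated with a small sketch. It assumes one simple progressive logarithmic scheme: a snapshot taken at clock time t is assigned the largest order i such that alpha^i divides t, and only the most recent few snapshots of each order are kept. The names `snapshot_order`, `retained_snapshots`, `alpha`, and `per_order` are hypothetical and chosen for illustration; the article does not specify the exact retention rule.

```python
from collections import defaultdict, deque


def snapshot_order(t, alpha=2):
    """Return the largest i such that alpha**i divides t (requires t >= 1)."""
    order = 0
    while t % (alpha ** (order + 1)) == 0:
        order += 1
    return order


def retained_snapshots(now, alpha=2, per_order=3):
    """Clock times whose snapshots are still stored at time `now`.

    Assumption: each order keeps only its `per_order` most recent snapshots.
    """
    frames = defaultdict(lambda: deque(maxlen=per_order))
    for t in range(1, now + 1):
        # A full deque silently drops its oldest entry on append.
        frames[snapshot_order(t, alpha)].append(t)
    return sorted(t for kept in frames.values() for t in kept)


kept = retained_snapshots(16)
print(kept)
```

Under these assumptions the retained times are dense near the current time and increasingly sparse further back, which matches the intuition that recent events need finer granularity.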

CluStream extends the concept of the clustering feature developed in BIRCH to include the temporal domain. As a temporal extension of the clustering feature, a microcluster for a set of d-dimensional points, X1, ..., Xn, with timestamps, T1, ..., Tn, is defined as the (2d + 3) tuple (CF2x, CF1x, CF2t, CF1t, n), where CF2x and CF1x are d-dimensional vectors while CF2t, CF1t, and n are scalars. CF2x maintains the sum of the squares of the data values per dimension, that is, $\sum_{i=1}^{n} X_{i}^{2}$.

Similarly, for each dimension, the sum of the data values is maintained in CF1x. From a statistical point of view, CF2x and CF1x represent the second- and first-order moments of the data, respectively. The sum of squares of the timestamps is maintained in CF2t. The sum of the timestamps is maintained in CF1t. Finally, the number of data points in the microcluster is maintained in n.
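The (2d + 3) tuple can be sketched as a small Python class. This is a minimal illustration, not a reference implementation: the class name `MicroCluster` and the methods `insert`, `centroid`, and `rms_deviation` are hypothetical names, and plain lists stand in for the d-dimensional vectors.

```python
import math


class MicroCluster:
    """Temporal clustering feature (CF2x, CF1x, CF2t, CF1t, n)."""

    def __init__(self, d):
        self.cf2x = [0.0] * d  # per-dimension sum of squared values
        self.cf1x = [0.0] * d  # per-dimension sum of values
        self.cf2t = 0.0        # sum of squared timestamps
        self.cf1t = 0.0        # sum of timestamps
        self.n = 0             # number of points absorbed

    def insert(self, x, t):
        """Fold one d-dimensional point x with timestamp t into the sums."""
        for j, v in enumerate(x):
            self.cf1x[j] += v
            self.cf2x[j] += v * v
        self.cf1t += t
        self.cf2t += t * t
        self.n += 1

    def centroid(self):
        """Mean of the points, recovered from the first-order moment."""
        return [s / self.n for s in self.cf1x]

    def rms_deviation(self):
        """Root of the summed per-dimension variance, E[x^2] - E[x]^2."""
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s1, s2 in zip(self.cf1x, self.cf2x))
        return math.sqrt(max(var, 0.0))  # guard against rounding below zero


mc = MicroCluster(d=2)
mc.insert((1.0, 2.0), t=1.0)
mc.insert((3.0, 4.0), t=2.0)
print(mc.centroid(), mc.rms_deviation())
```

Because only sums are stored, the centroid and spread can be recovered at any time without keeping the original points.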

Clustering features have additive and subtractive properties that make them very useful for data stream cluster analysis. For example, two microclusters can be merged by adding their respective clustering features. Furthermore, a large number of microclusters can be maintained without using a great deal of memory. Snapshots of these microclusters are stored away at key points in time based on the tilted time frame.
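The additive and subtractive properties can be checked with a short sketch. The `CF` tuple and the helpers `cf_of`, `merge`, and `subtract` are hypothetical names introduced for illustration. Subtraction is shown here as removing an earlier summary from a later one, which is one plausible use when comparing stored snapshots.

```python
from typing import NamedTuple, Tuple


class CF(NamedTuple):
    cf2x: Tuple[float, ...]  # per-dimension sum of squares
    cf1x: Tuple[float, ...]  # per-dimension sum
    cf2t: float              # sum of squared timestamps
    cf1t: float              # sum of timestamps
    n: int                   # number of points


def cf_of(points, timestamps):
    """Build the clustering feature of a non-empty batch of points."""
    d = len(points[0])
    return CF(
        cf2x=tuple(sum(p[j] ** 2 for p in points) for j in range(d)),
        cf1x=tuple(sum(p[j] for p in points) for j in range(d)),
        cf2t=sum(t * t for t in timestamps),
        cf1t=sum(timestamps),
        n=len(points),
    )


def merge(a, b):
    """Additive property: component-wise sum of two clustering features."""
    return CF(
        tuple(x + y for x, y in zip(a.cf2x, b.cf2x)),
        tuple(x + y for x, y in zip(a.cf1x, b.cf1x)),
        a.cf2t + b.cf2t, a.cf1t + b.cf1t, a.n + b.n,
    )


def subtract(a, b):
    """Subtractive property: remove b's contribution from a."""
    return CF(
        tuple(x - y for x, y in zip(a.cf2x, b.cf2x)),
        tuple(x - y for x, y in zip(a.cf1x, b.cf1x)),
        a.cf2t - b.cf2t, a.cf1t - b.cf1t, a.n - b.n,
    )


early = cf_of([(1.0, 2.0), (3.0, 4.0)], [1.0, 2.0])
late = cf_of([(5.0, 6.0)], [3.0])
combined = merge(early, late)
print(combined)
print(subtract(combined, early) == late)
```

Merging gives the same tuple as summarizing all points at once, so microclusters can be combined or differenced without revisiting the raw stream.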

Online microcluster processing is divided into two phases: statistical data collection and updating of microclusters. In the first phase, a total of q microclusters, M1, ..., Mq, are maintained, where q is usually significantly larger than the number of natural clusters and is determined by the amount of available memory.

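In the second phase, microclusters are updated. Each new data point is added to either an existing cluster or a new one. To decide whether a new cluster is required, a maximum boundary is defined for each cluster. If the new point falls within the boundary of its nearest cluster, it is added to that cluster; otherwise, it becomes the first point of a new cluster.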
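The update phase can be sketched as follows. Several details here are assumptions made for illustration and are not stated in the text above: the maximum boundary is taken to be `boundary_factor` times the RMS deviation of the nearest microcluster, a fixed `singleton_radius` is used for clusters holding a single point, and the step that frees memory once more than q microclusters exist is omitted. The timestamp sums are left out to keep the sketch short.

```python
import math


class MicroCluster:
    """Spatial part of a microcluster, started from a single point."""

    def __init__(self, x):
        self.cf1x = list(x)
        self.cf2x = [v * v for v in x]
        self.n = 1

    def absorb(self, x):
        for j, v in enumerate(x):
            self.cf1x[j] += v
            self.cf2x[j] += v * v
        self.n += 1

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms_deviation(self):
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s1, s2 in zip(self.cf1x, self.cf2x))
        return math.sqrt(max(var, 0.0))


def update(clusters, x, boundary_factor=2.0, singleton_radius=1.0):
    """Absorb x into the nearest microcluster, or start a new one."""
    if clusters:
        nearest = min(clusters, key=lambda m: math.dist(m.centroid(), x))
        # Assumed boundary rule; a singleton has no spread of its own.
        if nearest.n > 1:
            boundary = boundary_factor * nearest.rms_deviation()
        else:
            boundary = singleton_radius
        if math.dist(nearest.centroid(), x) <= boundary:
            nearest.absorb(x)
            return nearest
    fresh = MicroCluster(x)
    clusters.append(fresh)
    return fresh


clusters = []
for point in [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (0.2, 0.1)]:
    update(clusters, point)
print([(m.n, m.centroid()) for m in clusters])
```

In this run the far-away point (10.0, 10.0) lies outside the boundary of the only existing microcluster, so it starts a second one, while the other three points end up in the first.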

Published on 25-Nov-2021 07:57:07