What is STREAM?

STREAM is a single-pass, constant-factor approximation algorithm that was developed for the k-medians problem. The k-medians problem is to cluster N data points into k clusters or groups such that the sum squared error (SSQ) between the points and the cluster center to which they are assigned is minimized. The idea is to assign similar points to the same cluster, where these points are dissimilar from points in other clusters.
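As a small illustration of the SSQ objective described above, the following sketch sums the squared distance from each point to its nearest center. The function name `ssq` and the sample data are assumptions made for this example only.

```python
def ssq(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        # Squared distance to the closest center is this point's contribution.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Hypothetical sample data: two tight pairs of points, one center per pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(ssq(points, centers))  # 4.0
```

A lower SSQ for the same k indicates centers that sit closer to the points assigned to them.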

In the stream data model, data points can only be seen once, and memory and time are limited. To achieve high-quality clustering under these constraints, the STREAM algorithm processes data streams in buckets (or batches) of m points, with each bucket fitting in main memory.

For each bucket, bi, STREAM clusters the bucket’s points into k clusters. It then summarizes the bucket information by retaining only the information regarding the k centers, with each cluster center being weighted by the number of points assigned to its cluster.

STREAM then discards the points, retaining only the center information. Once enough centers have been collected, the weighted centers are clustered again to produce another set of O(k) cluster centers. This is repeated so that at every level, at most m points are retained. This approach results in a one-pass, O(kN)-time, O(N^ε)-space (for some constant ε < 1), constant-factor approximation algorithm for data stream k-medians.
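The bucket-and-recluster procedure described above can be sketched roughly as follows. This is a minimal illustration, not the published algorithm: the names `weighted_cluster` and `stream_cluster` are hypothetical, and the inner clusterer is a simple weighted Lloyd-style iteration swapped in for the constant-factor approximation subroutine that STREAM actually relies on, so the approximation guarantee does not carry over to this sketch.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_cluster(points, weights, k, iters=10):
    """Stand-in clusterer (assumption): weighted Lloyd iterations with
    farthest-first initialization. Returns a list of (center, weight) pairs."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    mass = [0.0] * len(centers)
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in centers]
        mass = [0.0] * len(centers)
        for p, w in zip(points, weights):
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            mass[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        centers = [tuple(s / mass[j] for s in sums[j]) if mass[j] > 0 else centers[j]
                   for j in range(len(centers))]
    # Each center is weighted by the total weight of the points assigned to it.
    return [(c, w) for c, w in zip(centers, mass) if w > 0]


def stream_cluster(stream, k, m):
    """One pass over the stream, keeping at most m raw points and at most m
    weighted centers per level."""
    assert 0 < k <= m
    levels = [[]]   # levels[i] holds (center, weight) pairs
    bucket = []     # raw points of the current bucket

    def push(level, pairs):
        if level == len(levels):
            levels.append([])
        if len(levels[level]) + len(pairs) > m:
            # Level is full: summarize it into at most k centers one level up.
            pts, wts = zip(*levels[level])
            levels[level] = []
            push(level + 1, weighted_cluster(list(pts), list(wts), k))
        levels[level].extend(pairs)

    def flush():
        push(0, weighted_cluster(bucket, [1.0] * len(bucket), k))
        bucket.clear()  # the raw points are discarded

    for p in stream:
        bucket.append(p)
        if len(bucket) == m:
            flush()
    if bucket:
        flush()
    retained = [pair for lvl in levels for pair in lvl]
    if not retained:
        return []
    pts, wts = zip(*retained)
    return weighted_cluster(list(pts), list(wts), k)


# Hypothetical data: two well-separated groups, repeated to form a longer stream.
blob = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
result = stream_cluster(blob * 5, k=2, m=4)
print(result)
```

On this toy stream the two returned centers should land near (0.5, 0.5) and (10.5, 10.5), and the weights should add up to the number of points seen, since every discarded point is still counted in the weight of some retained center.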

STREAM derives quality k-medians clusters with limited space and time. However, it considers neither the evolution of the data nor time granularity. The clustering can become dominated by the older, outdated data of the stream. In practice, the nature of the clusters can vary with both the moment at which they are evaluated and the time horizon over which they are measured.

For example, a user may want to examine the clusters that appeared last week, last month, or last year. These can be different. Hence, a data stream clustering algorithm must also support the flexibility to calculate clusters over user-defined time periods in an interactive manner.

CluStream is an algorithm for the clustering of evolving data streams based on user-specified, online clustering queries. It divides the clustering process into online and offline components.

The online component computes and stores summary statistics about the data stream using micro-clusters, and performs incremental online computation and maintenance of the micro-clusters. The offline component performs macro-clustering and answers various user queries using the stored summary statistics, which are based on the tilted time frame model.
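To make the idea of incrementally maintained summary statistics more concrete, here is a minimal sketch of a micro-cluster. The exact fields are an assumption, not something stated in the text above: the sketch keeps a point count, per-dimension linear and squared sums, and the corresponding sums of timestamps. The class and method names (`MicroCluster`, `insert`, `minus`) are hypothetical.

```python
import copy


class MicroCluster:
    """Additive summary of a group of points (field layout is an assumption)."""

    def __init__(self, dim):
        self.n = 0                 # number of points absorbed
        self.ls = [0.0] * dim      # linear sum of the points, per dimension
        self.ss = [0.0] * dim      # sum of squares of the points, per dimension
        self.lt = 0.0              # sum of timestamps
        self.st = 0.0              # sum of squared timestamps

    def insert(self, point, t):
        # Online step: constant work per arriving point, no raw data kept.
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x
        self.lt += t
        self.st += t * t

    def centroid(self):
        return [s / self.n for s in self.ls]

    def minus(self, older):
        # Offline step: subtracting an older snapshot of the same micro-cluster
        # leaves the statistics of the points that arrived in between.
        out = MicroCluster(len(self.ls))
        out.n = self.n - older.n
        out.ls = [a - b for a, b in zip(self.ls, older.ls)]
        out.ss = [a - b for a, b in zip(self.ss, older.ss)]
        out.lt = self.lt - older.lt
        out.st = self.st - older.st
        return out


mc = MicroCluster(dim=2)
mc.insert((1.0, 2.0), t=1)
snapshot = copy.deepcopy(mc)       # stored summary at time 1
mc.insert((3.0, 4.0), t=2)
mc.insert((5.0, 6.0), t=3)

recent = mc.minus(snapshot)        # only the points seen after time 1
print(mc.centroid())               # [3.0, 4.0]
print(recent.n, recent.centroid()) # 2 [4.0, 5.0]
```

Because every field is a plain sum, summaries can be updated, merged, or subtracted without revisiting the stream, which is what allows an offline component to answer queries over different time horizons from stored snapshots.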

To cluster evolving data streams based on both historical and current stream data information, the tilted time frame model (such as a progressive logarithmic model) is adopted, which stores snapshots of a set of micro-clusters at different levels of granularity depending on recency.
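One possible way to keep snapshots at a granularity that coarsens with age is sketched below. It is a simplified illustration under stated assumptions, not the exact scheme used by CluStream: a snapshot taken at tick t is filed under the largest order i for which t is divisible by alpha^i, and each order keeps only its most recent few snapshots. The class name `TiltedSnapshots` and the parameters `alpha` and `per_order` are hypothetical.

```python
from collections import deque


class TiltedSnapshots:
    """Keeps recent snapshots densely and older ones sparsely (sketch)."""

    def __init__(self, alpha=2, per_order=2):
        self.alpha = alpha
        self.per_order = per_order
        self.orders = {}  # order -> deque of (tick, snapshot)

    def store(self, tick, snapshot):
        if tick < 1:
            raise ValueError("ticks are assumed to start at 1")
        # Find the largest order i such that alpha**i divides the tick.
        order = 0
        while tick % (self.alpha ** (order + 1)) == 0:
            order += 1
        # A bounded deque silently drops the oldest snapshot of this order.
        queue = self.orders.setdefault(order, deque(maxlen=self.per_order))
        queue.append((tick, snapshot))

    def ticks(self):
        return sorted(t for q in self.orders.values() for t, _ in q)


store = TiltedSnapshots(alpha=2, per_order=2)
for tick in range(1, 17):
    store.store(tick, snapshot=f"micro-clusters@{tick}")  # placeholder payload
print(store.ticks())  # [4, 8, 10, 12, 13, 14, 15, 16]
```

After 16 ticks only 8 snapshots remain, and the gaps between retained ticks widen as the snapshots get older, so recent history is available at fine granularity while distant history is kept only coarsely.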