How does the Lossy Counting algorithm find frequent items?


The user supplies two input parameters: the minimum support threshold, σ, and the error bound, ε. The incoming stream is conceptually divided into buckets of width w = ⌈1/ε⌉.
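As a small illustration of the bucketing above, the sketch below computes the bucket width from ε and the bucket that a given stream position falls into. The function names `bucket_width` and `bucket_id` are hypothetical helpers introduced here, and the sketch assumes stream positions are counted from 1.

```python
import math


def bucket_width(epsilon: float) -> int:
    """Width w = ceil(1 / epsilon) of each conceptual bucket."""
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    return math.ceil(1 / epsilon)


def bucket_id(n: int, width: int) -> int:
    """1-based id of the bucket holding the n-th item (n starts at 1)."""
    return (n + width - 1) // width  # integer ceiling of n / width


w = bucket_width(0.2)
print(w)                                        # 5
print([bucket_id(n, w) for n in range(1, 12)])  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3]
```

With ε = 0.2, every run of five consecutive items forms one bucket, so the eleventh item opens bucket 3.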

Let N be the current stream length, i.e., the number of items seen so far. The algorithm maintains a frequency list for all items with a frequency greater than 0. For every item, the list stores f, the approximate frequency count, and ∆, the maximum possible error of f.

The algorithm processes buckets of items as follows. When a new bucket arrives, the items in the bucket are added to the frequency list. If a given item already exists in the list, its frequency count, f, is simply increased. Otherwise, the item is inserted into the list with a frequency count of 1. If the new item is from the bth bucket, ∆, the maximum possible error on the frequency count of the item, is set to b−1.

Whenever a bucket boundary is reached (i.e., N has reached a multiple of the width w, such as w, 2w, 3w, etc.), the frequency list is examined. Let b be the current bucket number. An item entry is deleted if, for that entry, f + ∆ ≤ b. In this way, the algorithm aims to keep the frequency list small so that it can fit in main memory. The frequency count stored for each item is either the true frequency of the item or an underestimate of it.
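The insertion and deletion steps described so far can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the class name `LossyCounter`, its attribute names, and the choice to handle the stream one item at a time (instead of a whole bucket at once) are assumptions made for this example.

```python
import math


class LossyCounter:
    """Minimal sketch of the frequency list with entries item -> [f, delta]."""

    def __init__(self, epsilon: float):
        self.width = math.ceil(1 / epsilon)  # bucket width w
        self.n = 0                           # stream length N so far
        self.entries = {}                    # item -> [f, delta]

    def add(self, item) -> None:
        self.n += 1
        # Current bucket number b (integer ceiling of N / w).
        b = (self.n + self.width - 1) // self.width
        if item in self.entries:
            self.entries[item][0] += 1       # existing item: increase f
        else:
            self.entries[item] = [1, b - 1]  # new item: f = 1, delta = b - 1
        if self.n % self.width == 0:         # bucket boundary reached
            self._prune(b)

    def _prune(self, b: int) -> None:
        # Delete every entry with f + delta <= b.
        self.entries = {
            item: fd for item, fd in self.entries.items() if fd[0] + fd[1] > b
        }


counter = LossyCounter(epsilon=0.2)          # w = 5
for symbol in "abacaadaea":
    counter.add(symbol)
print(counter.entries)                       # {'a': [6, 0]}
```

In this run, the items b and c are deleted at the first boundary (f + ∆ = 1 ≤ 1), and d and e at the second (f + ∆ = 2 ≤ 2), so only a survives with its exact count.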

An important factor in approximation algorithms is the approximation ratio (or error bound). Let’s look at the case where an item is deleted. This occurs when f + ∆ ≤ b for an item, where b is the current bucket number.

We know that b ≤ N/w, that is, b ≤ εN. The actual frequency of an item is at most f + ∆. Therefore, the most that an item can be underestimated is εN. If the actual support of this item is σ (this is the minimum support, or lower bound, for it to be considered frequent), then the actual frequency is σN, and the frequency, f, on the frequency list should be at least (σN − εN).
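The chain of inequalities in this argument can be written out step by step. The symbol f_true, for the actual frequency of an item, is introduced here only for illustration and is not part of the original notation.

```latex
\begin{align*}
b &\le \frac{N}{w} \le \varepsilon N
  && \text{since } w = \lceil 1/\varepsilon \rceil \ge 1/\varepsilon \\
f_{\text{true}} &\le f + \Delta \le b \le \varepsilon N
  && \text{for an item deleted at bucket } b \\
f &\ge f_{\text{true}} - \varepsilon N \ge \sigma N - \varepsilon N = (\sigma - \varepsilon)N
  && \text{for a frequent item, where } f_{\text{true}} \ge \sigma N
\end{align*}
```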

Thus, if we output all of the items in the frequency list having an f value of at least (σN − εN), then all of the frequent items will be output. In addition, some subfrequent items (with an actual frequency of at least σN − εN but less than σN) will be output, too.
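The output rule above amounts to a single filter over the frequency list. The sketch below assumes the list is held as a dictionary mapping each item to a pair (f, ∆). The function name `frequent_items` and the sample counts are made up for illustration.

```python
def frequent_items(entries, n, sigma, epsilon):
    """Return the items whose stored count f is at least (sigma - epsilon) * n."""
    threshold = (sigma - epsilon) * n
    return {item: f for item, (f, _delta) in entries.items() if f >= threshold}


# Hypothetical frequency list after N = 100 items with epsilon = 0.05 (w = 20).
entries = {"a": (60, 0), "b": (18, 2), "c": (5, 3)}
print(frequent_items(entries, n=100, sigma=0.2, epsilon=0.05))
# {'a': 60, 'b': 18}
```

Here the cutoff is (0.2 − 0.05) × 100 = 15. Item b is reported even though its actual frequency lies somewhere between 18 and 20, so it may be one of the subfrequent items that slip into the output.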

Updated on 17-Feb-2022 11:32:55