How can we further improve the efficiency of Apriori-based mining?


Several variations of the Apriori algorithm have been proposed that aim to improve the efficiency of the original algorithm. They are as follows −

The hash-based technique (hashing itemsets into corresponding buckets) − A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For instance, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts. A 2-itemset whose bucket count is below the support threshold cannot be frequent, so it can be removed from the candidate set.
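The bucket-counting idea can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the hash function `(x * 10 + y) % num_buckets`, the choice of seven buckets, and the toy transactions are all assumptions, and items are assumed to be small integers.

```python
from collections import Counter
from itertools import combinations


def bucket_of(pair, num_buckets):
    # Illustrative hash function (an assumption): items are small integers.
    x, y = pair
    return (x * 10 + y) % num_buckets


def hashed_candidate_pairs(transactions, min_count, num_buckets=7):
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    # Single scan: count 1-itemsets and hash every 2-itemset of each transaction.
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, num_buckets)] += 1
    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    # Keep a pair only if both items are frequent and its bucket reaches the threshold.
    c2 = [
        pair
        for pair in combinations(frequent_items, 2)
        if bucket_counts[bucket_of(pair, num_buckets)] >= min_count
    ]
    return frequent_items, c2, bucket_counts


# Toy data (hypothetical): every item is frequent at min_count=3.
transactions = [[1, 2], [1, 2], [1, 2], [3, 4], [3, 4], [3, 4], [1, 3]]
frequent_items, c2, buckets = hashed_candidate_pairs(transactions, min_count=3)
print(frequent_items, c2, buckets)
```

On this toy data the six possible pairs of frequent items shrink to three candidates. The pair (1, 3) survives only because it shares a bucket with (3, 4), which shows that bucket counts can only rule candidates out; the remaining candidates still need an exact support count.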

Transaction reduction − A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets. Thus, such a transaction can be marked or removed from further consideration because subsequent scans of the database for j-itemsets, where j > k, will not need it.
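A minimal Python sketch of this filtering step is shown below. The function name `reduce_transactions` and the sample data are hypothetical; the frequent k-itemsets are assumed to have been found already by the current Apriori pass.

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only the transactions that contain at least one frequent k-itemset."""
    frequent = [frozenset(s) for s in frequent_k_itemsets]
    kept = []
    for t in transactions:
        items = set(t)
        # A transaction with no frequent k-itemset cannot support any
        # frequent (k + 1)-itemset, so later scans can skip it.
        if any(s <= items for s in frequent):
            kept.append(t)
    return kept


# Hypothetical example with k = 2.
transactions = [[1, 2, 5], [2, 4], [4, 5], [1, 2, 3]]
frequent_2_itemsets = [(1, 2), (2, 3)]
reduced = reduce_transactions(transactions, frequent_2_itemsets)
print(reduced)
```

Here the transactions [2, 4] and [4, 5] are dropped, so the scan for 3-itemsets only needs to read the two remaining transactions.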

Partitioning − A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n non-overlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition.

For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. The procedure employs a special data structure that, for each itemset, records the TIDs of the transactions containing the items in the itemset. This allows it to find all of the local frequent k-itemsets, for k = 1, 2, ..., in just one scan of the database.

A local frequent itemset may or may not be frequent with respect to the entire database, D. Any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions. Therefore, all local frequent itemsets are candidate itemsets with respect to D. The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed to determine the global frequent itemsets.
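The two phases can be sketched in Python as below. This is only an illustration under stated assumptions: the function names, the equal-sized contiguous partitions, and the use of Python sets as TID lists are choices made for the sketch, and min_sup is taken to be a relative support between 0 and 1.

```python
from itertools import combinations
from math import ceil


def local_frequent_itemsets(partition, min_sup):
    """Phase I for one partition, using TID lists built in a single scan."""
    n = len(partition)
    tidlists = {}
    for tid, t in enumerate(partition):  # the only scan of this partition
        for item in set(t):
            tidlists.setdefault(frozenset([item]), set()).add(tid)
    current = {s: tids for s, tids in tidlists.items() if len(tids) / n >= min_sup}
    result = set(current)
    while current:
        nxt = {}
        for (a, ta), (b, tb) in combinations(current.items(), 2):
            union = a | b
            if len(union) == len(a) + 1 and union not in nxt:
                tids = ta & tb  # TIDs containing both a and b, i.e. their union
                if len(tids) / n >= min_sup:
                    nxt[union] = tids
        result |= set(nxt)
        current = nxt
    return result


def partition_mine(transactions, min_sup, n_partitions):
    size = ceil(len(transactions) / n_partitions)
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Phase I: the union of all local frequent itemsets is the global candidate set.
    candidates = set()
    for p in partitions:
        candidates |= local_frequent_itemsets(p, min_sup)
    # Phase II: one more scan of D to count the actual support of each candidate.
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:
        items = set(t)
        for c in candidates:
            if c <= items:
                counts[c] += 1
    n = len(transactions)
    return {c: k for c, k in counts.items() if k / n >= min_sup}


# Hypothetical data: {1, 2} is frequent in the first partition but not in D.
D = [[1, 2], [1, 2], [3], [1, 3], [2, 3], [3]]
global_frequent = partition_mine(D, min_sup=0.5, n_partitions=2)
print(global_frequent)
```

In this example the itemset {1, 2} is locally frequent in the first partition (2 of 3 transactions), so it becomes a global candidate, but Phase II rejects it because it appears in only 2 of the 6 transactions in D.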

Sampling − The fundamental idea of the sampling approach is to select a random sample S of the given data D, and then search for frequent itemsets in S rather than D. In this way, we trade off some degree of accuracy against efficiency. The sample size of S is chosen so that the search for frequent itemsets in S can be completed in main memory, and therefore only one scan of the transactions in S is needed overall. Because the search runs on S rather than D, some global frequent itemsets may be missed.
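A small Python sketch of the sampling approach follows. The helper `frequent_itemsets` is a plain level-wise search written only for this illustration, and the names `mine_by_sampling`, `sample_size`, and `seed` are assumptions rather than part of any standard API. min_sup is taken to be a relative support between 0 and 1.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Plain level-wise (Apriori-style) search, kept small for illustration."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted({i for s in sets for i in s})
    level = [frozenset([i]) for i in items]
    found = {}
    while level:
        counts = {c: sum(1 for s in sets if c <= s) for c in level}
        frequent = {c: k for c, k in counts.items() if k / n >= min_sup}
        found.update(frequent)
        # Join step: unions of two frequent k-itemsets that have k + 1 items.
        level = list(
            {a | b for a, b in combinations(frequent, 2) if len(a | b) == len(a) + 1}
        )
    return found


def mine_by_sampling(transactions, min_sup, sample_size, seed=0):
    """Mine a random sample S of D instead of D itself."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(transactions, sample_size)
    return frequent_itemsets(sample, min_sup)


# Hypothetical data.
D = [[1, 2], [1, 2], [3], [1, 3], [2, 3], [3]]
full_result = frequent_itemsets(D, min_sup=0.5)
sampled_result = mine_by_sampling(D, min_sup=0.5, sample_size=4)
print(full_result)
print(sampled_result)
```

The sampled result depends on which transactions are drawn, so it can differ from the result on D. When the sample covers all of D, the two results coincide.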

raja
Published on 24-Nov-2021 06:52:03