How can we discover frequent substructures?


The discovery of frequent substructures usually consists of two steps. In the first step, we generate frequent substructure candidates. In the second step, the frequency of each candidate is checked. Most studies on frequent substructure discovery focus on optimizing the first step, because the second step involves a subgraph isomorphism test whose computational complexity is excessively high (the problem is NP-complete).
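
For concreteness, the loop below is a minimal sketch of this two-step process, assuming the database graphs are networkx graphs. The name generate_candidates is a placeholder for whichever candidate-generation strategy is chosen (Apriori-based or pattern-growth), and the support count is where the expensive subgraph isomorphism test appears.

```python
# A minimal sketch of the two-step mining loop (an illustration, not a
# definitive implementation). Step 1 is delegated to `generate_candidates`;
# step 2 is the costly subgraph-isomorphism-based frequency test.
from networkx.algorithms import isomorphism

def support(candidate, database):
    """Step 2: count database graphs containing the candidate as an (induced) subgraph."""
    return sum(
        1 for g in database
        if isomorphism.GraphMatcher(g, candidate).subgraph_is_isomorphic()
    )

def mine_frequent(database, min_support, generate_candidates):
    """Step 1 generates candidates; step 2 keeps those that are frequent."""
    frequent, size = [], 1
    candidates = generate_candidates(frequent, size)
    while candidates:
        kept = [c for c in candidates if support(c, database) >= min_support]
        frequent.extend(kept)
        size += 1
        candidates = generate_candidates(kept, size)  # extend only frequent graphs
    return frequent
```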

The main approaches to frequent substructure mining are as follows −

Apriori-based Approach − Apriori-based frequent substructure mining algorithms share the same characteristics as Apriori-based frequent itemset mining algorithms. The search for frequent graphs begins with graphs of small “size” and proceeds in a bottom-up manner by generating candidates with an additional vertex, edge, or path. The definition of graph size depends on the algorithm used.

The main design complexity of Apriori-based substructure mining algorithms lies in the candidate generation step. Candidate generation in frequent itemset mining is straightforward. For example, suppose we have two frequent itemsets of size 3: (abc) and (bcd).

The frequent itemset candidate of size 4 generated from them is simply (abcd), derived from a join. However, the candidate generation problem in frequent substructure mining is harder than in frequent itemset mining, because there are many ways to join two substructures.
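
The snippet below is a small, self-contained illustration of the itemset join (the helper name apriori_join is ours, not a standard API): two size-k itemsets whose union has exactly k + 1 items are merged into one size-(k + 1) candidate, reproducing the (abc), (bcd) → (abcd) example from the text.

```python
# Illustrative sketch of the join step in frequent itemset mining.
from itertools import combinations

def apriori_join(frequent_k):
    """Generate size-(k+1) candidates by joining size-k frequent itemsets
    that overlap in k-1 items."""
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        union = tuple(sorted(set(a) | set(b)))
        if len(union) == len(a) + 1:          # the two itemsets share k-1 items
            candidates.add(union)
    return candidates

print(apriori_join([('a', 'b', 'c'), ('b', 'c', 'd')]))
# {('a', 'b', 'c', 'd')}
```

For graphs, no such single, unambiguous join exists: two size-k subgraphs can typically be merged in several topologically different ways, which is exactly why candidate generation is the hard part.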

Pattern-Growth Approach − The Apriori-based approach has to use the breadth-first search (BFS) strategy because of its level-wise candidate generation: to determine whether a size-(k + 1) graph is frequent, it must check all of its corresponding size-k subgraphs to obtain an upper bound on its frequency. Thus, before mining any size-(k + 1) subgraph, the Apriori-like approach usually has to complete the mining of size-k subgraphs.

Therefore, BFS is necessary for the Apriori-like approach. In contrast, the pattern-growth approach is more flexible in its search strategy: it can use breadth-first search as well as depth-first search (DFS), the latter of which consumes less memory.
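
The sketch below shows the pattern-growth idea with a depth-first search, using a deliberately simplified representation: a graph is a frozenset of labeled edges, and containment is a plain edge-subset test rather than a true subgraph isomorphism check. The helper names are illustrative only.

```python
# A simplified, self-contained sketch of pattern-growth with DFS.
def support(pattern, database):
    """Count database graphs whose edge set contains the pattern."""
    return sum(1 for g in database if pattern <= g)

def one_edge_extensions(pattern, database):
    """Grow the pattern by one edge drawn from some database graph."""
    edges = {e for g in database for e in g}
    return [pattern | {e} for e in edges - pattern]

def pattern_growth(pattern, database, min_support, results):
    """Recursively extend a frequent pattern one edge at a time (DFS).
    DFS keeps only the current growth path in memory, which is why the
    pattern-growth approach can use less memory than level-wise BFS."""
    if support(pattern, database) < min_support:
        return                       # infrequent: prune this whole branch
    if pattern and pattern not in results:
        results.append(pattern)
    for child in one_edge_extensions(pattern, database):
        if child not in results:     # naive duplicate check (see below)
            pattern_growth(child, database, min_support, results)

db = [frozenset({('A', 'B'), ('B', 'C')}),
      frozenset({('A', 'B'), ('B', 'D')})]
found = []
pattern_growth(frozenset(), db, 2, found)
print(found)   # [frozenset({('A', 'B')})] is the only edge appearing twice
```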

The simple pattern-growth algorithm for graphs is straightforward but not efficient. The bottleneck lies in the inefficiency of extending a graph: the same graph can be discovered many times. For example, there may exist n different (n − 1)-edge graphs that can all be extended to the same n-edge graph. The repeated discovery of the same graph is computationally inefficient. We call a graph that is discovered a second time a duplicate graph.

To reduce the generation of duplicate graphs, each frequent graph should be extended as conservatively as possible. This principle leads to the design of several new algorithms. The gSpan algorithm, for example, is designed to reduce the generation of duplicate graphs: it need not search previously discovered frequent graphs for duplicate detection, it does not extend any duplicate graph, and yet it still guarantees the discovery of the complete set of frequent graphs.
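
To make the duplicate problem concrete, the sketch below shows the naive remedy that such algorithms improve upon: map every generated graph to a canonical code and discard any graph whose code has already been seen. The canonical_code used here is just the sorted edge list, which is only adequate when vertices carry unique labels; gSpan instead relies on minimum DFS codes and right-most extension so that duplicates are never generated in the first place.

```python
# Naive duplicate detection via a stand-in canonical code (illustration only).
def canonical_code(graph_edges):
    """Stand-in canonical form: the sorted tuple of labeled edges."""
    return tuple(sorted(graph_edges))

def dedupe(candidates, seen):
    """Keep only candidates whose canonical code has not been seen before."""
    fresh = []
    for g in candidates:
        code = canonical_code(g)
        if code not in seen:
            seen.add(code)
            fresh.append(g)
    return fresh

# The same 2-edge graph reached through two different extension orders:
path1 = [('A', 'B'), ('B', 'C')]   # extend A-B first, then B-C
path2 = [('B', 'C'), ('A', 'B')]   # extend B-C first, then A-B
print(dedupe([path1, path2], set()))   # only the first copy is kept
```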
