What is GSP?


GSP stands for Generalized Sequential Patterns. It is a sequential pattern mining method developed by Srikant and Agrawal in 1996. It extends their seminal algorithm for frequent itemset mining, known as Apriori. GSP relies on the downward-closure property of sequential patterns and adopts a multiple-pass, candidate generate-and-test approach.

The algorithm is as follows. In the first scan of the database, it finds all of the frequent items, i.e., those with at least minimum support. Each such item yields a frequent 1-event sequence consisting of that item. Each subsequent pass starts with a seed set of sequential patterns, namely the set of sequential patterns found in the previous pass.
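
The first pass can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the database is a list of sequences, each sequence a list of events, and each event a set of items. The function name frequent_items and the toy database are hypothetical.

```python
from collections import Counter

def frequent_items(database, min_sup):
    """Return {item: support} for items found in at least min_sup sequences."""
    counts = Counter()
    for sequence in database:
        # Support counts sequences, so an item is counted once per sequence
        # no matter how many of its events contain it.
        items_in_sequence = set()
        for event in sequence:
            items_in_sequence.update(event)
        counts.update(items_in_sequence)
    return {item: n for item, n in counts.items() if n >= min_sup}

# Toy database (assumed for illustration): three sequences of events.
database = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}],
    [{"a", "d"}, {"c"}, {"b", "c"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}],
]
frequent = frequent_items(database, min_sup=2)
print(sorted(frequent.items()))
```

With min_sup set to 2, the items e and f are dropped because each appears in only one sequence.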

This seed set is used to generate new potentially frequent patterns, called candidate sequences. Each candidate sequence contains one more item than the seed sequential pattern from which it was generated (where each event in the pattern may contain one or multiple items).

The number of instances of items in a sequence is the length of the sequence. Therefore, all of the candidate sequences in a given pass have the same length. A sequence with length k is referred to as a k-sequence.
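
To make the notion of length concrete, the sketch below counts item instances and shows the two ways a sequence can grow by one item: as a new event at the end, or inside the last event. It assumes a sequence is a tuple of frozensets. The helper extend_by_one_item is hypothetical and only illustrates the "one more item" idea; it is not GSP's actual candidate join procedure.

```python
def sequence_length(sequence):
    """Length = total number of item instances across all events."""
    return sum(len(event) for event in sequence)

def extend_by_one_item(sequence, item):
    """Return the sequences obtained by adding one item at the end."""
    # Option 1: the item forms a new event after the last one.
    extensions = [sequence + (frozenset([item]),)]
    # Option 2: the item joins the last event, if it is not already there.
    if sequence and item not in sequence[-1]:
        extensions.append(sequence[:-1] + (sequence[-1] | {item},))
    return extensions

# <(a)(bc)> has two events but length 3, so it is a 3-sequence.
seed = (frozenset("a"), frozenset("bc"))
print(sequence_length(seed))
for candidate in extend_by_one_item(seed, "d"):
    print(sequence_length(candidate), [sorted(event) for event in candidate])
```

Both extensions are 4-sequences, even though one has three events and the other has two.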

Let Ck denote the set of candidate k-sequences. A pass over the database finds the support for each candidate k-sequence. The candidates in Ck with support of at least min_sup form Lk, the set of all frequent k-sequences. This set then becomes the seed set for the next pass, k+1. The algorithm terminates when no new sequential pattern is found in a pass, or when no candidate sequence can be generated.
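
The step from Ck to Lk can be sketched as below. This is a simplified illustration that assumes sequences are tuples of frozensets and ignores the time constraints, sliding windows, and taxonomies that the full GSP algorithm supports. The names contains and frequent_sequences are hypothetical.

```python
def contains(sequence, candidate):
    """True if candidate is a subsequence of sequence."""
    position = 0
    for event in candidate:
        # Match each candidate event to the earliest remaining data event
        # that is a superset of it.
        while position < len(sequence) and not event <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next candidate event must occur strictly later
    return True

def frequent_sequences(database, candidates, min_sup):
    """Return {candidate: support} for candidates meeting min_sup (Lk)."""
    result = {}
    for candidate in candidates:
        support = sum(contains(sequence, candidate) for sequence in database)
        if support >= min_sup:
            result[candidate] = support
    return result

fs = frozenset
database = [
    (fs("a"), fs("abc"), fs("ac"), fs("d")),
    (fs("ad"), fs("c"), fs("bc")),
    (fs("ef"), fs("ab"), fs("df")),
]
a_then_b = (fs("a"), fs("b"))   # <(a)(b)>
a_with_b = (fs("ab"),)          # <(ab)>
d_then_a = (fs("d"), fs("a"))   # <(d)(a)>
L2 = frequent_sequences(database, [a_then_b, a_with_b, d_then_a], min_sup=2)
```

In this toy database, <(a)(b)> and <(ab)> are each supported by two sequences, while <(d)(a)> is supported by none and is discarded.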

GSP uses the Apriori property to prune the set of candidates as follows. In the k-th pass, a sequence is a candidate only if each of its length-(k−1) subsequences is a sequential pattern found at the (k−1)-th pass.
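
The pruning rule above can be sketched as follows, assuming sequences are tuples of frozensets and the previous pass's patterns are stored in a set. The function names are hypothetical, and the sketch covers only the unconstrained case described here.

```python
def subsequences_one_shorter(candidate):
    """All sequences obtained by deleting exactly one item from candidate."""
    result = set()
    for i, event in enumerate(candidate):
        for item in event:
            reduced = event - {item}
            if reduced:
                shorter = candidate[:i] + (reduced,) + candidate[i + 1:]
            else:
                # Deleting the only item of an event removes the event itself.
                shorter = candidate[:i] + candidate[i + 1:]
            result.add(shorter)
    return result

def survives_pruning(candidate, previous_frequent):
    """Keep a k-candidate only if every (k-1)-subsequence was frequent."""
    return subsequences_one_shorter(candidate) <= previous_frequent

fs = frozenset
# Assumed frequent 2-sequences: <(a)(b)>, <(a)(c)>, <(bc)>.
L2 = {(fs("a"), fs("b")), (fs("a"), fs("c")), (fs("bc"),)}

kept = survives_pruning((fs("a"), fs("bc")), L2)              # <(a)(bc)>
dropped = survives_pruning((fs("a"), fs("b"), fs("c")), L2)   # <(a)(b)(c)>
print(kept, dropped)
```

Here <(a)(bc)> is kept because all three of its 2-subsequences are in L2, whereas <(a)(b)(c)> is pruned because <(b)(c)> is not, so its support never needs to be counted.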

A new scan of the database collects the support for each candidate sequence and finds a new set of sequential patterns, Lk. This set then becomes the seed for the next pass. The algorithm terminates when no sequential pattern is found in a pass or when no candidate sequence is generated.

Apriori-like sequential pattern mining techniques (based on candidate generate-and-test) can also be explored by mapping a sequence database into vertical data format. In vertical data format, the database becomes a set of tuples of the form (itemset: (sequence_ID, event_ID)).

The event identifier serves as a timestamp within a sequence. The event_ID of the ith itemset (or event) in a sequence is i. An itemset can occur in more than one sequence. The set of (sequence_ID, event_ID) pairs for a given itemset forms the ID_list of the itemset.
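
The mapping to vertical format can be sketched as below for single items (1-itemsets). It assumes the same list-of-events representation as a horizontal database and numbers sequences and events from 1. The function name to_vertical is hypothetical.

```python
from collections import defaultdict

def to_vertical(database):
    """Map each item to its ID_list of (sequence_ID, event_ID) pairs."""
    id_lists = defaultdict(list)
    for sequence_id, sequence in enumerate(database, start=1):
        for event_id, event in enumerate(sequence, start=1):
            for item in sorted(event):
                id_lists[item].append((sequence_id, event_id))
    return dict(id_lists)

database = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}],
    [{"a", "d"}, {"c"}, {"b", "c"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}],
]
id_lists = to_vertical(database)
print(id_lists["b"])

# The support of an item is the number of distinct sequence_IDs in its ID_list.
support_b = len({sequence_id for sequence_id, _ in id_lists["b"]})
print(support_b)
```

In this layout the support of an item can be read directly from its ID_list, without rescanning the original sequences.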

Updated on: 17-Feb-2022
