Aho-Corasick Algorithm for Pattern Searching in C++

C++Server Side ProgrammingProgramming

In this problem, we are given an input string and an array arr[]. Our task is to find all occurrences of all words of the array in the string. For this, we will be using the Aho-Corasick Algorithm for Pattern Searching.

String and pattern searching is an important thing in programming. And in programming, the better the algorithm the more practical uses it can have. Aho-Corasick algorithm is a very important and powerful algorithm that makes string searching easy. It is kind of a dictionary matching algorithm, matching all the strings simultaneously. The algorithm uses the Trie data structure for its implementation.

Trie data structure

Trie is a kind of a prefix tree or a digital search tree, where each edge is labeled by some letter (each outgoing edge having different letters).

Let’s take an example to understand Aho-Corasick algorithm

Input 

string = "bheythisghisanexample"
arr[] = {"hey", "this", "is", "an", “example”}

Output 

Word hey starts from 2
Word this starts from 5
Word is starts from 11
Word an starts from 13
Word example starts from 15

The time complexity of this algorithm is O(N+L+Z), where N= Length of input of string/text

L= Length of keywords (words in the array)

Z= number of matches.

Implementation

Aho-Corasick algorithm can be constructed with these easy steps

  • Construct the trie using queue so that we can pop each character in the queue as a node od ‘trie’.

  • Construct failure links (suffix links) as an array which can store the next and current character

  • Construct output links as an array to store the matching words

  • Build a Traverse function (FindNextState) to check all the characters.

Failure Link (suffix link) − When we hit a part of the string where we cannot continue to read characters, we fall back by following suffix links to try to preserve as much context as possible. In brief, it stores all edges that are followed when a current character doesn't have an edge in the Trie.

Output Link − It always pointing to the node corresponding to the longest word that is present in the current state, we ensure that we chain together all the patterns using output links.

Example

 Live Demo

#include<iostream>
#include <string.h>
#include<algorithm>
#include<queue>
using namespace std;
const int MaxStates = 6 * 50 + 10;
const int MaxChars = 26;
int OccurenceOfWords[MaxStates];
int FF[MaxStates];
int GotoFunction[MaxStates][MaxChars];
int BuildMatchingMachine(const vector<string> &words, char lowestChar = 'a', char highestChar = 'z'){
   memset(OccurenceOfWords, 0, sizeof OccurenceOfWords);
   memset(FF, -1, sizeof FF);
   memset(GotoFunction, -1, sizeof GotoFunction);
   int states = 1;
   for (int i = 0; i < words.size(); ++i){
      const string &keyword = words[i];
      int currentState = 0;
      for (int j = 0; j < keyword.size(); ++j){
         int c = keyword[j] - lowestChar;
         if (GotoFunction[currentState][c] == -1){
            GotoFunction[currentState][c] = states++;
         }
         currentState = GotoFunction[currentState][c];
      }
      OccurenceOfWords[currentState] |= (1 << i);
   }
   for (int c = 0; c < MaxChars; ++c){
      if (GotoFunction[0][c] == -1){
         GotoFunction[0][c] = 0;
      }
   }
   queue<int> q;
   for (int c = 0; c <= highestChar - lowestChar; ++c){
      if (GotoFunction[0][c] != -1 && GotoFunction[0][c] != 0){
         FF[GotoFunction[0][c]] = 0;
         q.push(GotoFunction[0][c]);
      }
   }
   while (q.size()){
      int state = q.front();
      q.pop();
      for (int c = 0; c <= highestChar - lowestChar; ++c){
         if (GotoFunction[state][c] != -1){
            int failure = FF[state];
            while (GotoFunction[failure][c] == -1){
               failure = FF[failure];
            }
            failure = GotoFunction[failure][c];
            FF[GotoFunction[state][c]] = failure;
            OccurenceOfWords[GotoFunction[state][c]] |= OccurenceOfWords[failure];
            q.push(GotoFunction[state][c]);
         }
      }
   }
   return states;
}
int FindNextState(int currentState, char nextInput, char lowestChar = 'a'){
   int answer = currentState;
   int c = nextInput - lowestChar;
   while (GotoFunction[answer][c] == -1){
      answer = FF[answer];
   }
   return GotoFunction[answer][c];
}
vector<int> FindWordCount(string str, vector<string> keywords, char lowestChar = 'a', char highestChar = 'z') {
   BuildMatchingMachine(keywords, lowestChar, highestChar);
   int currentState = 0;
   vector<int> retVal;
   for (int i = 0; i < str.size(); ++i){
      currentState = FindNextState(currentState, str[i], lowestChar);
      if (OccurenceOfWords[currentState] == 0)
         continue;
      for (int j = 0; j < keywords.size(); ++j){
         if (OccurenceOfWords[currentState] & (1 << j)){
            retVal.insert(retVal.begin(), i - keywords[j].size() + 1);
         }
      }
   }
   return retVal;
}
int main(){
   vector<string> keywords;
   keywords.push_back("All");
   keywords.push_back("she");
   keywords.push_back("is");
   string str = "Allisheall";
   cout<<"The occurrences of all words in the string ' "<<str<<" ' are \n";
   vector<int> states = FindWordCount(str, keywords);
   for(int i=0; i < keywords.size(); i++){
      cout<<"Word "<<keywords.at(i)<<' ';
      cout<<"starts at "<<states.at(i)+1<<' ';
      cout<<"And ends at "<<states.at(i)+keywords.at(i).size()+1<<endl;
   }
}

Output

The occurrences of all words in the string ' Allisheall ' are
Word All starts at 5 And ends at 8
Word she starts at 4 And ends at 7
Word is starts at 1 And ends at 3
raja
Published on 06-Aug-2020 11:20:16
Advertisements