Aho-Corasick Algorithm for Pattern Searching in C++
In this problem, we are given an input string and an array arr[]. Our task is to find all occurrences of all words of the array in the string. For this, we will be using the Aho-Corasick Algorithm for Pattern Searching.
Searching a text for patterns is a common task in programming, and the choice of algorithm matters when the text is long or the patterns are many. The Aho-Corasick algorithm is a dictionary-matching algorithm: it searches for all the given words simultaneously in a single pass over the text, rather than scanning the text once per word. It is built on the trie data structure.
Trie data structure
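A trie, also called a prefix tree or digital search tree, is a tree in which each edge is labeled with a letter and the outgoing edges of any node carry distinct letters. Each node therefore represents the string spelled out by the edge labels on the path from the root to that node.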
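To make the definition concrete, here is a minimal trie sketch. The names Trie, Node, insert and contains are illustrative and not part of the article's code, and the sketch assumes lowercase keywords over 'a'..'z'.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical minimal trie over 'a'..'z'; node 0 is the root.
struct Trie {
    struct Node {
        int next[26];  // child index per letter, -1 if the edge is absent
        bool isWord;   // true if an inserted word ends at this node
        Node() : isWord(false) { for (int &n : next) n = -1; }
    };
    std::vector<Node> nodes;

    Trie() : nodes(1) {}

    // Walk down from the root, creating missing edges along the way.
    void insert(const std::string &word) {
        int cur = 0;
        for (char ch : word) {
            assert(ch >= 'a' && ch <= 'z');  // assumption: lowercase input only
            int c = ch - 'a';
            if (nodes[cur].next[c] == -1) {
                nodes[cur].next[c] = static_cast<int>(nodes.size());
                nodes.push_back(Node());
            }
            cur = nodes[cur].next[c];
        }
        nodes[cur].isWord = true;
    }

    // True only if `word` was inserted as a whole word (not just a prefix).
    bool contains(const std::string &word) const {
        int cur = 0;
        for (char ch : word) {
            if (ch < 'a' || ch > 'z') return false;
            cur = nodes[cur].next[ch - 'a'];
            if (cur == -1) return false;
        }
        return nodes[cur].isWord;
    }
};
```

After inserting "hey", "this" and "is", the trie holds ten nodes (the root plus one node per letter, since these three words share no prefix), and contains("th") is false because "th" is only a prefix of a stored word.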
Let's take an example to understand the Aho-Corasick algorithm.
Input
string = "bheythisghisanexample"
arr[] = {"hey", "this", "is", "an", "example"}
Output
Word hey starts from 2
Word this starts from 5
Word is starts from 7
Word is starts from 11
Word an starts from 13
Word example starts from 15
Positions are 1-based. The word "is" is reported twice because it also occurs inside "this".
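As a cross-check for the example, the positions can be reproduced with a brute-force search that scans the text once per word. This is not Aho-Corasick, only a baseline to compare against; the helper name findAllNaive is an assumption made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical baseline: returns (1-based start, word) for every occurrence,
// sorted by start position. It rescans the text for each word, which is the
// repeated work that Aho-Corasick avoids.
std::vector<std::pair<int, std::string> > findAllNaive(
        const std::string &text, const std::vector<std::string> &words) {
    std::vector<std::pair<int, std::string> > hits;
    for (const std::string &w : words) {
        assert(!w.empty());  // an empty word would match at every position
        // Resume one character after each hit so overlapping matches are kept.
        for (std::size_t pos = text.find(w); pos != std::string::npos;
             pos = text.find(w, pos + 1)) {
            hits.push_back(std::make_pair(static_cast<int>(pos) + 1, w));
        }
    }
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

Called on "bheythisghisanexample" with the five words above, it returns six hits: hey at 2, this at 5, is at 7 and 11, an at 13 and example at 15.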
The time complexity of this algorithm is O(N + L + Z), where
N = length of the input string (text)
L = total length of all keywords (words in the array)
Z = number of matches
Implementation
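The Aho-Corasick automaton can be constructed in these steps
Build the trie by inserting every keyword character by character, creating a new node (state) whenever an edge is missing.
Construct the failure links (suffix links) as an array indexed by state, which stores the state to fall back to when the current state has no edge for the next character. They are computed with a breadth-first traversal of the trie using a queue.
Construct the output links as an array indexed by state, which records the keywords that end at that state.
Build a traversal function (FindNextState) that takes the current state and the next character of the text and returns the next state.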
Failure Link (suffix link) − When we reach a state that has no edge for the next character of the text, we fall back by following failure links so as to preserve as much context as possible. The failure link of a state points to the state that represents the longest proper suffix of the current string that is also a prefix of some keyword; the root is used when no such suffix exists.
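The failure links can be computed breadth-first, so that every state's link is derived from links of shallower states that are already known. The following is a minimal sketch under the assumption of lowercase keywords; the Automaton struct and buildFailureLinks are names invented for this illustration.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

struct Automaton {
    std::vector<std::vector<int> > next;  // next[state][letter], -1 if absent
    std::vector<int> fail;                // fail[state] = state to fall back to
};

// Hypothetical helper: builds the trie, then fills in the failure links.
Automaton buildFailureLinks(const std::vector<std::string> &words) {
    Automaton a;
    a.next.push_back(std::vector<int>(26, -1));  // state 0 is the root
    for (const std::string &w : words) {
        int cur = 0;
        for (char ch : w) {
            assert(ch >= 'a' && ch <= 'z');  // assumption: lowercase only
            int c = ch - 'a';
            if (a.next[cur][c] == -1) {
                a.next[cur][c] = static_cast<int>(a.next.size());
                a.next.push_back(std::vector<int>(26, -1));
            }
            cur = a.next[cur][c];
        }
    }
    a.fail.assign(a.next.size(), 0);  // depth-1 states fall back to the root
    std::queue<int> q;
    for (int c = 0; c < 26; ++c)
        if (a.next[0][c] != -1) q.push(a.next[0][c]);
    while (!q.empty()) {
        int s = q.front();
        q.pop();
        for (int c = 0; c < 26; ++c) {
            int child = a.next[s][c];
            if (child == -1) continue;
            // Follow the parent's failure chain until a state has an edge on c.
            int f = a.fail[s];
            while (f != 0 && a.next[f][c] == -1) f = a.fail[f];
            if (a.next[f][c] != -1) f = a.next[f][c];
            a.fail[child] = f;
            q.push(child);
        }
    }
    return a;
}
```

For the keywords "she" and "he", the states are numbered 1 = "s", 2 = "sh", 3 = "she", 4 = "h", 5 = "he". The sketch yields fail[2] = 4 and fail[3] = 5: after reading "she", the longest suffix that is still a keyword prefix is "he".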
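Output Link − The output link of a state points to the state that represents the longest keyword that is a proper suffix of the current string. Following output links from state to state chains together all the keywords that end at the current position in the text. The code below takes a shortcut: instead of storing links, it merges the set of matching keywords of the failure state into each state as a bitmask.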
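For comparison with the bitmask approach, here is a sketch that stores explicit output links and walks them to report matches, which removes the limit on the number of keywords imposed by the width of an int. It assumes distinct, non-empty, lowercase keywords; Node and findWithOutputLinks are hypothetical names, not taken from the article's code.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node {
    int next[26];  // trie edges, -1 if absent
    int fail;      // failure link
    int out;       // output link: nearest keyword state on the failure chain, -1 if none
    int word;      // index of the keyword ending at this state, -1 if none
    Node() : fail(0), out(-1), word(-1) { for (int &n : next) n = -1; }
};

// Returns (keyword index, 0-based start) for every occurrence in `text`.
std::vector<std::pair<int, int> > findWithOutputLinks(
        const std::string &text, const std::vector<std::string> &words) {
    std::vector<Node> t(1);  // state 0 is the root
    for (int i = 0; i < static_cast<int>(words.size()); ++i) {
        assert(!words[i].empty());
        int cur = 0;
        for (char ch : words[i]) {
            assert(ch >= 'a' && ch <= 'z');  // assumption: lowercase only
            int c = ch - 'a';
            if (t[cur].next[c] == -1) {
                t[cur].next[c] = static_cast<int>(t.size());
                t.push_back(Node());
            }
            cur = t[cur].next[c];
        }
        t[cur].word = i;  // assumption: keywords are distinct
    }
    std::queue<int> q;
    for (int c = 0; c < 26; ++c)
        if (t[0].next[c] != -1) q.push(t[0].next[c]);
    while (!q.empty()) {
        int s = q.front();
        q.pop();
        for (int c = 0; c < 26; ++c) {
            int child = t[s].next[c];
            if (child == -1) continue;
            int f = t[s].fail;
            while (f != 0 && t[f].next[c] == -1) f = t[f].fail;
            if (t[f].next[c] != -1) f = t[f].next[c];
            t[child].fail = f;
            // The failure target itself if it ends a keyword, else its output link.
            t[child].out = (t[f].word != -1) ? f : t[f].out;
            q.push(child);
        }
    }
    std::vector<std::pair<int, int> > hits;
    int cur = 0;
    for (int i = 0; i < static_cast<int>(text.size()); ++i) {
        char ch = text[i];
        if (ch < 'a' || ch > 'z') { cur = 0; continue; }  // no keyword spans this character
        int c = ch - 'a';
        while (cur != 0 && t[cur].next[c] == -1) cur = t[cur].fail;
        if (t[cur].next[c] != -1) cur = t[cur].next[c];
        // Report the keyword ending here, then every shorter one via output links.
        for (int s = (t[cur].word != -1) ? cur : t[cur].out; s != -1; s = t[s].out) {
            int w = t[s].word;
            hits.push_back(std::make_pair(w, i - static_cast<int>(words[w].size()) + 1));
        }
    }
    return hits;
}
```

For the text "ushers" and the keywords {"he", "she", "hers"}, the state for "she" has an output link to the state for "he", so both are reported when the "e" is read: the result is (1, 1), (0, 2), (2, 2), that is "she" at index 1, then "he" and "hers" at index 2.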
Example
#include <cstring>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Upper bound on the number of states; must exceed the total length of all keywords.
const int MaxStates = 6 * 50 + 10;
// Alphabet size: the machine handles lowercase letters 'a'..'z' only.
const int MaxChars = 26;

// Bit i of OccurenceOfWords[s] is set if keyword i ends at state s,
// so the number of keywords is limited by the width of int.
int OccurenceOfWords[MaxStates];
// Failure function: the state to fall back to on a mismatch.
int FF[MaxStates];
// Goto function (trie edges); -1 means there is no edge.
int GotoFunction[MaxStates][MaxChars];

int BuildMatchingMachine(const vector<string> &words, char lowestChar = 'a', char highestChar = 'z') {
   memset(OccurenceOfWords, 0, sizeof OccurenceOfWords);
   memset(FF, -1, sizeof FF);
   memset(GotoFunction, -1, sizeof GotoFunction);
   int states = 1;
   // Step 1: insert every keyword into the trie.
   for (int i = 0; i < (int)words.size(); ++i) {
      const string &keyword = words[i];
      int currentState = 0;
      for (int j = 0; j < (int)keyword.size(); ++j) {
         int c = keyword[j] - lowestChar;
         if (GotoFunction[currentState][c] == -1) {
            GotoFunction[currentState][c] = states++;
         }
         currentState = GotoFunction[currentState][c];
      }
      OccurenceOfWords[currentState] |= (1 << i);
   }
   // Missing edges at the root loop back to the root.
   for (int c = 0; c < MaxChars; ++c) {
      if (GotoFunction[0][c] == -1) {
         GotoFunction[0][c] = 0;
      }
   }
   // Step 2: breadth-first traversal to compute the failure function.
   queue<int> q;
   for (int c = 0; c <= highestChar - lowestChar; ++c) {
      if (GotoFunction[0][c] != -1 && GotoFunction[0][c] != 0) {
         FF[GotoFunction[0][c]] = 0;
         q.push(GotoFunction[0][c]);
      }
   }
   while (!q.empty()) {
      int state = q.front();
      q.pop();
      for (int c = 0; c <= highestChar - lowestChar; ++c) {
         if (GotoFunction[state][c] != -1) {
            int failure = FF[state];
            while (GotoFunction[failure][c] == -1) {
               failure = FF[failure];
            }
            failure = GotoFunction[failure][c];
            FF[GotoFunction[state][c]] = failure;
            // Inherit the keywords that end at the failure state.
            OccurenceOfWords[GotoFunction[state][c]] |= OccurenceOfWords[failure];
            q.push(GotoFunction[state][c]);
         }
      }
   }
   return states;
}

// Follows failure links until the current character can be consumed.
int FindNextState(int currentState, char nextInput, char lowestChar = 'a') {
   int answer = currentState;
   int c = nextInput - lowestChar;
   while (GotoFunction[answer][c] == -1) {
      answer = FF[answer];
   }
   return GotoFunction[answer][c];
}

// Returns (keyword index, 0-based start position) for every match, in text order.
vector<pair<int, int> > FindWordCount(const string &str, const vector<string> &keywords, char lowestChar = 'a', char highestChar = 'z') {
   BuildMatchingMachine(keywords, lowestChar, highestChar);
   int currentState = 0;
   vector<pair<int, int> > retVal;
   for (int i = 0; i < (int)str.size(); ++i) {
      currentState = FindNextState(currentState, str[i], lowestChar);
      if (OccurenceOfWords[currentState] == 0) continue;
      for (int j = 0; j < (int)keywords.size(); ++j) {
         if (OccurenceOfWords[currentState] & (1 << j)) {
            retVal.push_back(make_pair(j, i - (int)keywords[j].size() + 1));
         }
      }
   }
   return retVal;
}

int main() {
   // The text and keywords must be lowercase, since the alphabet is 'a'..'z'.
   vector<string> keywords;
   keywords.push_back("all");
   keywords.push_back("she");
   keywords.push_back("is");
   string str = "allisheall";
   cout << "The occurrences of all words in the string '" << str << "' are\n";
   vector<pair<int, int> > matches = FindWordCount(str, keywords);
   for (int i = 0; i < (int)matches.size(); i++) {
      const string &word = keywords[matches[i].first];
      int start = matches[i].second + 1;   // convert to a 1-based position
      int end = start + (int)word.size() - 1;
      cout << "Word " << word << " starts at " << start << " and ends at " << end << endl;
   }
}
Output
The occurrences of all words in the string 'allisheall' are
Word all starts at 1 and ends at 3
Word is starts at 4 and ends at 5
Word she starts at 5 and ends at 7
Word all starts at 8 and ends at 10