Boyer Moore Algorithm



Boyer Moore Algorithm for Pattern Matching

The Boyer Moore Algorithm is used to determine whether a given pattern is present within a specified text or not. It follows a backward approach for pattern searching/matching. The task of searching a particular pattern within a given string is known as a pattern searching problem. For example, if the text is "THIS IS A SAMPLE TEXT" and the pattern is "TEXT", then the output should be 10, which is the index of the first occurrence of pattern in the given text.

This algorithm was developed by Robert Boyer and J Strother Moore in 1977. It is considered as the most efficient and widely used algorithm for pattern matching.

How does Boyer Moore Algorithm work?

In the previous chapters, we have seen the naive way to solve this problem which involves sliding the pattern over the text one by one and comparing each character. However, this approach is very slow, as it takes O(n*m) time, where 'n' is the length of the text and 'm' is the length of the pattern. The Boyer Moore algorithm improves this by preprocessing the pattern and using two heuristics to skip some comparisons that are not going to match.

The two heuristics are as follows −

  • Bad character heuristic − This heuristic uses a table that stores the last occurrence of each character in the pattern. When a mismatch occurs at some character(bad character) in the text, the algorithm checks if this character appears in the pattern. If it does, then it shifts the pattern such that the last occurrence of this character in the pattern aligns with the bad character in the text. If it does not, then it shifts the pattern past the bad character.

  • Good suffix heuristic − This heuristic uses another table that stores shift information when the bad heuristic fails. In this case, we look within the pattern till bad character become good suffix of the text. Then we shift onward to find the given pattern.

Boyer Solution

The Boyer-Moore algorithm combines these two heuristics by choosing the maximum shift suggested by them at each step. In this procedure, the substring or pattern is searched from the last character of the pattern. When a substring of the main string matches with a substring of the pattern, it moves to find other occurrences of the matched substring. If there is a mismatch, it applies the heuristics and shifts the pattern accordingly. The algorithm stops when it finds a complete match or when it reaches the end of the text.

The Boyer-Moore algorithm has a worst-case time complexity of O(nm), but, it can perform much better than that. In fact, in some cases, it can achieve a sublinear time complexity of O(n/m), which means that it can skip some characters in the text without comparing them. This happens when the pattern has no repeated characters or when it has a large alphabet size.

To illustrate how the Boyer-Moore algorithm works, let's consider an example −

Input:
main String: "AABAAABCEDBABCDDEBC" 
pattern: "ABC"
Output:
Pattern found at position: 5
Pattern found at position: 11

Example

In the following example, we are going to illustrate the working of Boyer-Moore Algorithm in various programming languages.

#include<stdio.h> 
#include<string.h> 
// Function for full suffix match
void computeFullShift(int shiftArr[], int longSuffArr[], char patrn[], int n) {
   int i = n;
   int j = n+1;
   longSuffArr[i] = j;
   while(i > 0) {
      // Searching right if (i-1)th and (j-1)th item are not the same
      while(j <= n && patrn[i-1] != patrn[j-1] ) {
          // to shift pattern from i to j
         if(shiftArr[j] == 0) {
            shiftArr[j] = j-i; 
         }
         // Updating long suffix value
         j = longSuffArr[j]; 
      }
      i--;
      j--;
      longSuffArr[i] = j;
   }  
}
// Function for good suffix match
void computeGoodSuffix(int shiftArr[], int longSuffArr[], char patrn[], int n) {
   int j;
   j = longSuffArr[0];
   // Looping through the pattern
   for(int i = 0; i<n; i++) {
      // setting shift to long suffix value
      if(shiftArr[i] == 0) {
         shiftArr[i] = j; 
         if(i == j) {
            // Updating long suffix value
            j = longSuffArr[j]; 
         }
      }
   }
}
// Function for searching the pattern
void searchPattern(char orgnStr[], char patrn[], int array[], int *index) {
   // length of the pattern
   int patLen = strlen(patrn); 
   // length of main string
   int strLen = strlen(orgnStr);
   int longerSuffArray[patLen+1];
   int shiftArr[patLen + 1];
   // Initializing shift array elements to 0
   for(int i = 0; i<=patLen; i++) {
      shiftArr[i] = 0;
   }
   // Calling computeFullShift function
   computeFullShift(shiftArr, longerSuffArray, patrn, patLen); 
   // Calling computeGoodSuffix function
   computeGoodSuffix(shiftArr, longerSuffArray, patrn, patLen); 
   int shift = 0;
   while(shift <= (strLen - patLen)) {
      int j = patLen - 1;
      // decrement j when pattern and main string character is matching
      while(j >= 0 && patrn[j] == orgnStr[shift+j]) {
         j--; 
      }
      if(j < 0) {
         (*index)++;
         // to store the position where pattern is found
         array[(*index)] = shift; 
         shift += shiftArr[0];
      }else {
          shift += shiftArr[j+1];
      }
   }
}
int main() {
   // original string    
   char orgnStr[] = "AABAAABCEDBABCDDEBC"; 
   // Pattern to be searched
   char patrn[] = "ABC"; 
   // Array to store the positions where pattern is found
   int locArray[strlen(orgnStr)]; 
   int index = -1;
   // Calling the searchPattern function
   searchPattern(orgnStr, patrn, locArray, &index); 
   // Printing the positions where pattern is found
   for(int i = 0; i <= index; i++) {
      printf("Pattern found at position: %d\n", locArray[i]);
   }
   return 0;
}
#include<iostream> 
using namespace std; 
// Function for full suffix match
void computeFullShift(int shiftArr[], int longSuffArr[], string patrn) {
   // length of pattern
   int n = patrn.size(); 
   int i = n;
   int j = n+1;
   longSuffArr[i] = j;
   while(i > 0) {
      // Searching right if (i-1)th and (j-1)th item are not the same
      while(j <= n && patrn[i-1] != patrn[j-1] ) {
          // to shift pattern from i to j
         if(shiftArr[j] == 0) {
            shiftArr[j] = j-i; 
         }
         // Updating long suffix value
         j = longSuffArr[j]; 
      }
      i--;
      j--;
      longSuffArr[i] = j;
   }  
}
// Function for good suffix match
void computeGoodSuffix(int shiftArr[], int longSuffArr[], string patrn) {
   // length of the pattern
   int n = patrn.size(); 
   int j;
   j = longSuffArr[0];
   // Looping through the pattern
   for(int i = 0; i<n; i++) {
      // setting shift to long suffix value
      if(shiftArr[i] == 0) {
         shiftArr[i] = j; 
         if(i == j) {
            // Updating long suffix value
            j = longSuffArr[j]; 
         }
      }
   }
}
// Function for searching the pattern
void searchPattern(string orgnStr, string patrn, int array[], int *index) {
   // length of the pattern
   int patLen = patrn.size(); 
   // length of main string
   int strLen = orgnStr.size();
   int longerSuffArray[patLen+1];
   int shiftArr[patLen + 1];
   // Initializing shift array elements to 0
   for(int i = 0; i<=patLen; i++) {
      shiftArr[i] = 0;
   }
   // Calling computeFullShift function
   computeFullShift(shiftArr, longerSuffArray, patrn); 
   // Calling computeGoodSuffix function
   computeGoodSuffix(shiftArr, longerSuffArray, patrn); 
   int shift = 0;
   while(shift <= (strLen - patLen)) {
      int j = patLen - 1;
      // decrement j when pattern and main string character is matching
      while(j >= 0 && patrn[j] == orgnStr[shift+j]) {
         j--; 
      }
      if(j < 0) {
         (*index)++;
         // to store the position where pattern is found
         array[(*index)] = shift; 
         shift += shiftArr[0];
      }else {
          shift += shiftArr[j+1];
      }
   }
}
int main() {
   // original string    
   string orgnStr = "AABAAABCEDBABCDDEBC"; 
   // Pattern to be searched
   string patrn = "ABC"; 
   // Array to store the positions where pattern is found
   int locArray[orgnStr.size()]; 
   int index = -1;
   // Calling the searchPattern function
   searchPattern(orgnStr, patrn, locArray, &index); 
   // Printing the positions where pattern is found
   for(int i = 0; i <= index; i++) {
      cout << "Pattern found at position: " << locArray[i]<<endl;
   }
}
public class BMalgo {
   // method for full suffix match
   static void computeFullShift(int[] shiftArr, int[] longSuffArr, String patrn) {
      // length of pattern
      int n = patrn.length();
      int i = n;
      int j = n+1;
      longSuffArr[i] = j;
      while(i > 0) {
         // Searching right if (i-1)th and (j-1)th item are not the same
         while(j <= n && patrn.charAt(i-1) != patrn.charAt(j-1)) {
            // to shift pattern from i to j
            if(shiftArr[j] == 0) {
               shiftArr[j] = j-i;
            }
            // Updating long suffix value
            j = longSuffArr[j];
         }
         i--;
         j--;
         longSuffArr[i] = j;
      }
   }
   // method for good suffix match
   static void computeGoodSuffix(int[] shiftArr, int[] longSuffArr, String patrn) {
      // length of the pattern
      int n = patrn.length();
      int j;
      j = longSuffArr[0];
      // Looping through the pattern
      for(int i = 0; i<n; i++) {
         // setting shift to long suffix value
         if(shiftArr[i] == 0) {
            shiftArr[i] = j;
            if(i == j) {
               // Updating long suffix value
               j = longSuffArr[j];
            }
         }
      }
   }
   // method for searching the pattern
   static void searchPattern(String orgnStr, String patrn, int[] array, int[] index) {
      // length of the pattern
      int patLen = patrn.length();
      // length of main string
      int strLen = orgnStr.length();
      int[] longerSuffArray = new int[patLen+1];
      int[] shiftArr = new int[patLen + 1];
      // Initializing shift array elements to 0
      for(int i = 0; i<=patLen; i++) {
         shiftArr[i] = 0;
      }
      // Calling computeFullShift method
      computeFullShift(shiftArr, longerSuffArray, patrn);
      // Calling computeGoodSuffix method
      computeGoodSuffix(shiftArr, longerSuffArray, patrn);
      int shift = 0;
      while(shift <= (strLen - patLen)) {
         int j = patLen - 1;
         // decrement j when pattern and main string character is matching
         while(j >= 0 && patrn.charAt(j) == orgnStr.charAt(shift+j)) {
            j--;
         }
         if(j < 0) {
            index[0]++;
            // to store the position where pattern is found
            array[index[0]] = shift;
            shift += shiftArr[0];
         }else {
            shift += shiftArr[j+1];
         }
      }
   }
   public static void main(String[] args) {
      // original string
      String orgnStr = "AABAAABCEDBABCDDEBC";
      // Pattern to be searched
      String patrn = "ABC";
      // Array to store the positions where pattern is found
      int[] locArray = new int[orgnStr.length()];
      int[] index = {-1};
      // Calling the searchPattern method
      searchPattern(orgnStr, patrn, locArray, index);
      // Printing the positions where pattern is found
      for(int i = 0; i <= index[0]; i++) {
         System.out.println("Pattern found at position: " + locArray[i]);
      }
   }
}
# Function for full suffix match
def compute_full_shift(shift_arr, long_suff_arr, patrn):
    # length of pattern
    n = len(patrn)
    i = n
    j = n+1
    long_suff_arr[i] = j
    while i > 0:
        # Searching right if (i-1)th and (j-1)th item are not the same
        while j <= n and patrn[i-1] != patrn[j-1]:
            # to shift pattern from i to j
            if shift_arr[j] == 0:
                shift_arr[j] = j-i
            # Updating long suffix value
            j = long_suff_arr[j]
        i -= 1
        j -= 1
        long_suff_arr[i] = j
# Function for good suffix match
def compute_good_suffix(shift_arr, long_suff_arr, patrn):
    # length of the pattern
    n = len(patrn)
    j = long_suff_arr[0]
    # Looping through the pattern
    for i in range(n):
        # setting shift to long suffix value
        if shift_arr[i] == 0:
            shift_arr[i] = j
            if i == j:
                # Updating long suffix value
                j = long_suff_arr[j]

# Function for searching the pattern
def search_pattern(orgn_str, patrn, array, index):
    # length of the pattern
    pat_len = len(patrn)
    # length of main string
    str_len = len(orgn_str)
    longer_suff_array = [0]*(pat_len+1)
    shift_arr = [0]*(pat_len + 1)
    # Initializing shift array elements to 0
    for i in range(pat_len+1):
        shift_arr[i] = 0
    # Calling compute_full_shift function
    compute_full_shift(shift_arr, longer_suff_array, patrn)
    # Calling compute_good_suffix function
    compute_good_suffix(shift_arr, longer_suff_array, patrn)
    shift = 0
    while shift <= (str_len - pat_len):
        j = pat_len - 1
        # decrement j when pattern and main string character is matching
        while j >= 0 and patrn[j] == orgn_str[shift+j]:
            j -= 1
        if j < 0:
            index[0] += 1
            # to store the position where pattern is found
            array[index[0]] = shift
            shift += shift_arr[0]
        else:
            shift += shift_arr[j+1]

# original string
orgn_str = "AABAAABCEDBABCDDEBC"
# Pattern to be searched
patrn = "ABC"
# Array to store the positions where pattern is found
loc_array = [0]*len(orgn_str)
index = [-1]
# Calling the search_pattern function
search_pattern(orgn_str, patrn, loc_array, index)
# Printing the positions where pattern is found
for i in range(index[0]+1):
    print("Pattern found at position: ", loc_array[i])

Output

Pattern found at position: 5
Pattern found at position: 11
Advertisements