Find the Longest Common Substring using Binary search and Rolling Hash


In this problem, we will find the longest common substring using the binary search and rolling hash algorithm.

The binary search is an efficient technique for searching values in the sorted array. Here, we will use it to find the maximum length of the common substring. The rolling hash algorithm is used to calculate the hash value for the string. Also, it calculates the hash value for the next substring from the previous substring’s hash value when we add or remove a character from the previous substring.

Problem statement − We have given two strings named the first and second. We need to find the longest common substring using the binary search and rolling hash algorithm.

Sample examples

Input

first = "qwertdfghj"; second = "aqwertdfj";

Output

7

Explanation − The longest common substring is ‘qwertdfj’.

Input

first = "abcd"; second = "pqrs";

Output

0

Explanation − There is no common substring. So, it prints 0.

Input

first = "mnokjl"; second = "mnokjl";

Output

6

Explanation − Both strings are same. So, it prints the length of the string.

Approach 1

In this approach, we will take the 0 as a minimum value and the string’s length as a maximum value for the binary search technique. We will take the middle length and find the rolling hash value for each substring of the middle length of the first string.

Next, we will find the rolling hash value of each substring of the middle length of the second string. If the hash value matches any hash value of the substring of the first string, we found the common substring of the middle length. So, we need to find the maximum length of a common substring in the right part. Otherwise, we need to find the maximum length in the left part.

Algorithm

Step 1 − Initialize the ‘mini’ with 0 and ‘maxi’ with the maximum length of the first and second string’s length, representing the minimum and maximum range of the binary search technique.

Step 2 − Make traversals using the while loop until the value of ‘mini’ is less than a ‘maxi’.

Step 3− Take the middle length using the ‘mini’ and ‘maxi’ values. Also, initialize the ‘isPresent’ boolean value with false to track whether the middle length's common substring is present.

Step 4 − Initialize the set named ‘hash values’ to store the hash value of the substrings of the first string.

Step 5 − Take each substring of the middle length of the first string, calculate the hash value, and insert it into the ‘hashValues’ set.

Step 5.1 − In the calculate function, ‘val’ with 0 to store the hash value of the previous substring, and ‘powVal’ to store the power value for each iteration.

Step 5.2 − Start traversing the string, and take a character value between 1 to 26.

Step 5.3− Multiply the ‘powVal’ with a character value, add it to the ‘val’, take modulo with the modulus, and store it back to the ‘val’ variable.

Step 5.4 − Multiply the ‘powVal’ with 26 and take its modulo with the ‘modulus’.

Step 5.5 − Return ‘val’ at the end of the calculate() function.

Step 6 − Next, traverse the second string to take each substring of the middle length and calculate the hash value.

Step 7 − If a hash value related to any substring of the second string exists in the ‘hashValues’ set, update the ‘isPresent’ to true, and break the loop.

Step 8 − If ‘isPresent’ is true, update ‘mini’ with ‘middle’. Otherwise, update the ‘maxi’ with the middle – 1.

Step 9 − At last, return the value of the ‘mini’ variable.

Example

#include <iostream>
#include <unordered_set>
#include <cmath>
using namespace std;

long long modulus = 1000000007;
    // Calculate the hash value for the given string
long long calculate(string str) {
    long long val = 0;
    long long powVal = 1;
    // Traverse each character of the string
    for (int p = 0; p < str.length(); p++) {
        // Get character value between 1 to 26
        long long ch = str[p] - 'A' + 1;
        // Calculate hash value
        val = (val + ch * powVal) % ::modulus;
        powVal = (powVal * 26) % ::modulus;
    }
    return val;
}
int maxSubStr(string first, string second) {
    int mini = 0, maxi = min(first.length(), second.length());
    // Binary search iterations until mini is less than the maxi
    while (mini < maxi) {
        // Get middle value
        int middle = (mini + maxi + 1) / 2;
        bool isPresent = false;
        // To store hash values of substring
        unordered_set<long long> hashValues;
        for (int p = 0; p + middle <= first.length(); p++) { // Calculating hash value for each substring of the middle length of first string
            long long hashVal = calculate(first.substr(p, middle));
            // Insert to map
            hashValues.insert(hashVal);
        }
        for (int p = 0; p + middle <= second.length(); p++) {
            long long hashVal = calculate(second.substr(p, middle));
            // If the hash value exists in the map, a common substring of length K exists.
            if (hashValues.count(hashVal)) {
                isPresent = true;
                break;
            }
        }
        // Update the pointer according to whether a string of middle length is present or not
        if (isPresent) {
            mini = middle;
        } else {
            maxi = middle - 1;
        }
    }
    return mini;
}
int main() {
    string first = "mnokjl";
    string second = "mnokjl";
    cout << "The length of the longest common substring is " << maxSubStr(first, second) << endl;
    return 0;
}

Output

The length of the longest common substring is 6

Time complexity − O(log(min(N,M)) * (O(N*N) + O(M*M)), where O(log(min(N,M)) is for binary search, and O(N*N) + O(M*M) is for calculating the rolling hash of each substring.

Space complexity − O(N + M) to store the hash value for each substring.

We learned to use the rolling hash and binary search technique to find the longest common substring’s length. It is an efficient approach than the recursive approach of finding the maximum length of the common substring. However, the dynamic programming approach is more efficient than the above approach.

Updated on: 17-Jul-2023

140 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements