Finding the Frequency of All Substrings in a String Using Python

String manipulation and analysis are fundamental tasks in programming. One common problem in this area is finding the frequency of every substring of a given string. This article walks through several ways to do that in Python.

Substring frequency is a useful metric because it reveals patterns and repetitions in the structure of a string. Knowing how many times each substring appears tells us which fragments recur and how often.

Given a string, our objective is to find the frequency of all possible substrings within it. For instance, given the string "banana", we want to determine how many times each substring (including single characters) appears in the string: "a" appears three times and "ana" appears twice, because overlapping occurrences are counted separately.
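
Because substrings are identified by their start and end positions, occurrences that overlap each count. The sketch below illustrates how this differs from Python's built-in str.count, which counts only non-overlapping matches. The helper count_overlapping is a hypothetical name introduced here purely for illustration.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Step forward by one character so overlapping matches are found
        start = text.find(pattern, start + 1)
    return count

print("banana".count("ana"))               # 1 (non-overlapping matches only)
print(count_overlapping("banana", "ana"))  # 2 (matches at index 1 and index 3)
```

The approaches in this article all produce the overlapping count, so "ana" has a frequency of 2 in "banana".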

Naive Approach

Let's start by discussing the naive approach to finding substring frequencies. This approach involves generating all possible substrings and counting their occurrences:

def find_substring_frequencies_naive(string):
    substr_freq = {}
    n = len(string)
    
    # Generate all possible substrings
    for i in range(n):
        for j in range(i, n):
            substring = string[i:j + 1]
            # Count the occurrences of each substring
            if substring in substr_freq:
                substr_freq[substring] += 1
            else:
                substr_freq[substring] = 1
    
    return substr_freq

# Test with example
string = "banana"
naive_frequencies = find_substring_frequencies_naive(string)
print(naive_frequencies)

Output:

{'b': 1, 'ba': 1, 'ban': 1, 'bana': 1, 'banan': 1, 'banana': 1, 'a': 3, 'an': 2, 'ana': 2, 'anan': 1, 'anana': 1, 'n': 2, 'na': 2, 'nan': 1, 'nana': 1}

The naive approach finds every substring and counts it correctly. However, a string of length n has n(n + 1)/2 substrings, and slicing and hashing each one takes time proportional to its length, so the overall time complexity is O(n³), where n is the length of the input string.
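
To make the cubic growth concrete, the following sketch tallies how many substrings the nested loops produce and how many characters are copied in total. The function name substring_work is an assumption made for this illustration. For "banana" (n = 6) it reports 21 substrings and 56 characters.

```python
def substring_work(n):
    """Return (number of substrings, total characters sliced) for length n."""
    count = 0
    total_chars = 0
    for i in range(n):
        for j in range(i + 1, n + 1):
            count += 1
            total_chars += j - i  # length of string[i:j]
    return count, total_chars

for n in (6, 60, 600):
    count, total_chars = substring_work(n)
    # Closed forms: n(n+1)/2 substrings and n(n+1)(n+2)/6 characters
    assert count == n * (n + 1) // 2
    assert total_chars == n * (n + 1) * (n + 2) // 6
    print(n, count, total_chars)
```

The number of substrings grows quadratically, while the total number of characters handled grows cubically, which is where the O(n³) bound comes from.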

Using collections.Counter

A more Pythonic approach collects the substrings in a list and lets the Counter class from the collections module do the counting:

from collections import Counter

def find_substring_frequencies_counter(string):
    substrings = []
    n = len(string)
    
    # Generate all possible substrings
    for i in range(n):
        for j in range(i + 1, n + 1):
            substrings.append(string[i:j])
    
    return Counter(substrings)

# Test with example
string = "banana"
counter_frequencies = find_substring_frequencies_counter(string)
print(dict(counter_frequencies))

Output:

{'b': 1, 'ba': 1, 'ban': 1, 'bana': 1, 'banan': 1, 'banana': 1, 'a': 3, 'an': 2, 'ana': 2, 'anan': 1, 'anana': 1, 'n': 2, 'na': 2, 'nan': 1, 'nana': 1}
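
As a possible variation, the intermediate list can be skipped by passing a generator expression directly to Counter. The sketch below assumes a hypothetical function name, substring_counter, and also shows two conveniences a Counter offers over a plain dictionary: most_common and a default of 0 for missing keys.

```python
from collections import Counter

def substring_counter(string):
    """Count every substring by feeding a generator straight into Counter."""
    n = len(string)
    return Counter(
        string[i:j]
        for i in range(n)
        for j in range(i + 1, n + 1)
    )

counts = substring_counter("banana")
print(counts.most_common(1))  # [('a', 3)]
print(counts["ana"])          # 2
print(counts["xyz"])          # 0, because Counter returns 0 for missing keys
```

This avoids holding a separate list of all substrings in memory, although the Counter itself still stores every distinct substring.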

Using defaultdict for Optimization

We can streamline the code with defaultdict, which supplies a default count of 0 for missing keys and so removes the explicit membership check:

from collections import defaultdict

def find_substring_frequencies_optimized(string):
    substr_freq = defaultdict(int)
    n = len(string)
    
    for i in range(n):
        for j in range(i + 1, n + 1):
            substring = string[i:j]
            substr_freq[substring] += 1
    
    return dict(substr_freq)

# Test with example
string = "banana"
optimized_frequencies = find_substring_frequencies_optimized(string)
print(optimized_frequencies)

Output:

{'b': 1, 'ba': 1, 'ban': 1, 'bana': 1, 'banan': 1, 'banana': 1, 'a': 3, 'an': 2, 'ana': 2, 'anan': 1, 'anana': 1, 'n': 2, 'na': 2, 'nan': 1, 'nana': 1}
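
Since the three approaches are meant to be interchangeable, a quick consistency check can be reassuring. The sketch below re-implements each variant compactly under hypothetical names (with_dict, with_counter, with_defaultdict) and asserts that they agree on a few sample strings, including an empty one. The plain-dictionary version uses dict.get here, which is an alternative to the explicit membership test.

```python
from collections import Counter, defaultdict

def with_dict(s):
    """Plain dictionary, using dict.get to supply a default of 0."""
    freq = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            freq[s[i:j]] = freq.get(s[i:j], 0) + 1
    return freq

def with_counter(s):
    """Counter fed by a generator, converted to a plain dict."""
    n = len(s)
    return dict(Counter(s[i:j] for i in range(n) for j in range(i + 1, n + 1)))

def with_defaultdict(s):
    """defaultdict(int), converted to a plain dict."""
    freq = defaultdict(int)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            freq[s[i:j]] += 1
    return dict(freq)

# Dictionary equality ignores insertion order, so this compares counts only
for sample in ("banana", "aaaa", "abc", ""):
    assert with_dict(sample) == with_counter(sample) == with_defaultdict(sample)

print(with_defaultdict("aaaa"))  # {'a': 4, 'aa': 3, 'aaa': 2, 'aaaa': 1}
```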

Performance Comparison

Let's compare the execution times of different approaches using the timeit module:

import timeit
from collections import defaultdict, Counter

def find_substring_frequencies_naive(string):
    substr_freq = {}
    n = len(string)
    for i in range(n):
        for j in range(i, n):
            substring = string[i:j + 1]
            if substring in substr_freq:
                substr_freq[substring] += 1
            else:
                substr_freq[substring] = 1
    return substr_freq

def find_substring_frequencies_optimized(string):
    substr_freq = defaultdict(int)
    n = len(string)
    for i in range(n):
        for j in range(i + 1, n + 1):
            substring = string[i:j]
            substr_freq[substring] += 1
    return dict(substr_freq)

string = "banana"

# Compare execution times
naive_time = timeit.timeit(lambda: find_substring_frequencies_naive(string), number=1000)
optimized_time = timeit.timeit(lambda: find_substring_frequencies_optimized(string), number=1000)

print(f"Naive Approach Time: {naive_time:.6f}")
print(f"Optimized Approach Time: {optimized_time:.6f}")
print(f"Improvement: {naive_time/optimized_time:.2f}x faster")

Output (timings vary by machine):

Naive Approach Time: 0.062341
Optimized Approach Time: 0.048192
Improvement: 1.29x faster
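
A six-character string is too short for timings to say much, so it may help to repeat the measurement on longer inputs. The following is a minimal sketch, assuming hypothetical helper names (count_naive, count_defaultdict, best_time) and arbitrary test strings built by repeating "ab". It uses timeit.repeat and keeps the fastest run to reduce noise. No expected output is shown because the numbers depend on the machine and Python version.

```python
import timeit
from collections import defaultdict

def count_naive(s):
    """Plain dictionary with an explicit membership check."""
    freq = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            sub = s[i:j]
            if sub in freq:
                freq[sub] += 1
            else:
                freq[sub] = 1
    return freq

def count_defaultdict(s):
    """defaultdict(int), no membership check needed."""
    freq = defaultdict(int)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            freq[s[i:j]] += 1
    return dict(freq)

def best_time(func, s, repeat=3, number=5):
    """Return the fastest of several timing runs, in seconds per call."""
    runs = timeit.repeat(lambda: func(s), repeat=repeat, number=number)
    return min(runs) / number

for length in (6, 50, 200):
    s = "ab" * (length // 2)  # arbitrary sample input of the given length
    t_naive = best_time(count_naive, s)
    t_default = best_time(count_defaultdict, s)
    print(f"n={length}: naive {t_naive:.6f}s, defaultdict {t_default:.6f}s")
```

Whatever the absolute numbers turn out to be, both functions should slow down at a similar rate as n grows, since they share the same O(n³) behavior.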

Comparison Table

Method           | Time Complexity | Space Complexity | Best For
Naive Dictionary | O(n³)           | O(n²)            | Small strings
Counter          | O(n³)           | O(n²)            | Readable code
defaultdict      | O(n³)           | O(n²)            | Performance optimization

Conclusion

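All three approaches have O(n³) time complexity because each one generates and hashes every substring. In the measurement above, the defaultdict version was modestly faster than the naive dictionary because it avoids the explicit key check, but this is a constant-factor gain rather than an algorithmic one. Use Counter for readability or defaultdict for a small speedup when finding all substring frequencies in Python.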

Updated on: 2026-03-27T12:25:34+05:30
