Repeated DNA Sequences - Problem

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'. For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA. Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

You may return the answer in any order.

Input & Output

Example 1 — Basic Repeated Sequences

$ Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

› Output: ["AAAAACCCCC","CCCCCAAAAA"]

💡 Note: AAAAACCCCC appears at positions 0 and 10. CCCCCAAAAA appears at positions 5 and 16. Both sequences are 10 characters long and occur more than once.

Example 2 — No Repeats

$ Input: s = "AAAAAAAAAA"

› Output: []

💡 Note: The string has exactly 10 characters, so there is only one possible 10-letter sequence "AAAAAAAAAA" starting at position 0. Since it appears only once (not more than once), the result is an empty array.

Example 3 — Short String

$ Input: s = "ACGT"

› Output: []

💡 Note: The string is too short (4 characters) to contain any 10-letter sequence, so the result is an empty array.

Constraints

1 ≤ s.length ≤ 10⁵
s[i] is either 'A', 'C', 'G', or 'T'

Asked in

LinkedIn (8), Amazon (6), Microsoft (4)

The key insight is to use a sliding window to extract all 10-character substrings and track their occurrences with a hash map. The optimal approach uses bit manipulation for O(1) rolling hash updates. Best time complexity: O(n), space: O(n).

Common Approaches

✓ Bit Manipulation with Rolling Hash

⏱️ Time: O(n) Space: O(n)

Encode each DNA nucleotide as 2 bits (A=00, C=01, G=10, T=11), so a 10-character sequence fits in 20 bits. A rolling hash then updates this 20-bit value in O(1) as the window slides, which avoids extracting and comparing substrings.
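To make the 2-bit encoding concrete, here is a minimal sketch that packs one 10-character window into a 20-bit integer. The helper names `encodeBase` and `encodeWindow` are hypothetical and do not appear in the solution on this page; the sketch assumes the input contains only 'A', 'C', 'G', and 'T', as the constraints state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map a nucleotide to its 2-bit code
   (A=00, C=01, G=10, T=11). */
uint32_t encodeBase(char c) {
    switch (c) {
        case 'C': return 1u;
        case 'G': return 2u;
        case 'T': return 3u;
        default:  return 0u; /* 'A'; other characters are assumed not to occur */
    }
}

/* Hypothetical helper: pack the 10 characters starting at s into the
   low 20 bits of the result. The caller must guarantee that at least
   10 characters are readable. */
uint32_t encodeWindow(const char *s) {
    assert(s != NULL);
    uint32_t value = 0;
    for (size_t i = 0; i < 10; i++) {
        value = (value << 2) | encodeBase(s[i]);
    }
    return value;
}
```

For instance, "AAAAACCCCC" packs to ten zero bits followed by 01 01 01 01 01, which is 341 in decimal. Since every 10-letter sequence maps to a distinct value below 4^10 = 1,048,576, the packed value identifies the sequence exactly rather than being a lossy hash.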

Brute Force - Compare All Substrings

⏱️ Time: O(n²) Space: O(1)

Generate all possible 10-character substrings from the DNA sequence. For each substring, check if it appears elsewhere in the string by comparing with all other substrings.
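A minimal sketch of the brute-force idea, assuming the caller supplies an output array with room for at least `strlen(s)` entries. The function name `findRepeatedBrute` is hypothetical. It records the start index of the first occurrence of each repeated window instead of copying strings, which keeps the extra space constant apart from the output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW 10

/* Hypothetical helper: writes the start index of the first occurrence of
   every 10-letter window that occurs more than once into `starts` and
   returns how many were written. Runs in O(n^2) window comparisons. */
size_t findRepeatedBrute(const char *s, size_t *starts) {
    assert(s != NULL && starts != NULL);
    size_t n = strlen(s);
    if (n <= WINDOW) return 0; /* zero or one window: nothing can repeat */

    size_t windows = n - WINDOW + 1;
    size_t found = 0;
    for (size_t i = 0; i < windows; i++) {
        /* Find the first other window equal to the one starting at i. */
        size_t j = 0;
        while (j < windows && (j == i || memcmp(s + i, s + j, WINDOW) != 0)) {
            j++;
        }
        /* Report only when the first match lies after i, so each
           repeated sequence is reported once, at its first occurrence. */
        if (j < windows && j > i) {
            starts[found++] = i;
        }
    }
    return found;
}
```

On the string from Example 1 this sketch reports start indices 0 and 5, which correspond to "AAAAACCCCC" and "CCCCCAAAAA".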

Sliding Window with Hash Map

⏱️ Time: O(n) Space: O(n)

Slide a window of size 10 across the DNA string. For each position, extract the 10-character substring and use a hash map to track how many times each sequence appears. Return sequences that appear more than once.
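C has no built-in hash map, so the sketch below assumes a small open-addressing table with linear probing, keyed by the 10 characters of each window. The names `Slot`, `hashWindow`, and `findRepeatedHashMap` are hypothetical, and the FNV-1a hash is only one reasonable choice. As an assumption of this sketch, the caller supplies `starts` with room for at least `strlen(s)` entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW 10

/* One table slot: count == 0 marks an empty slot. */
typedef struct {
    size_t start; /* start index of the first occurrence of this window */
    int count;    /* how many times the window has been seen */
} Slot;

/* 32-bit FNV-1a over the 10 characters of a window. */
uint32_t hashWindow(const char *p) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < WINDOW; i++) {
        h ^= (unsigned char)p[i];
        h *= 16777619u;
    }
    return h;
}

/* Writes the first-occurrence start index of each repeated window into
   `starts`, in the order the repeats are detected, and returns the count.
   Returns 0 if the table cannot be allocated. */
size_t findRepeatedHashMap(const char *s, size_t *starts) {
    assert(s != NULL && starts != NULL);
    size_t n = strlen(s);
    if (n <= WINDOW) return 0;

    size_t windows = n - WINDOW + 1;
    size_t cap = 16; /* power of two, kept at most half full */
    while (cap < 2 * windows) cap *= 2;

    Slot *table = calloc(cap, sizeof(Slot));
    if (table == NULL) return 0;

    size_t found = 0;
    for (size_t i = 0; i < windows; i++) {
        size_t idx = hashWindow(s + i) & (cap - 1);
        /* Linear probing: skip slots that hold a different window. */
        while (table[idx].count != 0 &&
               memcmp(s + table[idx].start, s + i, WINDOW) != 0) {
            idx = (idx + 1) & (cap - 1);
        }
        if (table[idx].count == 0) table[idx].start = i;
        if (++table[idx].count == 2) starts[found++] = table[idx].start;
    }
    free(table);
    return found;
}
```

Each window is hashed and compared in time proportional to its fixed length of 10, so the pass is expected O(n) overall when the hash spreads the keys reasonably well.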

Bit Manipulation with Rolling Hash — Algorithm Steps

Encode DNA characters to 2-bit values
Build initial 20-bit hash for first 10 characters
Use rolling hash to update hash value as window slides
Track hash occurrences and decode repeated sequences
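The rolling update in the third step can be sketched as a single function. The names `HASH_MASK`, `baseCode`, and `rollHash` are hypothetical; the arithmetic mirrors the shift, mask, and OR described above and assumes only valid nucleotides are passed in.

```c
#include <assert.h>
#include <stdint.h>

/* Keeps only the low 20 bits, i.e. the last 10 nucleotides. */
#define HASH_MASK ((1u << 20) - 1u)

/* Hypothetical helper: 2-bit code of a nucleotide (A=00, C=01, G=10, T=11). */
uint32_t baseCode(char c) {
    assert(c == 'A' || c == 'C' || c == 'G' || c == 'T');
    switch (c) {
        case 'C': return 1u;
        case 'G': return 2u;
        case 'T': return 3u;
        default:  return 0u; /* 'A' */
    }
}

/* Slide the window one position to the right: shifting left by 2 makes
   room for the incoming nucleotide, and the mask discards the 2 bits of
   the nucleotide that just left the window. */
uint32_t rollHash(uint32_t hash, char incoming) {
    return ((hash << 2) & HASH_MASK) | baseCode(incoming);
}
```

Starting from 0 and feeding the first 10 characters through `rollHash` also builds the initial hash, so the first window needs no special case. As a worked example, the window "TAAAAAAAAA" has hash 3 << 18; rolling in 'C' pushes the T out past bit 19 and leaves 1, the hash of "AAAAAAAAAC".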

Step-by-Step Walkthrough

Encode DNA

A=00, C=01, G=10, T=11 (2 bits each)

Rolling Hash

Slide window updating 20-bit hash value

Track & Decode

Count hash occurrences, decode repeated ones
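Decoding reverses the encoding by reading 2 bits at a time from the low end of the hash. The sketch below uses a hypothetical helper named `decodeWindow` and assumes the caller provides an 11-byte buffer (10 letters plus the terminator).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: turn a 20-bit hash back into its 10-letter
   sequence. The last nucleotide sits in the lowest 2 bits, so the
   buffer is filled from right to left. */
void decodeWindow(uint32_t hash, char out[11]) {
    assert(hash < (1u << 20));
    for (int k = 9; k >= 0; k--) {
        out[k] = "ACGT"[hash & 3u];
        hash >>= 2;
    }
    out[10] = '\0';
}
```

Because decoding is only needed when a hash is seen for the second time, the string work stays proportional to the number of reported sequences rather than to the number of windows.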

Code

solution.c — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SEQ_LEN 10
#define TABLE_SIZE (1 << 20)  /* 4^10 possible 10-letter sequences */

/* Map a nucleotide to its 2-bit code: A=00, C=01, G=10, T=11. */
int charToNum(char c) {
    switch(c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    return 0;  /* not reached for valid input */
}

/* Map a 2-bit code back to its nucleotide. */
char numToChar(int n) {
    return "ACGT"[n];
}

char** solution(char* s, int* returnSize) {
    int n = (int)strlen(s);
    *returnSize = 0;
    if (n < SEQ_LEN) {
        return NULL;
    }

    /* seen[h] counts occurrences of hash h, saturating at 2.
       Indexing directly by the 20-bit hash keeps each lookup O(1). */
    unsigned char* seen = calloc(TABLE_SIZE, 1);
    /* There are n - 9 windows, so there can be at most that many results. */
    char** result = malloc((size_t)(n - SEQ_LEN + 1) * sizeof(char*));
    if (seen == NULL || result == NULL) {
        free(seen);
        free(result);
        return NULL;
    }

    /* Build the 20-bit hash of the first window. */
    int hashVal = 0;
    for (int i = 0; i < SEQ_LEN; i++) {
        hashVal = (hashVal << 2) | charToNum(s[i]);
    }
    seen[hashVal] = 1;

    int mask = TABLE_SIZE - 1;
    for (int i = SEQ_LEN; i < n; i++) {
        /* Drop the oldest nucleotide and append the new one. */
        hashVal = ((hashVal << 2) & mask) | charToNum(s[i]);

        if (seen[hashVal] == 0) {
            seen[hashVal] = 1;
        } else if (seen[hashVal] == 1) {
            /* Second occurrence: decode the hash and report it once. */
            seen[hashVal] = 2;
            char* decoded = malloc(SEQ_LEN + 1);
            if (decoded == NULL) {
                break;
            }
            int temp = hashVal;
            for (int k = SEQ_LEN - 1; k >= 0; k--) {
                decoded[k] = numToChar(temp & 3);
                temp >>= 2;
            }
            decoded[SEQ_LEN] = '\0';
            result[(*returnSize)++] = decoded;
        }
    }

    free(seen);
    return result;
}

int main() {
    /* Room for up to 10^5 characters plus the newline and the terminator. */
    char s[100002];
    if (fgets(s, sizeof(s), stdin) == NULL) {
        s[0] = '\0';
    }
    s[strcspn(s, "\n")] = 0;
    
    int returnSize;
    char** result = solution(s, &returnSize);
    
    printf("[");
    for (int i = 0; i < returnSize; i++) {
        printf("\"%s\"", result[i]);
        if (i < returnSize - 1) printf(",");
        free(result[i]);
    }
    printf("]\n");
    
    free(result);
    return 0;
}

Time & Space Complexity

Time Complexity

⏱️

O(n)

Single pass with O(1) rolling hash updates per position

✓ Linear Growth

Space Complexity

O(n)

At most n-9 distinct hashes are tracked (and never more than 4^10, since each hash is 20 bits), plus the result strings

⚡ Linear Space
