Repeated DNA Sequences - Problem
DNA Sequence Analysis Challenge
In molecular biology, DNA sequences are fundamental building blocks of life, composed of four nucleotides:
Your task is to analyze a DNA string and find all 10-letter sequences that appear more than once in the molecule. For example, in the sequence
Goal: Return all 10-character substrings that occur multiple times in the given DNA sequence. The order of results doesn't matter.
In molecular biology, DNA sequences are fundamental building blocks of life, composed of four nucleotides:
'A' (Adenine), 'C' (Cytosine), 'G' (Guanine), and 'T' (Thymine). Scientists often need to identify repetitive patterns in DNA sequences to understand genetic mutations, evolutionary relationships, and disease markers.Your task is to analyze a DNA string and find all 10-letter sequences that appear more than once in the molecule. For example, in the sequence
"ACGAATTCCGACGAATTCCG", the substring "ACGAATTCCG" appears twice.Goal: Return all 10-character substrings that occur multiple times in the given DNA sequence. The order of results doesn't matter.
Input & Output
example_1.py โ Basic Case
$
Input:
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
โบ
Output:
["AAAAACCCCC", "CCCCCAAAAA"]
๐ก Note:
The 10-letter sequence "AAAAACCCCC" appears at positions 0 and 10. The sequence "CCCCCAAAAA" appears at positions 6 and 16. These are the only two sequences that repeat.
example_2.py โ No Repeats
$
Input:
s = "AAAAAAAAAA"
โบ
Output:
["AAAAAAAAAA"]
๐ก Note:
The only possible 10-letter sequence is "AAAAAAAAAA" which appears once at position 0, but since the string has exactly 10 characters, this sequence actually appears only once. Wait - let me recalculate: this sequence appears at position 0 only, so it should return empty. Actually, "AAAAAAAAAA" appears exactly once, so no repeats found.
example_3.py โ Short String
$
Input:
s = "AAAAAAAAAA"
โบ
Output:
[]
๐ก Note:
The string has exactly 10 characters, so there's only one possible 10-letter sequence "AAAAAAAAAA" starting at position 0. Since it appears only once, no repeated sequences are found.
Visualization
Tap to expand
Understanding the Visualization
1
Initialize Scanner
Set up your 10-character window at the beginning of the DNA sequence
2
Slide and Record
Move the window one position at a time, recording each 10-letter sequence in your notebook
3
Detect Duplicates
When you see a sequence for the second time, immediately flag it as repeated
4
Continue Scanning
Keep sliding until you've covered the entire DNA sequence
Key Takeaway
๐ฏ Key Insight: Hash tables enable constant-time lookups, allowing us to detect repeated DNA sequences in a single pass through the data, making this approach optimal for genomic analysis.
Time & Space Complexity
Time Complexity
O(n)
We make one pass through the string, and each hash table operation takes O(1) time on average
โ Linear Growth
Space Complexity
O(n)
In worst case, we might store all unique 10-character substrings in the hash table
โก Linearithmic Space
Constraints
- 1 โค s.length โค 105
- s[i] is either 'A', 'C', 'G', or 'T'
- Only valid DNA nucleotides are present in the input
๐ก
Explanation
AI Ready
๐ก Suggestion
Tab
to accept
Esc
to dismiss
// Output will appear here after running code