DNA Sequence Analysis Challenge

In molecular biology, DNA sequences are fundamental building blocks of life, composed of four nucleotides: 'A' (Adenine), 'C' (Cytosine), 'G' (Guanine), and 'T' (Thymine). Scientists often need to identify repetitive patterns in DNA sequences to understand genetic mutations, evolutionary relationships, and disease markers.

Your task is to analyze a DNA string and find all 10-letter sequences that appear more than once in the molecule. For example, in the sequence "ACGAATTCCGACGAATTCCG", the substring "ACGAATTCCG" appears twice.

Goal: Return all 10-character substrings that occur multiple times in the given DNA sequence. The order of results doesn't matter.

Input & Output

example_1.py โ€” Basic Case
$ Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
โ€บ Output: ["AAAAACCCCC", "CCCCCAAAAA"]
๐Ÿ’ก Note: The 10-letter sequence "AAAAACCCCC" appears at positions 0 and 10. The sequence "CCCCCAAAAA" appears at positions 6 and 16. These are the only two sequences that repeat.
example_2.py โ€” No Repeats
$ Input: s = "AAAAAAAAAA"
โ€บ Output: ["AAAAAAAAAA"]
๐Ÿ’ก Note: The only possible 10-letter sequence is "AAAAAAAAAA" which appears once at position 0, but since the string has exactly 10 characters, this sequence actually appears only once. Wait - let me recalculate: this sequence appears at position 0 only, so it should return empty. Actually, "AAAAAAAAAA" appears exactly once, so no repeats found.
example_3.py โ€” Short String
$ Input: s = "AAAAAAAAAA"
โ€บ Output: []
๐Ÿ’ก Note: The string has exactly 10 characters, so there's only one possible 10-letter sequence "AAAAAAAAAA" starting at position 0. Since it appears only once, no repeated sequences are found.

Visualization

Tap to expand
๐Ÿงฌ DNA Repeated Sequence DetectionDNA Sequence Analysis Process:Step 1: Sliding Window ApproachAAAAACCCCCAAAAACCCCCCAAAAAGGGTTT10-char windowHash Table (Real-time Tracking)Position 0: AAAAACCCCC โ†’ count: 1Position 1: AAAACCCCCA โ†’ count: 1...Position 10: AAAAACCCCC โ†’ count: 2 ๐ŸŽฏREPEATED!Results Foundโœ… AAAAACCCCC (positions 0, 10)โœ… CCCCCAAAAA (positions 6, 16)Total: 2 repeated sequencesโšก Algorithm Efficiencyโ€ข Time Complexity: O(n) - Single pass through sequenceโ€ข Space Complexity: O(n) - Hash table storageโ€ข Memory Usage: Stores unique 10-char sequences only๐Ÿ”ฌ Applications:Genetic research, mutation detection, DNA fingerprinting, evolutionary analysis
Understanding the Visualization
1
Initialize Scanner
Set up your 10-character window at the beginning of the DNA sequence
2
Slide and Record
Move the window one position at a time, recording each 10-letter sequence in your notebook
3
Detect Duplicates
When you see a sequence for the second time, immediately flag it as repeated
4
Continue Scanning
Keep sliding until you've covered the entire DNA sequence
Key Takeaway
๐ŸŽฏ Key Insight: Hash tables enable constant-time lookups, allowing us to detect repeated DNA sequences in a single pass through the data, making this approach optimal for genomic analysis.

Time & Space Complexity

Time Complexity
โฑ๏ธ
O(n)

We make one pass through the string, and each hash table operation takes O(1) time on average

n
2n
โœ“ Linear Growth
Space Complexity
O(n)

In worst case, we might store all unique 10-character substrings in the hash table

n
2n
โšก Linearithmic Space

Constraints

  • 1 โ‰ค s.length โ‰ค 105
  • s[i] is either 'A', 'C', 'G', or 'T'
  • Only valid DNA nucleotides are present in the input
Asked in
Amazon 45 Google 32 Microsoft 28 Meta 22
87.6K Views
Medium-High Frequency
~15 min Avg. Time
3.2K Likes
Ln 1, Col 1
Smart Actions
๐Ÿ’ก Explanation
AI Ready
๐Ÿ’ก Suggestion Tab to accept Esc to dismiss
// Output will appear here after running code
Code Editor Closed
Click the red button to reopen