DNA Pattern Recognition - Problem
🧬 Dive into the fascinating world of bioinformatics! As a computational biologist, you're analyzing DNA sequences to identify important genetic patterns that could reveal insights about different species.
You have a database table Samples containing DNA sequences from various species. Your task is to scan each sequence and detect four crucial biological patterns:
- Start Codon: Sequences beginning with ATG (signals protein synthesis start)
- Stop Codons: Sequences ending with TAA, TAG, or TGA (signals protein synthesis end)
- ATAT Motif: Sequences containing the repeating pattern ATAT
- Triple G Pattern: Sequences with at least 3 consecutive G's (like GGG or GGGG)
For each DNA sample, return boolean flags indicating which patterns are present, ordered by sample_id.
Input: Table with sample_id, dna_sequence (A,T,G,C characters), and species
Output: All columns plus four pattern detection flags: has_start, has_stop, has_atat, has_ggg
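As a rough illustration of the four checks, here is a minimal Python sketch using plain string methods. The names `detect_patterns`, `analyze_samples`, and `PatternFlags` are hypothetical helpers introduced for this example, and the flags are returned as 1/0 integers to match the example outputs; it is not the reference solution.

```python
from typing import NamedTuple

# The three stop codons named in the problem statement.
STOP_CODONS = ("TAA", "TAG", "TGA")


class PatternFlags(NamedTuple):
    has_start: int
    has_stop: int
    has_atat: int
    has_ggg: int


def detect_patterns(dna_sequence: str) -> PatternFlags:
    """Return 1/0 flags for the four patterns in one sequence."""
    return PatternFlags(
        has_start=int(dna_sequence.startswith("ATG")),
        # str.endswith accepts a tuple and matches any of its members.
        has_stop=int(dna_sequence.endswith(STOP_CODONS)),
        has_atat=int("ATAT" in dna_sequence),
        # Any run of 3 or more G's necessarily contains "GGG".
        has_ggg=int("GGG" in dna_sequence),
    )


def analyze_samples(samples):
    """Append the flags to each (sample_id, dna_sequence, species) row,
    ordered by sample_id."""
    return [
        (sample_id, seq, species, *detect_patterns(seq))
        for sample_id, seq, species in sorted(samples, key=lambda row: row[0])
    ]
```

Checking for the substring "GGG" is enough for the "at least 3 consecutive G's" rule, since longer runs such as GGGG always contain it.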
Input & Output
example_1.py - Basic Pattern Detection
Input:
Samples: [(1, 'ATGCTAGCTAGCTAA', 'Human'), (2, 'GGGTCAATCATC', 'Human')]
Output:
[(1, 'ATGCTAGCTAGCTAA', 'Human', 1, 1, 0, 0), (2, 'GGGTCAATCATC', 'Human', 0, 0, 0, 1)]
💡 Note:
Sample 1 has start codon ATG and stop codon TAA. Sample 2 has triple G pattern GGG but no other patterns.
example_2.py - ATAT Motif Detection
Input:
Samples: [(3, 'ATATATCGTAGCTA', 'Human'), (6, 'ATATCGCGCTAG', 'Zebrafish')]
Output:
[(3, 'ATATATCGTAGCTA', 'Human', 0, 0, 1, 0), (6, 'ATATCGCGCTAG', 'Zebrafish', 0, 1, 1, 0)]
💡 Note:
Both sequences contain ATAT motif. Sample 6 also ends with TAG stop codon.
example_3.py - Complex Pattern Combination
Input:
Samples: [(4, 'ATGGGGTCATCATAA', 'Mouse')]
Output:
[(4, 'ATGGGGTCATCATAA', 'Mouse', 1, 1, 0, 1)]
💡 Note:
This sequence demonstrates multiple patterns: starts with ATG, contains GGGG (4 consecutive Gs), and ends with TAA.
Constraints
- 1 ≤ number of samples ≤ 10^3
- 1 ≤ length of dna_sequence ≤ 10^4
- DNA sequences contain only characters A, T, G, C
- sample_id values are unique integers
- species names are non-empty strings
Visualization
Understanding the Visualization
1. Load DNA Sequence: Read the genetic code string from the database
2. Apply Pattern Matchers: Use SQL functions to check for start/stop codons and motifs
3. Generate Boolean Flags: Convert pattern matches into 1/0 flags for each pattern type
4. Return Sorted Results: Order by sample_id and return comprehensive pattern analysis
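The four steps can be sketched end to end with Python's standard `sqlite3` module and an in-memory database. This is an illustration under assumptions, not the expected answer: the helper `run_pattern_query` and the column types are made up for the example, and SQLite is used only because it ships with Python. SQLite has no LEFT or RIGHT functions, so the sketch uses LIKE and `substr` with a negative offset instead. SQLite's LIKE is also case-insensitive for ASCII by default, which is harmless here because sequences are uppercase.

```python
import sqlite3

# Steps 2-4 in one statement: each comparison evaluates to 1 or 0 in SQLite.
QUERY = """
SELECT sample_id,
       dna_sequence,
       species,
       dna_sequence LIKE 'ATG%'                          AS has_start,
       substr(dna_sequence, -3) IN ('TAA', 'TAG', 'TGA') AS has_stop,
       dna_sequence LIKE '%ATAT%'                        AS has_atat,
       dna_sequence LIKE '%GGG%'                         AS has_ggg
FROM Samples
ORDER BY sample_id
"""


def run_pattern_query(rows):
    """Load (sample_id, dna_sequence, species) rows into an in-memory
    Samples table and return the flagged, sorted result."""
    conn = sqlite3.connect(":memory:")
    try:
        # Step 1: load the sequences (column types are an assumption).
        conn.execute(
            "CREATE TABLE Samples ("
            "sample_id INTEGER PRIMARY KEY, dna_sequence TEXT, species TEXT)"
        )
        conn.executemany("INSERT INTO Samples VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

In a dialect that does provide them, such as MySQL, `LEFT(dna_sequence, 3)` and `RIGHT(dna_sequence, 3)` could express the start and stop checks more directly.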
Key Takeaway
🎯 Key Insight: SQL's built-in pattern matching functions (LIKE, REGEXP, LEFT, RIGHT; availability varies by SQL dialect) can detect all four biological patterns in a single query, with one expression per flag, so no procedural loop over the sequences is needed.
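To make the REGEXP option concrete, the four patterns can be written as regular expressions. The sketch below uses Python's `re` module as a stand-in for a SQL REGEXP operator. The name `regex_flags` and the exact expressions are assumptions chosen for illustration, and regex syntax differs slightly between database engines.

```python
import re

# One expression per flag; sequences are assumed to contain only A, T, G, C.
PATTERNS = {
    "has_start": re.compile(r"^ATG"),           # begins with the start codon
    "has_stop": re.compile(r"(TAA|TAG|TGA)$"),  # ends with any stop codon
    "has_atat": re.compile(r"ATAT"),            # contains the ATAT motif
    "has_ggg": re.compile(r"G{3,}"),            # 3 or more consecutive G's
}


def regex_flags(dna_sequence):
    """Return a dict mapping each flag name to 1 (match) or 0 (no match)."""
    return {
        name: int(pattern.search(dna_sequence) is not None)
        for name, pattern in PATTERNS.items()
    }
```

The anchors `^` and `$` take the place of LEFT and RIGHT, while the unanchored patterns behave like LIKE with wildcards on both sides.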