DNA Pattern Recognition - Problem

🧬 Dive into the fascinating world of bioinformatics! As a computational biologist, you're analyzing DNA sequences to identify important genetic patterns that could reveal insights about different species.

You have a database table Samples containing DNA sequences from various species. Your task is to scan each sequence and detect four crucial biological patterns:

  • Start Codon: Sequences beginning with ATG (signals protein synthesis start)
  • Stop Codons: Sequences ending with TAA, TAG, or TGA (each signals protein synthesis end)
  • ATAT Motif: Sequences containing the repeating pattern ATAT
  • Triple G Pattern: Sequences with at least 3 consecutive G's (like GGG or GGGG)

For each DNA sample, return one flag per pattern (1 if the pattern is present, 0 if it is not), with rows ordered by sample_id.

Input: Table with sample_id, dna_sequence (A,T,G,C characters), and species
Output: All columns plus four pattern detection flags: has_start, has_stop, has_atat, has_ggg
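
The four rules above can be sketched in a few lines of plain Python before turning to SQL. This is only an illustration of the matching logic, not the expected submission; the function name detect_patterns is hypothetical, and the sketch assumes sequences are uppercase, as the input description states.

```python
def detect_patterns(dna_sequence: str) -> tuple:
    """Return (has_start, has_stop, has_atat, has_ggg) as 1/0 flags.

    Hypothetical helper for illustration; assumes an uppercase A/T/G/C string.
    """
    # Start codon: the sequence must begin with ATG.
    has_start = dna_sequence.startswith("ATG")
    # Stop codon: str.endswith accepts a tuple of alternatives.
    has_stop = dna_sequence.endswith(("TAA", "TAG", "TGA"))
    # ATAT motif: a plain substring test is enough.
    has_atat = "ATAT" in dna_sequence
    # Any run of three or more G's necessarily contains the substring GGG.
    has_ggg = "GGG" in dna_sequence
    return int(has_start), int(has_stop), int(has_atat), int(has_ggg)


print(detect_patterns("ATGGGGTCATCATAA"))  # (1, 1, 0, 1)
```

Note that "at least 3 consecutive G's" reduces to a substring check for GGG, so no run-length counting is needed.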

Input & Output

example_1.py - Basic Pattern Detection
$ Input: Samples: [(1, 'ATGCTAGCTAGCTAA', 'Human'), (2, 'GGGTCAATCATC', 'Human')]
› Output: [(1, 'ATGCTAGCTAGCTAA', 'Human', 1, 1, 0, 0), (2, 'GGGTCAATCATC', 'Human', 0, 0, 0, 1)]
💡 Note: Sample 1 has start codon ATG and stop codon TAA. Sample 2 has triple G pattern GGG but no other patterns.

example_2.py - ATAT Motif Detection
$ Input: Samples: [(3, 'ATATATCGTAGCTA', 'Human'), (6, 'ATATCGCGCTAG', 'Zebrafish')]
› Output: [(3, 'ATATATCGTAGCTA', 'Human', 0, 0, 1, 0), (6, 'ATATCGCGCTAG', 'Zebrafish', 0, 1, 1, 0)]
💡 Note: Both sequences contain the ATAT motif. Sample 6 also ends with the TAG stop codon.

example_3.py - Complex Pattern Combination
$ Input: Samples: [(4, 'ATGGGGTCATCATAA', 'Mouse')]
› Output: [(4, 'ATGGGGTCATCATAA', 'Mouse', 1, 1, 0, 1)]
💡 Note: This sequence demonstrates multiple patterns: it starts with ATG, contains GGGG (4 consecutive G's), and ends with TAA.

Constraints

  • 1 ≤ number of samples ≤ 10^3
  • 1 ≤ length of dna_sequence ≤ 10^4
  • DNA sequences contain only characters A, T, G, C
  • sample_id values are unique integers
  • species names are non-empty strings

Visualization

🧬 DNA Sequence Analysis: ATGGGGTCATCATAA (Sample ID: 4 | Species: Mouse)

  • Start Codon (ATG): ✓ Found, flag 1
  • Stop Codon (TAA): ✓ Found, flag 1
  • ATAT Motif (ATAT): ✗ Not Found, flag 0
  • Triple G (GGG+): ✓ Found, flag 1

Result: (4, 'ATGGGGTCATCATAA', 'Mouse', 1, 1, 0, 1)
All four flags are computed in a single SQL query.

Understanding the Visualization
  1. Load DNA Sequence: Read the genetic code string from the database
  2. Apply Pattern Matchers: Use SQL functions to check for start/stop codons and motifs
  3. Generate Boolean Flags: Convert pattern matches into 1/0 flags for each pattern type
  4. Return Sorted Results: Order by sample_id and return comprehensive pattern analysis
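
The four steps above can be sketched end to end in Python with regular expressions standing in for the SQL pattern matchers. The function name analyze_samples and the in-memory list of tuples are assumptions made for illustration; the actual problem expects a database query.

```python
import re

# One compiled pattern per flag, in output order:
# has_start, has_stop, has_atat, has_ggg.
PATTERNS = (
    re.compile(r"^ATG"),             # starts with the start codon
    re.compile(r"(TAA|TAG|TGA)$"),   # ends with one of the stop codons
    re.compile(r"ATAT"),             # contains the ATAT motif
    re.compile(r"G{3,}"),            # three or more consecutive G's
)


def analyze_samples(samples):
    """Hypothetical helper: rows are (sample_id, dna_sequence, species)."""
    results = []
    for sample_id, dna_sequence, species in samples:   # step 1: load
        flags = tuple(
            int(bool(pattern.search(dna_sequence)))    # steps 2 and 3: match, then 1/0
            for pattern in PATTERNS
        )
        results.append((sample_id, dna_sequence, species, *flags))
    return sorted(results, key=lambda row: row[0])     # step 4: order by sample_id


rows = [(6, "ATATCGCGCTAG", "Zebrafish"), (3, "ATATATCGTAGCTA", "Human")]
for row in analyze_samples(rows):
    print(row)
# (3, 'ATATATCGTAGCTA', 'Human', 0, 0, 1, 0)
# (6, 'ATATCGCGCTAG', 'Zebrafish', 0, 1, 1, 0)
```

The input rows are deliberately out of order here to show that the final sort, not the insertion order, determines the output order.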
Key Takeaway
🎯 Key Insight: SQL's built-in string and pattern-matching features (LIKE, REGEXP, LEFT, RIGHT) can compute all four flags in a single query over Samples, so no procedural post-processing is needed.
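
As a rough sketch of the single-query idea, the snippet below runs one SELECT against an in-memory SQLite database through Python's sqlite3 module. Several things here are assumptions rather than part of the problem: the choice of SQLite, the column types, and the helper name run_query. SQLite has no built-in LEFT or RIGHT functions and needs a user-defined function for REGEXP, so the sketch relies on LIKE alone; a dialect that provides those functions could express the same checks differently.

```python
import sqlite3

# In SQLite a LIKE comparison evaluates to the integer 1 or 0,
# which matches the flag format in the examples.
QUERY = """
SELECT sample_id,
       dna_sequence,
       species,
       dna_sequence LIKE 'ATG%'    AS has_start,
       (dna_sequence LIKE '%TAA'
        OR dna_sequence LIKE '%TAG'
        OR dna_sequence LIKE '%TGA') AS has_stop,
       dna_sequence LIKE '%ATAT%'  AS has_atat,
       dna_sequence LIKE '%GGG%'   AS has_ggg
FROM Samples
ORDER BY sample_id
"""


def run_query(rows):
    """Hypothetical helper: load rows into a throwaway table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Samples ("
        "sample_id INTEGER PRIMARY KEY, dna_sequence TEXT, species TEXT)"
    )
    conn.executemany("INSERT INTO Samples VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(run_query([(4, "ATGGGGTCATCATAA", "Mouse")]))
# [(4, 'ATGGGGTCATCATAA', 'Mouse', 1, 1, 0, 1)]
```

Because a run of three or more G's always contains GGG, the pattern '%GGG%' covers the Triple G rule without a regular expression.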