Repeated DNA Sequences in C++


Suppose we have a DNA sequence. DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within it.

We have to write a method that finds all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

So if the input is like “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”, then the output will be ["AAAAACCCCC", "CCCCCAAAAA"].
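Before looking at the bitmask approach, it may help to see a simpler baseline for comparison. The sketch below is not the method this article uses: it counts every 10-character window directly in a hash map. The function name findRepeatedNaive is hypothetical, and the sketch assumes the input contains only the letters A, C, G, and T.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Baseline sketch (hypothetical name, not the article's method):
// count each 10-character window by its string value.
std::vector<std::string> findRepeatedNaive(const std::string& s) {
    const std::size_t kLen = 10;
    std::vector<std::string> result;
    std::unordered_map<std::string, int> count;
    // The loop body never runs when s is shorter than 10 characters.
    for (std::size_t i = 0; i + kLen <= s.size(); ++i) {
        std::string window = s.substr(i, kLen);
        // Report a window exactly once: at the moment its count reaches 2.
        if (++count[window] == 2) result.push_back(window);
    }
    return result;
}
```

This version stores a full 10-character string per window, whereas the steps that follow store a single integer per window.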

To solve this, we will follow these steps −

  • Define an array ret, n := size of s, and create two sets called visited and visited2

  • Define a map called bitVal.

  • Store the values 0, 1, 2, and 3 for A, C, G, and T respectively in bitVal.

  • mask := 0

  • for i in range 0 to n – 1

    • mask := mask * 4

    • mask := mask OR bitVal[s[i]]

    • mask := mask AND 0xFFFFF (keep only the lowest 20 bits, that is, the last 10 letters)

    • if i < 9, then just continue to the next iteration

    • if mask is in visited and mask is not in visited2, then

      • insert the substring of length 10 starting at index i – 9 into ret

      • insert mask into visited2.

    • insert mask into visited

  • return ret
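The encoding in the steps above can be looked at in isolation. Each nucleotide takes 2 bits, so 10 nucleotides fit in 20 bits, which is why the mask is ANDed with FFFFF (hexadecimal). The following is a minimal sketch of that idea only. The helper names nucleotideCode, pushNucleotide, and encodeLast10 are hypothetical, and the sketch assumes the input contains only A, C, G, and T.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: map a nucleotide to its 2-bit code (A=0, C=1, G=2, T=3).
inline std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // assumption: any other character is treated as T
    }
}

// Shift one letter into the rolling mask and drop anything older than 10 letters.
inline std::uint32_t pushNucleotide(std::uint32_t mask, char c) {
    return ((mask << 2) | nucleotideCode(c)) & 0xFFFFFu;  // keep the low 20 bits
}

// Feed a string one letter at a time; the result encodes its last 10 letters.
inline std::uint32_t encodeLast10(const std::string& s) {
    std::uint32_t mask = 0;
    for (char c : s) mask = pushNucleotide(mask, c);
    return mask;
}
```

Because every distinct 10-letter window maps to a distinct 20-bit value, comparing masks is equivalent to comparing the windows themselves. For example, encodeLast10("AAAAACCCCC") and encodeLast10("TAAAAACCCCC") give the same value, since the leading T is shifted out of the low 20 bits.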

Example (C++)

Let us see the following implementation to get a better understanding −

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>
using namespace std;
// Prints the elements of v in brackets, each followed by ", ".
template <typename T>
void print_vector(const vector<T>& v){
   cout << "[";
   for(size_t i = 0; i<v.size(); i++){
      cout << v[i] << ", ";
   }
   cout << "]"<<endl;
}
typedef long long int lli;
class Solution {
public:
   vector<string> findRepeatedDnaSequences(string s) {
      vector <string> ret;
      int n = s.size();
      set <int> visited;      // masks of every 10-letter window seen so far
      set <int> visited2;     // masks already added to ret
      map <char, int> bitVal; // 2-bit code for each nucleotide
      bitVal['A'] = 0;
      bitVal['C'] = 1;
      bitVal['G'] = 2;
      bitVal['T'] = 3;
      lli mask = 0;
      for(int i = 0; i < n; i++){
         mask <<= 2;             // make room for the next nucleotide
         mask |= bitVal[s[i]];
         mask &= 0xfffff;        // keep only the last 10 letters (20 bits)
         if(i < 9) continue;     // the first full window ends at index 9
         if(visited.count(mask) && !visited2.count(mask)){
            ret.push_back(s.substr(i - 9, 10));
            visited2.insert(mask);
         }
         visited.insert(mask);
      }
      return ret;
   }
};
int main(){
   Solution ob;
   print_vector(ob.findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
}

Input

"AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output

[AAAAACCCCC, CCCCCAAAAA, ]

Updated on: 02-May-2020
