Biopython - Advanced Sequence Operations


In this chapter, we shall discuss some of the advanced sequence features provided by Biopython.

Complement and Reverse Complement

Nucleotide sequence can be reverse complemented to get new sequence. Also, the complemented sequence can be reverse complemented to get the original sequence. Biopython provides two methods to do this functionality − complement and reverse_complement. The code for this is given below −

>>> from Bio.Alphabet import IUPAC 
>>> nucleotide = Seq('TCGAAGTCAGTC', IUPAC.ambiguous_dna) 
>>> nucleotide.complement() 

Here, the complement() method allows to complement a DNA or RNA sequence. The reverse_complement() method complements and reverses the resultant sequence from left to right. It is shown below −

>>> nucleotide.reverse_complement() 

Biopython uses the ambiguous_dna_complement variable provided by Bio.Data.IUPACData to do the complement operation.

>>> from Bio.Data import IUPACData 
>>> import pprint 
>>> pprint.pprint(IUPACData.ambiguous_dna_complement) {
   'A': 'T',
   'B': 'V',
   'C': 'G',
   'D': 'H',
   'G': 'C',
   'H': 'D',
   'K': 'M',
   'M': 'K',
   'N': 'N',
   'R': 'Y',
   'S': 'S',
   'T': 'A',
   'V': 'B',
   'W': 'W',
   'X': 'X',
   'Y': 'R'} 

GC Content

Genomic DNA base composition (GC content) is predicted to significantly affect genome functioning and species ecology. The GC content is the number of GC nucleotides divided by the total nucleotides.

To get the GC nucleotide content, import the following module and perform the following steps −

>>> from Bio.SeqUtils import GC 
>>> nucleotide = Seq("GACTGACTTCGA",IUPAC.unambiguous_dna) 
>>> GC(nucleotide) 


Transcription is the process of changing DNA sequence into RNA sequence. The actual biological transcription process is performing a reverse complement (TCAG → CUGA) to get the mRNA considering the DNA as template strand. However, in bioinformatics and so in Biopython, we typically work directly with the coding strand and we can get the mRNA sequence by changing the letter T to U.

Simple example for the above is as follows −

>>> from Bio.Seq import Seq 
>>> from Bio.Seq import transcribe 
>>> from Bio.Alphabet import IUPAC 
>>> dna_seq = Seq("ATGCCGATCGTAT",IUPAC.unambiguous_dna) >>> transcribe(dna_seq) 

To reverse the transcription, T is changed to U as shown in the code below −

>>> rna_seq = transcribe(dna_seq) 
>>> rna_seq.back_transcribe() 

To get the DNA template strand, reverse_complement the back transcribed RNA as given below −

>>> rna_seq.back_transcribe().reverse_complement() 


Translation is a process of translating RNA sequence to protein sequence. Consider a RNA sequence as shown below −

>>> rna_seq = Seq("AUGGCCAUUGUAAU",IUPAC.unambiguous_rna) 
>>> rna_seq 

Now, apply translate() function to the code above −

>>> rna_seq.translate() 
Seq('MAIV', IUPACProtein())

The above RNA sequence is simple. Consider RNA sequence, AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGA and apply translate() −

>>> rna.translate() 
Seq('MAIVMGR*KGAR', HasStopCodon(IUPACProtein(), '*'))

Here, the stop codons are indicated with an asterisk ’*’.

It is possible in translate() method to stop at the first stop codon. To perform this, you can assign to_stop=True in translate() as follows −

>>> rna.translate(to_stop = True) 
Seq('MAIVMGR', IUPACProtein())

Here, the stop codon is not included in the resulting sequence because it does not contain one.

Translation Table

The Genetic Codes page of the NCBI provides full list of translation tables used by Biopython. Let us see an example for standard table to visualize the code −

>>> from Bio.Data import CodonTable 
>>> table = CodonTable.unambiguous_dna_by_name["Standard"] 
>>> print(table) 
Table 1 Standard, SGC0
   | T       | C       | A       | G       | 
 T | TTT F   | TCT S   | TAT Y   | TGT C   | T
 T | TTC F   | TCC S   | TAC Y   | TGC C   | C
 T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
 T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G 
 C | CTT L   | CCT P   | CAT H   | CGT R   | T
 C | CTC L   | CCC P   | CAC H   | CGC R   | C
 C | CTA L   | CCA P   | CAA Q   | CGA R   | A
 C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G 
 A | ATT I   | ACT T   | AAT N   | AGT S   | T
 A | ATC I   | ACC T   | AAC N   | AGC S   | C
 A | ATA I   | ACA T   | AAA K   | AGA R   | A
 A | ATG M(s)| ACG T   | AAG K   | AGG R   | G 
 G | GTT V   | GCT A   | GAT D   | GGT G   | T
 G | GTC V   | GCC A   | GAC D   | GGC G   | C
 G | GTA V   | GCA A   | GAA E   | GGA G   | A
 G | GTG V   | GCG A   | GAG E   | GGG G   | G 

Biopython uses this table to translate the DNA to protein as well as to find the Stop codon.