Biopython - Sequence


Advertisements


A sequence is series of letters used to represent an organism’s protein, DNA or RNA. It is represented by Seq class. Seq class is defined in Bio.Seq module.

Let’s create a simple sequence in Biopython as shown below −

>>> from Bio.Seq import Seq 
>>> seq = Seq("AGCT") 
>>> seq 
Seq('AGCT') 
>>> print(seq) 
AGCT

Here, we have created a simple protein sequence AGCT and each letter represents Alanine, Glycine, Cysteine and Threonine.

Each Seq object has two important attributes −

  • data − the actual sequence string (AGCT)

  • alphabet − used to represent the type of sequence. e.g. DNA sequence, RNA sequence, etc. By default, it does not represent any sequence and is generic in nature.

Alphabet Module

Seq objects contain Alphabet attribute to specify sequence type, letters and possible operations. It is defined in Bio.Alphabet module. Alphabet can be defined as below −

>>> from Bio.Seq import Seq 
>>> myseq = Seq("AGCT") 
>>> myseq 
Seq('AGCT') 
>>> myseq.alphabet 
Alphabet()

Alphabet module provides below classes to represent different types of sequences. Alphabet - base class for all types of alphabets.

SingleLetterAlphabet - Generic alphabet with letters of size one. It derives from Alphabet and all other alphabets type derives from it.

>>> from Bio.Seq import Seq 
>>> from Bio.Alphabet import single_letter_alphabet 
>>> test_seq = Seq('AGTACACTGGT', single_letter_alphabet) 
>>> test_seq 
Seq('AGTACACTGGT', SingleLetterAlphabet())

ProteinAlphabet − Generic single letter protein alphabet.

>>> from Bio.Seq import Seq 
>>> from Bio.Alphabet import generic_protein 
>>> test_seq = Seq('AGTACACTGGT', generic_protein) 
>>> test_seq 
Seq('AGTACACTGGT', ProteinAlphabet())

NucleotideAlphabet − Generic single letter nucleotide alphabet.

>>> from Bio.Seq import Seq 
>>> from Bio.Alphabet import generic_nucleotide 
>>> test_seq = Seq('AGTACACTGGT', generic_nucleotide) >>> test_seq 
Seq('AGTACACTGGT', NucleotideAlphabet())

DNAAlphabet − Generic single letter DNA alphabet.

>>> from Bio.Seq import Seq 
>>> from Bio.Alphabet import generic_dna 
>>> test_seq = Seq('AGTACACTGGT', generic_dna) 
>>> test_seq 
Seq('AGTACACTGGT', DNAAlphabet())

RNAAlphabet − Generic single letter RNA alphabet.

>>> from Bio.Seq import Seq 
>>> from Bio.Alphabet import generic_rna 
>>> test_seq = Seq('AGTACACTGGT', generic_rna) 
>>> test_seq 
Seq('AGTACACTGGT', RNAAlphabet())

Biopython module, Bio.Alphabet.IUPAC provides basic sequence types as defined by IUPAC community. It contains the following classes −

  • IUPACProtein (protein) − IUPAC protein alphabet of 20 standard amino acids.

  • ExtendedIUPACProtein (extended_protein) − Extended uppercase IUPAC protein single letter alphabet including X.

  • IUPACAmbiguousDNA (ambiguous_dna) − Uppercase IUPAC ambiguous DNA.

  • IUPACUnambiguousDNA (unambiguous_dna) − Uppercase IUPAC unambiguous DNA (GATC).

  • ExtendedIUPACDNA (extended_dna) − Extended IUPAC DNA alphabet.

  • IUPACAmbiguousRNA (ambiguous_rna) − Uppercase IUPAC ambiguous RNA.

  • IUPACUnambiguousRNA (unambiguous_rna) − Uppercase IUPAC unambiguous RNA (GAUC).

Consider a simple example for IUPACProtein class as shown below −

>>> from Bio.Alphabet import IUPAC 
>>> protein_seq = Seq("AGCT", IUPAC.protein) 
>>> protein_seq 
Seq('AGCT', IUPACProtein()) 
>>> protein_seq.alphabet

Also, Biopython exposes all the bioinformatics related configuration data through Bio.Data module. For example, IUPACData.protein_letters has the possible letters of IUPACProtein alphabet.

>>> from Bio.Data import IUPACData 
>>> IUPACData.protein_letters 
'ACDEFGHIKLMNPQRSTVWY'

Basic Operations

This section briefly explains about all the basic operations available in the Seq class. Sequences are similar to python strings. We can perform python string operations like slicing, counting, concatenation, find, split and strip in sequences.

Use the below codes to get various outputs.

To get the first value in sequence.

>>> seq_string = Seq("AGCTAGCT") 
>>> seq_string[0] 
'A'

To print the first two values.

>>> seq_string[0:2] 
Seq('AG')

To print all the values.

>>> seq_string[ : ] 
Seq('AGCTAGCT')

To perform length and count operations.

>>> len(seq_string) 
8 
>>> seq_string.count('A') 
2

To add two sequences.

>>> from Bio.Alphabet import generic_dna, generic_protein 
>>> seq1 = Seq("AGCT", generic_dna) 
>>> seq2 = Seq("TCGA", generic_dna)
>>> seq1+seq2 
Seq('AGCTTCGA', DNAAlphabet())

Here, the above two sequence objects, seq1, seq2 are generic DNA sequences and so you can add them and produce new sequence. You can’t add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence as specified below −

>>> dna_seq = Seq('AGTACACTGGT', generic_dna) 
>>> protein_seq = Seq('AGUACACUGGU', generic_protein) 
>>> dna_seq + protein_seq 
..... 
..... 
TypeError: Incompatible alphabets DNAAlphabet() and ProteinAlphabet() 
>>>

To add two or more sequences, first store it in a python list, then retrieve it using ‘for loop’ and finally add it together as shown below −

>>> from Bio.Alphabet import generic_dna 
>>> list = [Seq("AGCT",generic_dna),Seq("TCGA",generic_dna),Seq("AAA",generic_dna)] 
>>> for s in list: 
... print(s) 
... 
AGCT 
TCGA 
AAA 
>>> final_seq = Seq(" ",generic_dna) 
>>> for s in list: 
... final_seq = final_seq + s 
... 
>>> final_seq 
Seq('AGCTTCGAAAA', DNAAlphabet())

In the below section, various codes are given to get outputs based on the requirement.

To change the case of sequence.

>>> from Bio.Alphabet import generic_rna 
>>> rna = Seq("agct", generic_rna) 
>>> rna.upper() 
Seq('AGCT', RNAAlphabet())

To check python membership and identity operator.

>>> rna = Seq("agct", generic_rna) 
>>> 'a' in rna 
True 
>>> 'A' in rna 
False 
>>> rna1 = Seq("AGCT", generic_dna) 
>>> rna is rna1 
False

To find single letter or sequence of letter inside the given sequence.

>>> protein_seq = Seq('AGUACACUGGU', generic_protein) 
>>> protein_seq.find('G') 
1 
>>> protein_seq.find('GG') 
8

To perform splitting operation.

>>> protein_seq = Seq('AGUACACUGGU', generic_protein) 
>>> protein_seq.split('A') 
[Seq('', ProteinAlphabet()), Seq('GU', ProteinAlphabet()), 
   Seq('C', ProteinAlphabet()), Seq('CUGGU', ProteinAlphabet())]

To perform strip operations in the sequence.

>>> strip_seq = Seq(" AGCT ") 
>>> strip_seq 
Seq(' AGCT ') 
>>> strip_seq.strip() 
Seq('AGCT')


Advertisements