Biopython - Sequence Alignments



Sequence alignment is the process of arranging two or more sequences (of DNA, RNA or protein sequences) in a specific order to identify the region of similarity between them.

Identifying the similar region enables us to infer a lot of information like what traits are conserved between species, how close different species genetically are, how species evolve, etc. Biopython provides extensive support for sequence alignment.

Let us learn some of the important features provided by Biopython in this chapter −

Parsing Sequence Alignment

Biopython provides a module, Bio.AlignIO to read and write sequence alignments. In bioinformatics, there are lot of formats available to specify the sequence alignment data similar to earlier learned sequence data. Bio.AlignIO provides API similar to Bio.SeqIO except that the Bio.SeqIO works on the sequence data and Bio.AlignIO works on the sequence alignment data.

Before starting to learn, let us download a sample sequence alignment file from the Internet.

To download the sample file, follow the below steps −

Step 1 − Open your favorite browser and go to http://pfam.xfam.org/family/browse website. It will show all the Pfam families in alphabetical order.

Step 2 − Choose any one family having less number of seed value. It contains minimal data and enables us to work easily with the alignment. Here, we have selected/clicked PF18225 and it opens go to http://pfam.xfam.org/family/PF18225 and shows complete details about it, including sequence alignments.

Step 3 − Go to alignment section and download the sequence alignment file in Stockholm format (PF18225.alignment.seed).

Let us try to read the downloaded sequence alignment file using Bio.AlignIO as below −

Import Bio.AlignIO module

>>> from Bio import AlignIO

Read alignment using read method. read method is used to read single alignment data available in the given file. If the given file contain many alignment, we can use parse method. parse method returns iterable alignment object similar to parse method in Bio.SeqIO module.

>>> alignment = AlignIO.read(open("PF18225.alignment.seed"), "stockholm")

Print the alignment object.

>>> print(alignment)
Alignment with 5 rows and 65 columns
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_MICTH/57-121
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121
>>>

We can also check the sequences (SeqRecord) available in the alignment as well as below −

>>> for align in alignment: print(align.seq) 
... 
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIMVLAPRLTAKHPYDKVQDRNRK
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVADLMRKLDLDRPFKKLERKNRT
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVATVANQLRGRKRRAFARHREGP
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMAPMLIALNYRNRESHAQVDKKP
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMAPLFKVLSFRNREDQGLVNNKP
>>>

Pairwise Sequence Alignment

Pairwise sequence alignment compares only two sequences at a time and provides best possible sequence alignments. Pairwise is easy to understand and exceptional to infer from the resulting sequence alignment.

Biopython provides a special module, Bio.pairwise2 to identify the alignment sequence using pairwise method. Biopython applies the best algorithm to find the alignment sequence and it is par with other software.

Let us write an example to find the sequence alignment of two simple and hypothetical sequences using pairwise module. This will help us understand the concept of sequence alignment and how to program it using Biopython.

Step 1

Import the module pairwise2 with the command given below −

>>> from Bio import pairwise2

Step 2

Create two sequences, seq1 and seq2 −

>>> from Bio.Seq import Seq 
>>> seq1 = Seq("ACCGGT") 
>>> seq2 = Seq("ACGT")

Step 3

Call method pairwise2.align.globalxx along with seq1 and seq2 to find the alignments using the below line of code −

>>> alignments = pairwise2.align.globalxx(seq1, seq2)

Here, globalxx method performs the actual work and finds all the best possible alignments in the given sequences. Actually, Bio.pairwise2 provides quite a set of methods which follows the below convention to find alignments in different scenarios.

<sequence alignment type>XY

Here, the sequence alignment type refers to the alignment type which may be global or local. global type is finding sequence alignment by taking entire sequence into consideration. local type is finding sequence alignment by looking into the subset of the given sequences as well. This will be tedious but provides better idea about the similarity between the given sequences.

  • X refers to matching score. The possible values are x (exact match), m (score based on identical chars), d (user provided dictionary with character and match score) and finally c (user defined function to provide custom scoring algorithm).

  • Y refers to gap penalty. The possible values are x (no gap penalties), s (same penalties for both sequences), d (different penalties for each sequence) and finally c (user defined function to provide custom gap penalties)

So, localds is also a valid method, which finds the sequence alignment using local alignment technique, user provided dictionary for matches and user provided gap penalty for both sequences.

Step 4

Loop over the iterable alignments object and get each individual alignment object and print it.

>>> for alignment in alignments: print(alignment) 
... 
Alignment(seqA='ACCGGT', seqB='A-C-GT', score=4.0, start=0, end=6)
Alignment(seqA='ACCGGT', seqB='AC--GT', score=4.0, start=0, end=6)
Alignment(seqA='ACCGGT', seqB='A-CG-T', score=4.0, start=0, end=6)
Alignment(seqA='ACCGGT', seqB='AC-G-T', score=4.0, start=0, end=6)

Step 5

Bio.pairwise2 module provides a formatting method, format_alignment to better visualize the result −

>>> from Bio.pairwise2 import format_alignment 
>>> alignments = pairwise2.align.globalxx(seq1, seq2) 
>>> for alignment in alignments:print(format_alignment(*alignment)) 
...

ACCGGT 
| | || 
A-C-GT 
   Score=4 
   
ACCGGT 
|| || 
AC--GT 
   Score=4 

ACCGGT 
| || | 
A-CG-T 
   Score=4 

ACCGGT 
|| | | 
AC-G-T 
   Score=4

>>>

Biopython also provides another module to do sequence alignment, Align. This module provides a different set of API to simply the setting of parameter like algorithm, mode, match score, gap penalties, etc., A simple look into the Align object is as follows −

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> print(aligner)
Pairwise sequence aligner with parameters
  wildcard: None
  match_score: 1.000000
  mismatch_score: 0.000000
  open_internal_insertion_score: -1.000000
  extend_internal_insertion_score: -1.000000
  open_left_insertion_score: -1.000000
  extend_left_insertion_score: -1.000000
  open_right_insertion_score: -1.000000
  extend_right_insertion_score: -1.000000
  open_internal_deletion_score: -1.000000
  extend_internal_deletion_score: -1.000000
  open_left_deletion_score: -1.000000
  extend_left_deletion_score: -1.000000
  open_right_deletion_score: -1.000000
  extend_right_deletion_score: -1.000000
  mode: global
>>>

Support for Sequence Alignment Tools

Biopython provides interface to a lot of sequence alignment tools through Bio.Align.Applications module. Some of the tools are listed below −

  • ClustalW
  • MUSCLE
  • EMBOSS needle and water

Let us write a simple example in Biopython to create sequence alignment through the most popular alignment tool, ClustalW.

Step 1 − Download the Clustalw program from http://www.clustal.org/download/current/ and install it. Also, update the system PATH with the clustal installation path.

Step 2 − import ClustalwCommanLine from module Bio.Align.Applications.

>>> from Bio.Align.Applications import ClustalwCommandline

Step 3 − Set cmd by calling ClustalwCommanLine with input file, opuntia.fasta available in Biopython package. https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/opuntia.fasta

>>> cmd = ClustalwCommandline("clustalw2",
infile="opuntia.fasta")
>>> print(cmd)
clustalw2 -infile=fasta/opuntia.fasta

Step 4 − Calling cmd() will run the clustalw command and give an output of the resultant alignment file, opuntia.aln.

>>> stdout, stderr = cmd()

Step 5 − Read and print the alignment file as below −

>>> from Bio import AlignIO
>>> align = AlignIO.read("opuntia.aln", "clustal")
>>> print(align)
alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA
gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA
gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA
gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA
gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA
gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA
gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA
gi|6273291|gb|AF191665.1|AF191
>>>
Advertisements