Biopython - Sequence Alignments
Sequence alignment is the process of arranging two or more sequences (of DNA, RNA or protein sequences) in a specific order to identify the region of similarity between them.
Identifying the similar region enables us to infer a lot of information like what traits are conserved between species, how close different species genetically are, how species evolve, etc. Biopython provides extensive support for sequence alignment.
Let us learn some of the important features provided by Biopython in this chapter −
Parsing Sequence Alignment
Biopython provides a module, Bio.AlignIO to read and write sequence alignments. In bioinformatics, there are lot of formats available to specify the sequence alignment data similar to earlier learned sequence data. Bio.AlignIO provides API similar to Bio.SeqIO except that the Bio.SeqIO works on the sequence data and Bio.AlignIO works on the sequence alignment data.
Before starting to learn, let us download a sample sequence alignment file from the Internet.
To download the sample file, follow the below steps −
Step 1 − Open your favorite browser and go to http://pfam.xfam.org/family/browse website. It will show all the Pfam families in alphabetical order.
Step 2 − Choose any one family having less number of seed value. It contains minimal data and enables us to work easily with the alignment. Here, we have selected/clicked PF18225 and it opens go to http://pfam.xfam.org/family/PF18225 and shows complete details about it, including sequence alignments.
Step 3 − Go to alignment section and download the sequence alignment file in Stockholm format (PF18225_seed.txt).
Let us try to read the downloaded sequence alignment file using Bio.AlignIO as below −
Import Bio.AlignIO module
>>> from Bio import AlignIO
Read alignment using read method. read method is used to read single alignment data available in the given file. If the given file contain many alignment, we can use parse method. parse method returns iterable alignment object similar to parse method in Bio.SeqIO module.
>>> alignment = AlignIO.read(open("PF18225_seed.txt"), "stockholm")
Print the alignment object.
>>> print(alignment) SingleLetterAlphabet() alignment with 6 rows and 65 columns MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123 AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADIT...KKP A0A0C3NPG9_9PROT/58-119 ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_9GAMM/57-121 TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121 AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126 AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125 >>>
We can also check the sequences (SeqRecord) available in the alignment as well as below −
>>> for align in alignment: ... print(align.seq) ... MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVATVANQLRGRKRRAFARHREGP AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADITA---RLDRRREHGEHGVRKKP ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMAPMLIALNYRNRESHAQVDKKP TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMAPLFKVLSFRNREDQGLVNNKP AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIMVLAPRLTAKHPYDKVQDRNRK AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVADLMRKLDLDRPFKKLERKNRT >>>
In general, most of the sequence alignment files contain single alignment data and it is enough to use read method to parse it. In multiple sequence alignment concept, two or more sequences are compared for best subsequence matches between them and results in multiple sequence alignment in a single file.
If the input sequence alignment format contains more than one sequence alignment, then we need to use parse method instead of read method as specified below −
>>> from Bio import AlignIO >>> alignments = AlignIO.parse(open("PF18225_seed.txt"), "stockholm") >>> print(alignments) <generator object parse at 0x000001CD1C7E0360> >>> for alignment in alignments: ... print(alignment) ... SingleLetterAlphabet() alignment with 6 rows and 65 columns MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123 AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADIT...KKP A0A0C3NPG9_9PROT/58-119 ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_9GAMM/57-121 TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121 AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126 AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125 >>>
Here, parse method returns iterable alignment object and it can be iterated to get actual alignments.
Pairwise Sequence Alignment
Pairwise sequence alignment compares only two sequences at a time and provides best possible sequence alignments. Pairwise is easy to understand and exceptional to infer from the resulting sequence alignment.
Biopython provides a special module, Bio.pairwise2 to identify the alignment sequence using pairwise method. Biopython applies the best algorithm to find the alignment sequence and it is par with other software.
Let us write an example to find the sequence alignment of two simple and hypothetical sequences using pairwise module. This will help us understand the concept of sequence alignment and how to program it using Biopython.
Import the module pairwise2 with the command given below −
>>> from Bio import pairwise2
Create two sequences, seq1 and seq2 −
>>> from Bio.Seq import Seq >>> seq1 = Seq("ACCGGT") >>> seq2 = Seq("ACGT")
Call method pairwise2.align.globalxx along with seq1 and seq2 to find the alignments using the below line of code −
>>> alignments = pairwise2.align.globalxx(seq1, seq2)
Here, globalxx method performs the actual work and finds all the best possible alignments in the given sequences. Actually, Bio.pairwise2 provides quite a set of methods which follows the below convention to find alignments in different scenarios.
<sequence alignment type>XY
Here, the sequence alignment type refers to the alignment type which may be global or local. global type is finding sequence alignment by taking entire sequence into consideration. local type is finding sequence alignment by looking into the subset of the given sequences as well. This will be tedious but provides better idea about the similarity between the given sequences.
X refers to matching score. The possible values are x (exact match), m (score based on identical chars), d (user provided dictionary with character and match score) and finally c (user defined function to provide custom scoring algorithm).
Y refers to gap penalty. The possible values are x (no gap penalties), s (same penalties for both sequences), d (different penalties for each sequence) and finally c (user defined function to provide custom gap penalties)
So, localds is also a valid method, which finds the sequence alignment using local alignment technique, user provided dictionary for matches and user provided gap penalty for both sequences.
>>> test_alignments = pairwise2.align.localds(seq1, seq2, blosum62, -10, -1)
Here, blosum62 refers to a dictionary available in the pairwise2 module to provide match score. -10 refers to gap open penalty and -1 refers to gap extension penalty.
Loop over the iterable alignments object and get each individual alignment object and print it.
>>> for alignment in alignments: ... print(alignment) ... ('ACCGGT', 'A-C-GT', 4.0, 0, 6) ('ACCGGT', 'AC--GT', 4.0, 0, 6) ('ACCGGT', 'A-CG-T', 4.0, 0, 6) ('ACCGGT', 'AC-G-T', 4.0, 0, 6)
Bio.pairwise2 module provides a formatting method, format_alignment to better visualize the result −
>>> from Bio.pairwise2 import format_alignment >>> alignments = pairwise2.align.globalxx(seq1, seq2) >>> for alignment in alignments: ... print(format_alignment(*alignment)) ... ACCGGT | | || A-C-GT Score=4 ACCGGT || || AC--GT Score=4 ACCGGT | || | A-CG-T Score=4 ACCGGT || | | AC-G-T Score=4 >>>
Biopython also provides another module to do sequence alignment, Align. This module provides a different set of API to simply the setting of parameter like algorithm, mode, match score, gap penalties, etc., A simple look into the Align object is as follows −
>>> from Bio import Align >>> aligner = Align.PairwiseAligner() >>> print(aligner) Pairwise sequence aligner with parameters match score: 1.000000 mismatch score: 0.000000 target open gap score: 0.000000 target extend gap score: 0.000000 target left open gap score: 0.000000 target left extend gap score: 0.000000 target right open gap score: 0.000000 target right extend gap score: 0.000000 query open gap score: 0.000000 query extend gap score: 0.000000 query left open gap score: 0.000000 query left extend gap score: 0.000000 query right open gap score: 0.000000 query right extend gap score: 0.000000 mode: global >>>
Support for Sequence Alignment Tools
Biopython provides interface to a lot of sequence alignment tools through Bio.Align.Applications module. Some of the tools are listed below −
- EMBOSS needle and water
Let us write a simple example in Biopython to create sequence alignment through the most popular alignment tool, ClustalW.
Step 1 − Download the Clustalw program from http://www.clustal.org/download/current/ and install it. Also, update the system PATH with the “clustal” installation path.
Step 2 − import ClustalwCommanLine from module Bio.Align.Applications.
>>> from Bio.Align.Applications import ClustalwCommandline
Step 3 − Set cmd by calling ClustalwCommanLine with input file, opuntia.fasta available in Biopython package. https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/opuntia.fasta
>>> cmd = ClustalwCommandline("clustalw2", infile="/path/to/biopython/sample/opuntia.fasta") >>> print(cmd) clustalw2 -infile=fasta/opuntia.fasta
Step 4 − Calling cmd() will run the clustalw command and give an output of the resultant alignment file, opuntia.aln.
>>> stdout, stderr = cmd()
Step 5 − Read and print the alignment file as below −
>>> from Bio import AlignIO >>> align = AlignIO.read("/path/to/biopython/sample/opuntia.aln", "clustal") >>> print(align) SingleLetterAlphabet() alignment with 7 rows and 906 columns TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191 >>>