Biopython - Sequence I/O Operations



Biopython provides a module, Bio.SeqIO to read and write sequences from and to a file (any stream) respectively. It supports nearly all file formats available in bioinformatics. Most of the software provides different approach for different file formats. But, Biopython consciously follows a single approach to present the parsed sequence data to the user through its SeqRecord object.

Let us learn more about SeqRecord in the following section.

SeqRecord

Bio.SeqRecord module provides SeqRecord to hold meta information of the sequence as well as the sequence data itself as given below −

  • seq − It is an actual sequence.

  • id − It is the primary identifier of the given sequence. The default type is string.

  • name − It is the Name of the sequence. The default type is string.

  • description − It displays human readable information about the sequence.

  • annotations − It is a dictionary of additional information about the sequence.

The SeqRecord can be imported as specified below

from Bio.SeqRecord import SeqRecord

Let us understand the nuances of parsing the sequence file using real sequence file in the coming sections.

Parsing Sequence File Formats

This section explains about how to parse two of the most popular sequence file formats, FASTA and GenBank.

FASTA

FASTA is the most basic file format for storing sequence data. Originally, FASTA is a software package for sequence alignment of DNA and protein developed during the early evolution of Bioinformatics and used mostly to search the sequence similarity.

Biopython provides an example FASTA file and it can be accessed at https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.fasta.

Download and save this file into your Biopython sample directory as ‘orchid.fasta’.

Bio.SeqIO module provides parse() method to process sequence files and can be imported as follows −

from Bio.SeqIO import parse

parse() method contains two arguments, first one is file handle and second is file format.

>>> file = open('path/to/biopython/sample/orchid.fasta') 
>>> for record in parse(file, "fasta"): 
...    print(record.id) 
... 
gi|2765658|emb|Z78533.1|CIZ78533 
gi|2765657|emb|Z78532.1|CCZ78532 
.......... 
.......... 
gi|2765565|emb|Z78440.1|PPZ78440 
gi|2765564|emb|Z78439.1|PBZ78439 
>>>

Here, the parse() method returns an iterable object which returns SeqRecord on every iteration. Being iterable, it provides lot of sophisticated and easy methods and let us see some of the features.

next()

next() method returns the next item available in the iterable object, which we can be used to get the first sequence as given below −

>>> first_seq_record = next(SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta')) 
>>> first_seq_record.id 'gi|2765658|emb|Z78533.1|CIZ78533' 
>>> first_seq_record.name 'gi|2765658|emb|Z78533.1|CIZ78533' 
>>> first_seq_record.seq Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()) 
>>> first_seq_record.description 'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA' 
>>> first_seq_record.annotations 
{} 
>>>

Here, seq_record.annotations is empty because the FASTA format does not support sequence annotations.

list comprehension

We can convert the iterable object into list using list comprehension as given below

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') 
>>> all_seq = [seq_record for seq_record in seq_iter] >>> len(all_seq) 
94 
>>>

Here, we have used len method to get the total count. We can get sequence with maximum length as follows −

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') 
>>> max_seq = max(len(seq_record.seq) for seq_record in seq_iter) 
>>> max_seq 
789 
>>>

We can filter the sequence as well using the below code −

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') 
>>> seq_under_600 = [seq_record for seq_record in seq_iter if len(seq_record.seq) < 600] 
>>> for seq in seq_under_600: 
...    print(seq.id) 
... 
gi|2765606|emb|Z78481.1|PIZ78481 
gi|2765605|emb|Z78480.1|PGZ78480 
gi|2765601|emb|Z78476.1|PGZ78476 
gi|2765595|emb|Z78470.1|PPZ78470 
gi|2765594|emb|Z78469.1|PHZ78469 
gi|2765564|emb|Z78439.1|PBZ78439 
>>>

Writing a collection of SqlRecord objects (parsed data) into file is as simple as calling the SeqIO.write method as below −

file = open("converted.fasta", "w) 
SeqIO.write(seq_record, file, "fasta")

This method can be effectively used to convert the format as specified below −

file = open("converted.gbk", "w) 
SeqIO.write(seq_record, file, "genbank")

GenBank

It is a richer sequence format for genes and includes fields for various kinds of annotations. Biopython provides an example GenBank file and it can be accessed at https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.fasta.

Download and save file into your Biopython sample directory as ‘orchid.gbk’

Since, Biopython provides a single function, parse to parse all bioinformatics format. Parsing GenBank format is as simple as changing the format option in the parse method.

The code for the same has been given below −

>>> from Bio import SeqIO 
>>> from Bio.SeqIO import parse 
>>> seq_record = next(parse(open('path/to/biopython/sample/orchid.gbk'),'genbank')) 
>>> seq_record.id 
'Z78533.1' 
>>> seq_record.name 
'Z78533' 
>>> seq_record.seq Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA()) 
>>> seq_record.description 
'C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA' 
>>> seq_record.annotations {
   'molecule_type': 'DNA', 
   'topology': 'linear', 
   'data_file_division': 'PLN', 
   'date': '30-NOV-2006', 
   'accessions': ['Z78533'], 
   'sequence_version': 1, 
   'gi': '2765658', 
   'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 
   'source': 'Cypripedium irapeanum', 
   'organism': 'Cypripedium irapeanum', 
   'taxonomy': [
      'Eukaryota', 
      'Viridiplantae', 
      'Streptophyta', 
      'Embryophyta', 
      'Tracheophyta', 
      'Spermatophyta', 
      'Magnoliophyta', 
      'Liliopsida', 
      'Asparagales', 
      'Orchidaceae', 
      'Cypripedioideae', 
      'Cypripedium'], 
   'references': [
      Reference(title = 'Phylogenetics of the slipper orchids (Cypripedioideae:
      Orchidaceae): nuclear rDNA ITS sequences', ...), 
      Reference(title = 'Direct Submission', ...)
   ]
}
Advertisements