Biopython - Introduction



Biopython is one of the largest and most widely used bioinformatics packages for Python. It contains a number of sub-modules for common bioinformatics tasks. It is developed by Chapman and Chang and is written mainly in Python, with some C code to speed up the computationally intensive parts. It runs on Windows, Linux, macOS and other platforms.

Biopython is a collection of Python modules that provide functions for working with DNA, RNA and protein sequences, such as reverse complementing a DNA string or finding motifs in protein sequences. It provides parsers for the major sequence databases and file formats, such as GenBank, SwissProt and FASTA, as well as wrappers and interfaces for running other popular bioinformatics tools and services, such as NCBI BLASTN and Entrez, from within Python. It has sibling projects such as BioPerl, BioJava and BioRuby.
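
To make the two operations named above concrete, here is a minimal sketch in plain Python. It is not Biopython's implementation: the helper names reverse_complement and find_motif and the example sequences are made up for illustration, and only unambiguous uppercase DNA bases are handled. In Biopython itself, the Seq object offers a reverse_complement() method for the first operation.

```python
import re

# Complement of each unambiguous DNA base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_motif(sequence, pattern):
    """Return the 0-based start of every match, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    return [m.start() for m in re.finditer(f"(?=(?:{pattern}))", sequence)]


# Hypothetical input sequences, chosen only for illustration.
dna = "ATGCGT"
rc = reverse_complement(dna)

# N-glycosylation-style motif: N, then any residue except P,
# then S or T, then any residue except P.
protein = "MKNVSANGTPLNRSA"
hits = find_motif(protein, r"N[^P][ST][^P]")

print(rc)    # ACGCAT
print(hits)  # [2, 11]
```

The candidate at position 6 (NGTP) is rejected because its last residue is a proline, which is why only two positions are reported.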

Features

Biopython is portable and clear, with an easy-to-learn syntax. Some of its salient features are listed below:

  • Interpreted, interactive and object oriented.

  • Supports FASTA, PDB, GenBank, Blast, SCOP, PubMed/Medline, ExPASy-related formats.

  • Tools for reading, writing and converting sequence file formats.

  • Tools to manage protein structures.

  • BioSQL − Standard set of SQL tables for storing sequences plus features and annotations.

  • Access to online services and databases, including NCBI services (Blast, Entrez, PubMed) and ExPASy services (SwissProt, Prosite).

  • Access to local services, including Blast, ClustalW and EMBOSS.
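
As an illustration of the format support listed above, the following sketch reads FASTA records with Bio.SeqIO. It assumes Biopython is installed (for example with pip install biopython); the two records are made-up data, and an in-memory StringIO handle stands in for a real file.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA data; in practice this would come from a file on disk.
fasta_text = """>seq1 hypothetical example record
ATGGCCATTG
>seq2 another made-up record
GGGTTTAAA
"""

# SeqIO.parse yields one SeqRecord per entry in the input handle.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # record.id is the first word of the header line; record.seq is a Seq object.
    print(record.id, len(record.seq), record.seq)
```

The same call accepts other format names, such as "genbank", so the surrounding code does not need to change when the input format does.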

Goals

The goal of Biopython is to provide simple, standard and extensive access to bioinformatics through the Python language. The specific goals of Biopython are listed below:

  • Providing standardized access to bioinformatics resources.

  • High-quality, reusable modules and scripts.

  • Fast array manipulation that can be used in Cluster code, PDB, NaiveBayes and Markov Model.

  • Genomic data analysis.

Advantages

Biopython requires very little code and offers the following advantages:

  • Provides a microarray data type used in clustering.

  • Reads and writes Tree-View type files.

  • Supports structure data used for PDB parsing, representation and analysis.

  • Supports journal data used in Medline applications.

  • Supports the BioSQL database schema, a standard that is widely used across bioinformatics projects.

  • Supports parser development by providing modules to parse a bioinformatics file into a format specific record object or a generic class of sequence plus features.

  • Clear, cookbook-style documentation.

Sample Case Study

Let us look at some use cases (population genetics, RNA structure, etc.) and see how Biopython plays an important role in these fields:

Population Genetics

Population genetics is the study of genetic variation within a population, and involves the examination and modeling of changes in the frequencies of genes and alleles in populations over space and time.

Biopython provides the Bio.PopGen module for population genetics. This module contains functions for gathering information about classic population genetics.
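
To show what "changes in the frequencies of alleles over time" means in practice, here is a small worked example in plain Python. It does not use Bio.PopGen; the function name allele_frequencies and the genotype counts are invented for illustration, and a single locus with two alleles (A and a) is assumed.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q), the frequencies of alleles A and a, from genotype counts."""
    # Each diploid individual carries two alleles.
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    # AA individuals contribute two copies of A, Aa individuals one.
    p = (2 * n_AA + n_Aa) / total_alleles
    return p, 1 - p


# Made-up genotype counts (AA, Aa, aa) sampled at two time points.
p_before, q_before = allele_frequencies(36, 48, 16)
p_after, q_after = allele_frequencies(25, 50, 25)

# Change in the frequency of allele A between the two samples.
delta_p = p_after - p_before

print(f"p before: {p_before:.2f}, p after: {p_after:.2f}, change: {delta_p:+.2f}")
```

With these counts, the frequency of A falls from 0.60 to 0.50 between the two samples.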

RNA Structure

The three major biological macromolecules that are essential for life are DNA, RNA and protein. Proteins are the workhorses of the cell and play an important role as enzymes. DNA (deoxyribonucleic acid) is considered the "blueprint" of the cell. It carries all the genetic information required for the cell to grow, take in nutrients, and propagate. RNA (ribonucleic acid) acts as a "DNA photocopy" in the cell.

Biopython provides the Seq object (in the Bio.Seq module), which represents sequences of nucleotides, the building blocks of DNA and RNA.
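
The "photocopy" idea can be sketched with the Seq class from the Bio.Seq module, assuming Biopython is installed. The DNA string below is an arbitrary example of a coding strand, not real data from any particular gene.

```python
from Bio.Seq import Seq

# Arbitrary coding-strand DNA, chosen only for illustration.
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")

# Transcription: the RNA copy has the same letters with T replaced by U.
messenger_rna = coding_dna.transcribe()

# Translation: read the RNA three bases at a time, stopping at the first stop codon.
protein = messenger_rna.translate(to_stop=True)

print(messenger_rna)  # AUGGCCAUUGUAAUGGGCCGCUGA
print(protein)        # MAIVMGR
```

Seq objects behave much like Python strings (they support len(), slicing and str()), which keeps this kind of code short.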
