Biopython - PDB Module


Biopython provides the Bio.PDB module to manipulate polypeptide structures. The PDB (Protein Data Bank) is the largest protein structure resource available online. It hosts a large number of distinct protein structures, including protein-protein, protein-DNA and protein-RNA complexes.

To import the names exposed by the Bio.PDB module, type the command below −

from Bio.PDB import *

Protein Structure File Formats

The PDB distributes protein structures in three different formats −

  • The XML-based file format, which is not supported by Biopython
  • The pdb file format, which is a specially formatted text file
  • The PDBx/mmCIF file format

PDB files distributed by the Protein Data Bank may contain formatting errors that make them ambiguous or difficult to parse. The Bio.PDB module attempts to deal with these errors automatically.
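To make "specially formatted text file" concrete: coordinate records in a pdb file use fixed column positions rather than delimiters. The following is a minimal sketch, using only the standard library, that slices one ATOM record by column. The helper name parse_atom_record and the sample record (including its occupancy and B-factor values) are made up for illustration; Bio.PDB does this work for you and adds error handling.

```python
def parse_atom_record(line):
    """Slice one fixed-column ATOM/HETATM record into named fields."""
    return {
        "record": line[0:6].strip(),     # columns 1-6: record type
        "serial": int(line[6:11]),       # columns 7-11: atom serial number
        "name": line[12:16].strip(),     # columns 13-16: atom name
        "resname": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],               # column 22: chain identifier
        "resseq": int(line[22:26]),      # columns 23-26: residue number
        "coord": (
            float(line[30:38]),          # columns 31-38: x
            float(line[38:46]),          # columns 39-46: y
            float(line[46:54]),          # columns 47-54: z
        ),
        "element": line[76:78].strip(),  # columns 77-78: element symbol
    }


# Illustrative record, assembled field by field so the column widths are visible.
sample = (
    "ATOM".ljust(6) + "1".rjust(5) + " " + " N".ljust(4) + " " + "ASP" + " " + "L"
    + "1".rjust(4) + " " * 4
    + "18.490".rjust(8) + "73.260".rjust(8) + "44.160".rjust(8)
    + "1.00".rjust(6) + "20.00".rjust(6) + " " * 10 + "N".rjust(2)
)

atom = parse_atom_record(sample)
print(atom["name"], atom["resname"], atom["chain"], atom["coord"])
```

Because every field lives at a fixed offset, a single shifted character corrupts the fields after it, which is one reason a tolerant parser is useful.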

The Bio.PDB module implements two different parsers, one for the mmCIF format and one for the pdb format.

Let us learn how to parse each of these formats in detail −

mmCIF Parser

Let us download an example structure file in mmCIF format from the PDB server using the command below −

>>> pdbl = PDBList() 
>>> pdbl.retrieve_pdb_file('2FAT', pdir = '.', file_format = 'mmCif')

This will download the specified file (2fat.cif) from the server and store it in the current working directory.

Here, PDBList provides options to list and download files from the online PDB server. The retrieve_pdb_file method takes the ID of the structure to download, without a file extension. It also accepts the download directory, pdir, and the format of the file, file_format. The possible values of file_format are as follows −

  • “mmCif” (default, PDBx/mmCif file)
  • “pdb” (format PDB)
  • “xml” (PDBML/XML format)
  • “mmtf” (highly compressed)
  • “bundle” (PDB formatted archive for large structure)
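The options above can be wrapped in a small helper. This is only a sketch and not part of Biopython: the function name fetch_structure_file, the SUPPORTED_FORMATS tuple, and the argument check are assumptions made for illustration. It relies on PDBList.retrieve_pdb_file, which returns the path of the downloaded file, and calling it requires network access.

```python
from Bio.PDB import PDBList

# File formats listed above for retrieve_pdb_file.
SUPPORTED_FORMATS = ("mmCif", "pdb", "xml", "mmtf", "bundle")


def fetch_structure_file(pdb_id, file_format="mmCif", out_dir="."):
    """Download one entry and return the local path reported by PDBList."""
    # Reject typos such as "cif" before any network request is made.
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unknown file_format {file_format!r}; expected one of {SUPPORTED_FORMATS}"
        )
    pdbl = PDBList()
    return pdbl.retrieve_pdb_file(pdb_id, pdir=out_dir, file_format=file_format)


# Example call (performs a download, so it is left commented out):
# path = fetch_structure_file("2FAT", file_format="pdb")
```

Passing file_format explicitly, rather than relying on the default, keeps a script's behaviour stable across Biopython versions.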

To load a cif file, use Bio.PDB.MMCIFParser as specified below −

>>> parser = MMCIFParser(QUIET = True)
>>> data = parser.get_structure("2FAT", "2fat.cif")

Here, QUIET = True suppresses the warnings raised while parsing the file; without it, any warnings are printed as the file is parsed. get_structure parses the file and returns the structure with the id 2FAT (the first argument).

Now, check the structure using the below command −

>>> data 
<Structure id = 2FAT>

To get the type, use the built-in type function as shown below −

>>> print(type(data)) 
<class 'Bio.PDB.Structure.Structure'>

We have successfully parsed the file and obtained the structure of the protein. We will learn the details of the protein structure and how to access them later in this chapter.

PDB Parser

Let us download an example structure file in PDB format from the PDB server using the command below −

>>> pdbl = PDBList() 
>>> pdbl.retrieve_pdb_file('2FAT', pdir = '.', file_format = 'pdb')

This will download the specified file (pdb2fat.ent) from the server and store it in the current working directory.

To load a pdb file, use Bio.PDB.PDBParser as specified below −

>>> parser = PDBParser(PERMISSIVE = True, QUIET = True) 
>>> data = parser.get_structure("2fat","pdb2fat.ent")

Here, get_structure works the same way as it does for MMCIFParser. The PERMISSIVE option tells the parser to tolerate common formatting errors in the file and to parse the protein data as flexibly as possible.
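get_structure also accepts an open file handle in place of a file name, which makes it easy to try the parser without downloading anything. The sketch below feeds a tiny made-up two-chain structure to PDBParser through io.StringIO. The atom_line helper, the residue names, the coordinates, and the structure id "toy" are illustrative assumptions, not data from 2FAT.

```python
import io

from Bio.PDB import PDBParser


def atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-column ATOM record (occupancy and B-factor are placeholders)."""
    return (
        f"{'ATOM':<6s}{serial:>5d} {name:<4s} {resname:>3s} {chain:1s}{resseq:>4d}{'':4s}"
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}{'':10s}{element:>2s}\n"
    )


# Two residues, one per chain; chain ids mirror the L and H chains of 2FAT.
pdb_text = "".join([
    atom_line(1, " N", "ALA", "L", 1, 0.0, 0.0, 0.0, "N"),
    atom_line(2, " CA", "ALA", "L", 1, 1.458, 0.0, 0.0, "C"),
    atom_line(3, " N", "GLY", "H", 1, 10.0, 0.0, 0.0, "N"),
    atom_line(4, " CA", "GLY", "H", 1, 11.458, 0.0, 0.0, "C"),
])

parser = PDBParser(PERMISSIVE=True, QUIET=True)
structure = parser.get_structure("toy", io.StringIO(pdb_text))

chain_ids = [chain.id for chain in structure[0]]
print(structure.get_id(), type(structure).__name__, chain_ids)
```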

Now, check the structure and its type with the code snippet given below −

>>> data 
<Structure id = 2fat> 
>>> print(type(data)) 
<class 'Bio.PDB.Structure.Structure'>

The header attribute of the structure is a dictionary that stores metadata read from the file. To list its keys, type the command below −

>>> print(data.header.keys())
dict_keys([
   'name', 'head', 'deposition_date', 'release_date', 'structure_method', 'resolution', 
   'structure_reference', 'journal_reference', 'author', 'compound', 'source', 
   'keywords', 'journal']) 

To get the name, use the following code −

>>> print(data.header["name"]) 
an anti-urokinase plasminogen activator receptor (upar) antibody: crystal 
structure and binding epitope

You can also check the date and resolution with the below code −

>>> print(data.header["release_date"])
2006-11-14
>>> print(data.header["resolution"])
1.77
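The same kind of header dictionary can be obtained without building the full structure. The following is a minimal sketch, assuming the parse_pdb_header function exported by Bio.PDB and two hand-written header records. The TITLE text is made up, and the resolution simply mirrors the value printed above.

```python
import io

from Bio.PDB import parse_pdb_header

# Hand-written header records in PDB fixed-column layout (illustrative only).
header_text = (
    "TITLE".ljust(10) + "TOY STRUCTURE FOR ILLUSTRATION\n"
    + "REMARK".ljust(9) + "2 RESOLUTION." + "1.77".rjust(8) + " ANGSTROMS.\n"
)

header = parse_pdb_header(io.StringIO(header_text))

# The parser lower-cases the title; strip() guards against padding.
name = header["name"].strip()
# .get() avoids a KeyError when a field is absent from a given file.
resolution = header.get("resolution")
print(name, resolution)
```

Using .get() for optional fields is a defensive choice, since not every file carries every header record.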

PDB Structure

The 2FAT structure is composed of a single model containing two chains.

  • chain L, containing a number of residues
  • chain H, containing a number of residues

Each residue is composed of multiple atoms, each having a 3D position represented by (x, y, z) coordinates.

Let us learn how to walk this hierarchy, from the structure down to its atoms, in the sections below −


The Structure.get_models() method returns an iterator over the models. It is used as shown below −

>>> model = data.get_models() 
>>> model 
<generator object get_models at 0x103fa1c80> 
>>> models = list(model) 
>>> models
[<Model id = 0>]
>>> type(models[0]) 
<class 'Bio.PDB.Model.Model'>

Here, a Model describes exactly one 3D conformation. It contains one or more chains.


The Model.get_chains() method returns an iterator over the chains. It is used as shown below −

>>> chains = list(models[0].get_chains()) 
>>> chains 
[<Chain id = L>, <Chain id = H>] 
>>> type(chains[0]) 
<class 'Bio.PDB.Chain.Chain'>

Here, Chain describes a proper polypeptide structure, i.e., a consecutive sequence of bound residues.


The Chain.get_residues() method returns an iterator over the residues. It is used as shown below −

>>> residue = list(chains[0].get_residues())
>>> len(residue) 
>>> residue1 = list(chains[1].get_residues()) 
>>> len(residue1) 

A Residue holds the atoms that belong to one amino acid.
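Putting the levels together, the sketch below walks a structure from model to chain to residue and counts what it finds. It parses a tiny made-up structure from memory instead of 2FAT, so the atom_line helper, the residue names, and the coordinates are illustrative assumptions. It also shows that each level can be indexed by the id of its child, as an alternative to the iterators.

```python
import io

from Bio.PDB import PDBParser


def atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-column ATOM record (occupancy and B-factor are placeholders)."""
    return (
        f"{'ATOM':<6s}{serial:>5d} {name:<4s} {resname:>3s} {chain:1s}{resseq:>4d}{'':4s}"
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}{'':10s}{element:>2s}\n"
    )


# Chain L has two residues, chain H has one (toy data).
pdb_text = "".join([
    atom_line(1, " N", "ALA", "L", 1, 0.0, 0.0, 0.0, "N"),
    atom_line(2, " CA", "ALA", "L", 1, 1.458, 0.0, 0.0, "C"),
    atom_line(3, " N", "GLY", "L", 2, 3.8, 0.0, 0.0, "N"),
    atom_line(4, " N", "SER", "H", 1, 10.0, 0.0, 0.0, "N"),
])
structure = PDBParser(QUIET=True).get_structure("toy", io.StringIO(pdb_text))

# Walk the hierarchy with the iterator methods and tally residues per chain.
residue_counts = {}
for model in structure.get_models():
    for chain in model.get_chains():
        residues = list(chain.get_residues())
        n_atoms = sum(len(list(res.get_atoms())) for res in residues)
        residue_counts[chain.id] = len(residues)
        print(f"model {model.id} chain {chain.id}: {len(residues)} residues, {n_atoms} atoms")

# Index access: model 0 -> chain "L" -> residue 1 -> atom "CA".
ca = structure[0]["L"][1]["CA"]
print(ca.get_name(), ca.get_parent().get_resname())
```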


The Residue.get_atoms() method returns an iterator over the atoms, as shown below −

>>> atoms = list(residue[0].get_atoms()) 
>>> atoms 
[<Atom N>, <Atom CA>, <Atom C>, <Atom O>, <Atom CB>, <Atom CG>, <Atom OD1>, <Atom OD2>]

An Atom holds the 3D coordinates of an atom. The get_vector() method returns them as a Vector object, as shown below −

>>> atoms[0].get_vector() 
<Vector 18.49, 73.26, 44.16>

The three numbers are the x, y and z coordinate values.
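Coordinates are also available as a NumPy array through get_coord(), and subtracting one Atom from another yields the distance between them. A small sketch follows. The two Atom objects are constructed by hand rather than read from 2FAT, and the position of the second atom is made up (the first position shifted by 3, 4, 0) so that the distance is easy to check.

```python
import numpy as np
from Bio.PDB.Atom import Atom

# Atom(name, coord, bfactor, occupancy, altloc, fullname, serial_number, element)
# B-factor and occupancy are placeholder values.
n_atom = Atom(
    "N", np.array([18.49, 73.26, 44.16], dtype="f"), 20.0, 1.0, " ", " N  ", 1, element="N"
)
ca_atom = Atom(
    "CA", np.array([21.49, 77.26, 44.16], dtype="f"), 20.0, 1.0, " ", " CA ", 2, element="C"
)

print(n_atom.get_vector())  # Vector object
print(n_atom.get_coord())   # NumPy array holding x, y, z

# Atom - Atom gives the Euclidean distance between the two positions.
distance = ca_atom - n_atom
# The same value via Vector arithmetic: difference vector, then its length.
distance_from_vectors = (ca_atom.get_vector() - n_atom.get_vector()).norm()
print(round(float(distance), 2), round(float(distance_from_vectors), 2))
```

The array form is convenient for bulk numerical work, while Vector is handy for geometric operations such as lengths and angles.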