BLAST: Basic Local Alignment Search Tool


Keywords

BLAST, bioinformatics, heuristic algorithm, program, biological sequence, proteins, nucleotides, database sequences, maximal segment pair, alignments, DNA and RNA sequences.

Introduction

In bioinformatics, BLAST (Basic Local Alignment Search Tool) is an algorithm and program for comparing primary biological sequence information, such as the amino acid sequences of proteins or the nucleotides of DNA and RNA sequences.

A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (the query) with a library or database of sequences and identify database sequences that resemble the query above a certain threshold. The heuristic algorithm it uses is much faster than other approaches, such as calculating an optimal alignment, although it does not guarantee that the optimal alignment is found. BLAST is available on the web on the NCBI website. Different types of BLAST are available according to the query sequences and the target databases.

Process

BLAST finds similar sequences by locating short matches between the two sequences. This process of finding the initial short matches is called seeding. Only after this first match does BLAST begin to make local alignments. The heuristic algorithm of BLAST locates all common three-letter words between the sequence of interest and the hit sequence or sequences from the database. These word matches are then used to build an alignment.
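The seeding step can be sketched in a few lines of Python. This is a minimal illustration, assuming exact word matches only and a word length of three; the function names `word_positions` and `find_seeds` and the toy sequences are hypothetical and are not part of any BLAST distribution.

```python
from collections import defaultdict


def word_positions(sequence, k=3):
    """Map every k-letter word in `sequence` to the positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return positions


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where both sequences share a k-letter word."""
    query_words = word_positions(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        word = subject[j:j + k]
        # .get() avoids adding empty entries to the defaultdict.
        for i in query_words.get(word, []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    # Toy protein fragments, made up for illustration.
    print(find_seeds("PQGEFG", "AAPQGAA"))
```

Indexing the query once and then scanning the subject keeps the lookup cost per subject word roughly constant, which is the reason for building the word table first.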

These words must satisfy a requirement of having a score of at least the threshold T when compared using a scoring matrix. The threshold score T determines whether a particular word will be included in the alignment. Once seeding is done, the short alignment is extended in both directions, and each extension raises or lowers its score. If this score is higher than a pre-determined T, the alignment will be included in the results given by BLAST. If the score is lower than this pre-determined T, the alignment will cease to extend, preventing areas of poor alignment from being included in the BLAST results.
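The role of the threshold T in selecting words can be illustrated as follows. This is a sketch under stated assumptions: `toy_score` is a hypothetical scoring rule (+5 for identical residues, -4 otherwise) standing in for a real substitution matrix such as BLOSUM62, and `neighborhood` is an illustrative name, not a BLAST function.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical substitution score: +5 for identity, -4 otherwise."""
    return 5 if a == b else -4


def neighborhood(word, threshold, score=toy_score):
    """Return every word of the same length whose score against `word` is at least `threshold`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


if __name__ == "__main__":
    # With this toy scoring, T = 15 keeps only the exact word,
    # while T = 6 also admits words with a single mismatch.
    print(len(neighborhood("PQG", 15)), len(neighborhood("PQG", 6)))
```

Raising T shrinks the word list and speeds up the search at the cost of sensitivity; lowering it has the opposite effect.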

Algorithm

The main idea of BLAST is that a statistically significant alignment often contains high-scoring segment pairs (HSPs). The combination of speed and relatively good accuracy is among the key technical innovations of the BLAST programs. An overview of the BLAST algorithm (a protein-to-protein search) is as follows.

  • Remove low-complexity regions or sequence repeats in the query sequence.

  • Make a k-letter word list of the query sequence.

  • List the possible matching words.

  • Organize the remaining high-scoring words into an efficient search tree.

  • Repeat the previous two steps for each k-letter word in the query sequence.

  • Scan the database sequences for exact matches with the remaining high-scoring words.

  • Extend the exact matches to high-scoring segment pairs (HSPs).

  • List all the HSPs in the database whose score is high enough to be considered.

  • Evaluate the significance of the HSP score.

  • Make two or more HSP regions into a longer alignment.

  • Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.

  • Report every match whose expect score is lower than a threshold parameter E.
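Two of the steps above, extending a seed into an HSP and evaluating its significance, can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it assumes ungapped extension that stops once the running score falls more than `x_drop` below the best score seen, a hypothetical `toy_score` (+5 match, -4 mismatch) in place of a real scoring matrix, and the Karlin-Altschul estimate E = K·m·n·e^(-λS), where the statistical parameters λ and K depend on the scoring system and must be supplied by the caller.

```python
import math


def toy_score(a, b):
    """Hypothetical substitution score: +5 for identity, -4 otherwise."""
    return 5 if a == b else -4


def _extend(query, subject, i, j, step, score, x_drop):
    """Walk from (i, j) in direction `step`; return (best score gain, residues kept)."""
    best = running = kept = taken = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        running += score(query[i], subject[j])
        taken += 1
        if running > best:
            best, kept = running, taken
        elif best - running > x_drop:
            break  # score dropped too far below the best: stop extending
        i += step
        j += step
    return best, kept


def extend_seed(query, subject, q_pos, s_pos, k, score=toy_score, x_drop=5):
    """Extend a k-letter seed in both directions.

    Returns (hsp_score, query_start, subject_start, hsp_length).
    """
    seed = sum(score(query[q_pos + n], subject[s_pos + n]) for n in range(k))
    right, r_len = _extend(query, subject, q_pos + k, s_pos + k, 1, score, x_drop)
    left, l_len = _extend(query, subject, q_pos - 1, s_pos - 1, -1, score, x_drop)
    return seed + left + right, q_pos - l_len, s_pos - l_len, l_len + k + r_len


def expect_value(raw_score, query_len, db_len, lam, k_param):
    """Karlin-Altschul estimate of chance hits scoring at least `raw_score`."""
    return k_param * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    hsp = extend_seed("WPQGEFGW", "CPQGEFGC", 1, 1, 3)
    print(hsp)
    # lam and k_param below are illustrative placeholders, not values
    # fitted to toy_score.
    print(expect_value(hsp[0], 8, 1000, lam=0.3, k_param=0.1))
```

A match would then be reported only if its expect value falls below the chosen threshold E, so a higher HSP score or a smaller database both make a hit more likely to be reported.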

Program

BLAST is a family of programs that can either be downloaded and run as the command-line utility "blastall" (the legacy interface; current NCBI releases ship the BLAST+ suite instead) or accessed for free over the web. A handful of different BLAST programs are available. They vary in the query sequence input, the database being searched, and what is being compared. These programs and their details are listed below. Of these programs, BLASTn and BLASTp are the most used.

Nucleotide-nucleotide BLAST (blastn)

Given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.

Protein-protein BLAST (blastp)

Given a protein query, returns the most similar protein sequences from the protein database that the user specifies.

Position-specific iterative BLAST (PSI-BLAST) (blastpgp)

This program is used to find distant relatives of a protein. PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST.

Nucleotide 6-frame translation-protein (blastx)

This program compares the six-frame conceptual translation products of a nucleotide query sequence against a protein sequence database to find a protein-coding gene in a genomic sequence or to see if the cDNA corresponds to a known protein.
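The six-frame conceptual translation can be sketched as below. This is a minimal illustration assuming the standard genetic code and an unambiguous DNA alphabet; the helper names are hypothetical, and codons containing other characters are rendered as "X".

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six translated sequences is then compared against the protein database as if it were a separate protein query.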

Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx)

This program is the slowest of the BLAST family. The purpose of tblastx is to find very distant relationships between nucleotide sequences.

Protein-nucleotide 6-frame translation (tblastn)

This program compares a protein query against all six reading frames of a nucleotide sequence database. It may be used to map a protein to genomic DNA.

Large numbers of query sequences (megablast)

When comparing large numbers of input sequences via the command-line BLAST, "megablast" is much faster than running BLAST multiple times.
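For readers who run BLAST locally, the sketch below shows one way to assemble a command line from Python. It assumes the newer BLAST+ executables (blastn, blastp, blastx, tblastn, tblastx) rather than the legacy "blastall" wrapper, and it assumes they are installed and on the PATH. The file and database names are hypothetical placeholders, and `build_blast_command` and `run_blast` are illustrative helpers, not part of BLAST.

```python
import shutil
import subprocess

PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def build_blast_command(program, query, db, evalue=10.0, task=None):
    """Build an argument list for a BLAST+ search with tabular output."""
    if program not in PROGRAMS:
        raise ValueError(f"unknown BLAST+ program: {program}")
    cmd = [
        program,
        "-query", query,          # FASTA file with one or more query sequences
        "-db", db,                # name of a formatted BLAST database
        "-evalue", str(evalue),   # report hits with an expect value below this
        "-outfmt", "6",           # tab-separated output
    ]
    if task is not None:
        cmd += ["-task", task]    # e.g. "megablast" for blastn
    return cmd


def run_blast(cmd):
    """Run the command and return its standard output as text."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # "queries.fasta" and "my_nt_db" are placeholder names.
    print(build_blast_command("blastn", "queries.fasta", "my_nt_db",
                              evalue=1e-5, task="megablast"))
```

Building the argument list separately from running it makes the command easy to inspect or log before any search is launched.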

Uses of BLAST

BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.

Identifying species

BLAST can help to correctly identify a species or to find homologous species. This can be useful when working with a DNA sequence from an unknown species.

Locating domains

When working with a protein sequence, you can input it into BLAST to locate known domains within the sequence of interest.

Establishing phylogeny

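With the results received through BLAST, you can create a phylogenetic tree using the BLAST webpage. Phylogenies based on BLAST alone are less reliable than those produced by purpose-built phylogenetic methods, so they are best treated as a first-pass analysis.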

DNA mapping

When working with a known species and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest to relevant sequences in the databases. NCBI has a "Magic-BLAST" tool built around BLAST for this purpose.

Comparison

When working with genes, BLAST can locate common genes in two related species, and can be used to map annotations from one organism to another.

Conclusion

BLAST has become an essential tool for biologists. Its speed and sensitivity allow scientists to compare nucleotide and protein sequences to both single sequences and large databases. Most importantly, BLAST has helped democratize bioinformatics analysis and make it accessible to any researcher over the Internet.

BLAST and its descendant applications have permitted scientists to predict the functions of genes and proteins in whole genomes, answering questions in silico that could never be answered at a lab bench or in the field. The BLAST approach permits the construction of extremely fast programs for database searching, with the further advantage of being amenable to mathematical analysis.

Updated on: 18-May-2023
