DNA Annotation: Steps Involved in Gene Annotation and the Tools Used

Keywords

DNA annotation, genome annotation, genetic material, genomic position, genomic databases, database records, eukaryotic genome, annotation tools, prokaryotic genomes.

Introduction

DNA annotation or genome annotation is the process of identifying the locations of genes and all the coding regions in a genome and determining what those genes do. An annotation is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it.

For DNA annotation, a previously unknown sequence representation of genetic material is enriched with information relating genomic position to intron-exon boundaries, regulatory sequences, repeats, gene names and protein products. This annotation is stored in genomic databases such as Mouse Genome Informatics, FlyBase, and WormBase.

The National Center for Biomedical Ontology develops tools for automated annotation of database records based on the textual descriptions of those records. Genes in a eukaryotic genome can be annotated using tools such as FINDER. Modern annotation pipelines such as MOSGA can offer a user-friendly web interface and software containerization. Modern annotation pipelines for prokaryotic genomes include Bakta, Prokka and PGAP.

Steps Involved in Gene Annotation

Genome annotation consists of three main steps.

Identify portions of the genome that do not code for proteins.
Identify elements in the genome, a process called gene prediction.
Attach biological information to these elements.

Automatic annotation tools attempt to perform these steps through computer analysis, as opposed to manual annotation (curation) which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.

A simple method of gene annotation relies on homology-based search tools, like BLAST, to search for homologous genes in specific databases; the resulting information is then used to annotate genes and genomes. As more information is added to the annotation platform, manual annotators can resolve discrepancies between genes that were given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their subsystems approach. Other databases, such as Ensembl, rely on curated data sources as well as a range of different software tools in their automated genome annotation pipeline.
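
The homology-based idea can be sketched in a few lines of Python. The sketch below assumes the search results are available as BLAST tabular output with the default twelve columns, keeps the highest-scoring hit per query below an E-value cutoff, and copies that hit's description onto the query gene. The function names, the cutoff of 1e-5, the "hypothetical protein" fallback and the example data are all illustrative assumptions, not part of any particular annotation platform.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6) with the default fields.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query_id: (subject_id, bitscore)} with the top-scoring hit per query."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue  # hit is too weak to trust (cutoff is an assumption)
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore)
    return best


def transfer_annotations(hits, subject_descriptions):
    """Copy the description of the best-matching subject onto each query gene."""
    return {
        query: subject_descriptions.get(subject, "hypothetical protein")
        for query, (subject, _score) in hits.items()
    }


# Made-up example data, for illustration only.
EXAMPLE_TABLE = "\n".join([
    "gene1\tP00001\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-100\t350",
    "gene1\tP00002\t40.0\t180\t108\t2\t5\t184\t3\t182\t1e-10\t80.5",
    "gene2\tP00003\t30.0\t50\t35\t1\t1\t50\t1\t50\t0.5\t25.0",
])
EXAMPLE_DESCRIPTIONS = {"P00001": "alcohol dehydrogenase"}

if __name__ == "__main__":
    hits = best_hits(EXAMPLE_TABLE)
    print(transfer_annotations(hits, EXAMPLE_DESCRIPTIONS))
```

In this toy data, gene1 inherits the description of its strongest hit, while gene2 receives no annotation because its only hit is above the E-value cutoff. Two genes can easily end up with the same transferred label this way, which is the kind of discrepancy manual curation is meant to resolve.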

There are two types of DNA annotation:

Structural annotation consists of the identification of genomic elements. Finding the locations of ORFs, coding regions and regulatory motifs, as well as determining the gene structure, are examples of structural annotation.
Functional annotation involves attaching biological information to genomic elements, by determining which biochemical and biological functions they have, the regulatory and interaction networks they participate in, and their expression.

These steps may involve both biological experiments and in silico analysis. Proteogenomics-based approaches use information from expressed proteins, often derived from mass spectrometry, to improve genomic annotations. A variety of software tools, such as MAKER, have been developed that allow scientists to view and share genome annotations.

Genome annotation is an active area of investigation and involves several different organizations in the life science community which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means.

Tools Used in Gene Annotation

The first task is to identify the parts of the genome that code for proteins. This step of annotation is called "structural annotation". It covers the identification and location of open reading frames (ORFs), the identification of gene structures and coding regions, and the location of regulatory motifs. Galaxy contains several tools for structural annotation. Tools for gene prediction include Augustus (for eukaryotes and prokaryotes) and glimmer3 (only for prokaryotes).
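
To make the ORF identification step concrete, here is a deliberately naive Python sketch that scans all six reading frames for stretches running from an ATG start codon to the next in-frame stop codon. It is not how Augustus or glimmer3 work internally (those rely on trained statistical models); the function names, the ATG-only start rule, the standard stop codons and the minimum-length parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Yield (strand, frame, start, end, orf_sequence) for ATG...stop ORFs.

    start and end are 0-based, end-exclusive offsets on the strand that was
    scanned. min_codons counts every codon from the start codon through the
    stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # remember the first start codon in this frame
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ORF after this stop


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

A scan like this finds candidate coding regions in intron-free sequence only; it ignores splicing, alternative start codons and regulatory signals, which is why dedicated gene predictors are used in practice.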

Augustus is used for gene prediction. The genome sequence is supplied as input in a FASTA file and, once the appropriate model organism is chosen, output in GFF (generic feature format) is obtained. Augustus provides three output files: GFF3, coding sequences (CDS) and protein sequences.
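
Because GFF3 is a plain tab-separated format with nine columns (seqid, source, type, start, end, score, strand, phase, attributes), predicted features are easy to inspect with a short script. The Python sketch below parses GFF3 text and sums the coding length per parent transcript. The example records are made up and only resemble what a gene predictor might emit; the helper names are hypothetical.

```python
from collections import Counter, namedtuple
from urllib.parse import unquote

Feature = namedtuple(
    "Feature",
    "seqid source type start end score strand phase attributes",
)


def parse_gff3(text):
    """Parse GFF3 text into Feature records, skipping comments and blank lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attributes = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[key.strip()] = unquote(value.strip())
        features.append(Feature(
            cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
            cols[5], cols[6], cols[7], attributes,
        ))
    return features


def cds_length_by_parent(features):
    """Sum CDS lengths per Parent ID (GFF3 coordinates are 1-based, inclusive)."""
    totals = Counter()
    for f in features:
        if f.type == "CDS":
            totals[f.attributes.get("Parent", "")] += f.end - f.start + 1
    return dict(totals)


# Made-up records, for illustration only.
EXAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "contig1\texample\tgene\t100\t899\t.\t+\t.\tID=g1",
    "contig1\texample\tmRNA\t100\t899\t.\t+\t.\tID=g1.t1;Parent=g1",
    "contig1\texample\tCDS\t100\t399\t.\t+\t0\tID=g1.t1.cds1;Parent=g1.t1",
    "contig1\texample\tCDS\t600\t899\t.\t+\t0\tID=g1.t1.cds2;Parent=g1.t1",
])

if __name__ == "__main__":
    feats = parse_gff3(EXAMPLE_GFF3)
    print(Counter(f.type for f in feats))
    print(cds_length_by_parent(feats))
```

Counting feature types and checking that each transcript's total CDS length is a multiple of three are quick sanity checks before moving on to functional annotation.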

Functional annotation: Functional gene annotation describes the biochemical and biological function of proteins. Possible analyses for annotating genes include, for example:

Similarity searches
Gene cluster prediction for secondary metabolites
Identification of transmembrane domains in protein sequences
Finding gene ontology terms
Pathway information.
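
As one example from the list above, candidate transmembrane segments can be flagged with a simple hydropathy scan. The Python sketch below averages the Kyte-Doolittle hydropathy scale over a sliding window and reports regions whose mean exceeds a cutoff. The window of 19 residues and the cutoff of 1.6 are commonly cited settings for this scale, used here as assumptions; dedicated transmembrane predictors use more elaborate models, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (higher values are more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy for every full window along the protein sequence."""
    # Unknown residue codes are scored as neutral (0.0), an arbitrary choice.
    scores = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_segments(protein, window=19, threshold=1.6):
    """Merge overlapping high-scoring windows into (start, end) spans.

    Spans are 0-based and end-exclusive. Because whole windows are merged,
    a span can extend a few residues past the hydrophobic core.
    """
    segments = []
    for i, mean in enumerate(hydropathy_profile(protein, window)):
        if mean < threshold:
            continue
        if segments and i <= segments[-1][1]:
            segments[-1] = (segments[-1][0], i + window)  # extend last span
        else:
            segments.append((i, i + window))
    return segments


if __name__ == "__main__":
    # Artificial sequence: a leucine stretch flanked by lysines.
    toy_protein = "K" * 10 + "L" * 20 + "K" * 10
    print(candidate_tm_segments(toy_protein))
```

A hit from a scan like this is only a hint that a protein may be membrane-associated; it would normally be combined with similarity searches and other evidence before a function is assigned.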

Applications

Disease diagnosis

Researchers use Gene Ontology (GO) to establish disease-gene relationships, as GO helps in the identification of novel genes and of alterations in their expression, distribution, and function under different sets of conditions, such as diseased versus healthy.

Bioremediation

A great diversity of catabolic enzymes involved in hydrocarbon degradation by some bacterial strains are encoded by genes located in their mobile genetic elements (MGEs). The study of these elements is of great importance in the field of bioremediation, since researchers have recently sought to inoculate wild or genetically modified strains with these MGEs so that the strains acquire these hydrocarbon degradation capacities.

Conclusion

Both traditional methods for genome annotation, based on homology detection, and newer approaches united under the umbrella of genome context analysis have been discussed. Although functions can be predicted, at some level of precision, for a substantial majority of genes in each sequenced prokaryotic genome, current annotations are replete with inaccuracies, inconsistencies, and incompleteness.

Specialized databases, designed as genome annotation tools, seem to be capable of dramatically improving the situation, if not solving the annotation problem completely. Prototypes of such databases already exist and function, and their extensive growth in the near future seems assured.

Swetha Roopa

Updated on: 2023-05-18T11:24:22+05:30
