Coverage in DNA Sequencing and its Types

Keywords

DNA sequencing, next-generation sequencing, genetics, sequencing cost, study design, rare variant, nucleotide.

Introduction

Coverage is one of several measures of the depth or completeness of DNA sequencing. In genetics, it describes the number of sequencing reads that are uniquely mapped to a reference and “cover” a known part of the genome. Ideally, the uniquely aligned sequencing reads are uniformly distributed across the reference genome and hence provide uniform coverage.

The number of sequencing reads that map to a known region is also an important part of coverage. In practice, coverage is not uniform, and genetic regions of interest may be underrepresented for a variety of reasons. One is that the genome itself is complex: it contains genes, noncoding DNA, repetitive sequences, and other elements that can make it difficult to align a sequencing read to the proper genomic coordinates.

At the level of a single position, coverage is defined as the number of sequenced sample bases aligned to a specific locus in a reference genome. Enough properly mapped reads are required to find and correctly identify genetic mutations.

With high sequencing coverage, researchers can find the proverbial ‘needle in the haystack’: they can identify low-frequency mutations or discover mutations in a heterogeneous sample such as a tumor biopsy. Poor coverage, whether due to an insufficient number of reads or to reads that are mapped incorrectly, makes it impossible to detect the variants of interest.

Types of Coverage

1. Sequence Coverage

Sequence coverage (or depth) is the number of unique reads that include a given nucleotide in the reconstructed sequence. Deep sequencing refers to the general concept of aiming for a high number of unique reads of each region of a sequence.
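The per-nucleotide definition above can be illustrated with a small sketch. It assumes reads have already been aligned and are given as hypothetical (start, end) intervals on the reference, 0-based with an exclusive end; the function name and toy data are illustrative and do not come from any particular tool.

```python
def per_base_depth(reads, genome_length):
    """Return a list where depth[i] is the number of reads covering position i.

    `reads` holds (start, end) alignments, 0-based with an exclusive end
    (an assumed convention for this sketch).
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (genome_length + 1)
    for start, end in reads:
        delta[start] += 1
        delta[end] -= 1

    # A running sum of the changes gives the depth at each position.
    depth = []
    running = 0
    for change in delta[:genome_length]:
        running += change
        depth.append(running)
    return depth


# Toy alignments on a 10-base reference (hypothetical data).
reads = [(0, 5), (2, 8), (4, 10)]
depth = per_base_depth(reads, 10)
print(depth)  # [1, 1, 2, 2, 3, 2, 2, 2, 1, 1]
```

Position 4 is included in all three reads, so its sequence coverage is 3, while the first and last positions are each read only once.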

Rationale

Even though the sequencing accuracy for each individual nucleotide is very high, the very large number of nucleotides in the genome means that if an individual genome is sequenced only once, there will be a significant number of sequencing errors. Furthermore, many positions in a genome contain rare single-nucleotide polymorphisms (SNPs). Hence, to distinguish between sequencing errors and true SNPs, it is necessary to increase the sequencing accuracy even further by sequencing individual genomes many times.

Ultra-deep sequencing

The term "ultra-deep" sometimes refers to even higher coverage (>100-fold), which allows detection of sequence variants in mixed populations.
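Why depth matters for mixed populations can be sketched with a simple binomial model. The sketch assumes that each read covering a site independently carries the variant with probability equal to its allele fraction, and it ignores sequencing error entirely. The 1% allele fraction and the three-read threshold are hypothetical choices for illustration, not recommended settings.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_reads):
    """P(at least `min_reads` of `depth` reads carry the variant).

    Simplified binomial model: reads are independent and error-free
    (an assumption of this sketch).
    """
    # Probability of seeing fewer than `min_reads` variant-supporting reads.
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# Hypothetical variant present in 1% of molecules, requiring 3 supporting reads.
for depth in (30, 100, 1000):
    p = detection_probability(depth, allele_fraction=0.01, min_reads=3)
    print(f"{depth:>5}x: {p:.3f}")
```

Under these assumptions the chance of seeing enough supporting reads rises steeply with depth, which is the intuition behind sequencing mixed samples far more deeply than a single homogeneous genome.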

Transcriptome sequencing

Deep sequencing of transcriptomes, also known as RNA-Seq, provides both the sequence and frequency of RNA molecules that are present at any time in a specific cell type, tissue, or organ. Counting the number of mRNAs that are encoded by individual genes provides an indicator of protein-coding potential, a major contributor to phenotype. Improving methods for RNA sequencing is an active area of research both in terms of experimental and computational methods.

Calculation

The average coverage for a whole genome can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) as N × L / G. For example, a hypothetical genome of 2,000 base pairs reconstructed from 8 reads with an average length of 500 nucleotides has 8 × 500 / 2,000 = 2× coverage. This parameter also enables one to estimate other quantities, such as the percentage of the genome covered by reads (sometimes also called breadth of coverage). High coverage in shotgun sequencing is desirable because it can overcome errors in base calling and assembly. The subject of DNA sequencing theory addresses the relationships between such quantities.
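The formula can be written as a short sketch. The breadth estimate below is an added assumption, not part of the formula itself: it uses the idealized model from DNA sequencing theory in which reads land uniformly at random, so that the expected fraction of bases covered at least once is 1 − e^(−coverage). Real data usually fall short of this because coverage is not uniform. The function names and toy numbers are illustrative.

```python
from math import exp


def mean_coverage(num_reads, read_length, genome_length):
    """Average coverage: N x L / G."""
    return num_reads * read_length / genome_length


def expected_breadth(coverage):
    """Expected fraction of bases covered at least once.

    Assumes reads are placed uniformly at random (idealized Poisson model).
    """
    return 1.0 - exp(-coverage)


# Toy numbers: 8 reads of average length 500 on a 2,000-base genome.
c = mean_coverage(num_reads=8, read_length=500, genome_length=2000)
print(c)                              # 2.0
print(round(expected_breadth(c), 3))  # 0.865
```

Even at 2× average coverage, this model predicts that roughly 13.5% of bases are never read, which is one reason average coverage alone says little about completeness.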

2. Physical Coverage

Physical coverage is the cumulative length of reads or read pairs expressed as a multiple of genome size. A distinction is sometimes made between sequence coverage and physical coverage: whereas sequence coverage is the average number of times a base is read, physical coverage is the average number of times a base is read or spanned by mate-paired reads.
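The difference between the two measures can be shown with a rough sketch for a paired library. It assumes every pair consists of two reads of equal length taken from the ends of a fragment of fixed length, and that a pair "spans" its whole fragment, including the unsequenced middle. The library sizes are hypothetical.

```python
def sequence_coverage(num_pairs, read_length, genome_length):
    """Average number of times a base is actually read."""
    # Each pair contributes two sequenced reads.
    return num_pairs * 2 * read_length / genome_length


def physical_coverage(num_pairs, fragment_length, genome_length):
    """Average number of times a base is read or spanned by a pair."""
    # Each pair spans its whole fragment (assumed fixed length here).
    return num_pairs * fragment_length / genome_length


# Hypothetical library: 1,000,000 pairs of 100-base reads taken from
# 3,000-base fragments of a 5,000,000-base genome.
print(sequence_coverage(1_000_000, 100, 5_000_000))    # 40.0
print(physical_coverage(1_000_000, 3_000, 5_000_000))  # 600.0
```

With long fragments, physical coverage can be many times higher than sequence coverage even though the number of bases actually read is unchanged.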

3. Genomic Coverage

Genomic coverage is the percentage of all base pairs or loci of the genome covered by sequencing. In terms of genomic coverage and accuracy, whole genome sequencing can broadly be classified into either of the following:

A draft sequence, covering approximately 90% of the genome at approximately 99.9% accuracy.
A finished sequence, covering more than 95% of the genome at approximately 99.99% accuracy.

Producing a truly high-quality finished sequence by this definition is very expensive. Thus, most human "whole genome sequencing" results are draft sequences.
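The percentage that defines genomic coverage can be computed directly once a per-position depth is known. The sketch below assumes a hypothetical list of depths, one entry per reference position. The `min_depth` parameter is an illustrative addition for asking what fraction of the genome is covered at least that many times.

```python
def breadth_of_coverage(depth, min_depth=1):
    """Percentage of positions whose depth is at least `min_depth`."""
    if not depth:
        raise ValueError("depth list is empty")
    covered = sum(1 for d in depth if d >= min_depth)
    return 100.0 * covered / len(depth)


# Hypothetical per-position depths for a 10-base reference.
depth = [0, 0, 3, 5, 5, 2, 0, 1, 4, 4]
print(breadth_of_coverage(depth))     # 70.0
print(breadth_of_coverage(depth, 4))  # 40.0
```

In this toy example, 70% of positions are covered by at least one read, but only 40% reach a depth of 4 or more.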

Conclusion

Adequate coverage is clearly important to ensure that the genomic region of interest can be studied with high confidence. For regions with little to no coverage, researchers frequently increase the sequencing throughput of their studies; that is, they obtain more sequencing reads and data to increase coverage of a genetic region by brute force.

However, this method is inefficient, increases costs, and does not address the underlying reasons for the poor coverage. When throughput is increased, genomic regions that already had sufficient coverage become over-represented, and those extra reads are in effect wasted. Areas that had zero coverage may still have none after simply sequencing more. A more efficient way to address coverage is a targeted sequencing approach. This ensures sufficient coverage, including in parts of the genome that may not have been accessible previously, at lower sequencing cost.

Swetha Roopa

Updated on: 18-May-2023
