Error Rates - Ngs Analysis ToolKit

Sequencing Errors and Their Causes

Sequencing errors are a common challenge in next-generation sequencing (NGS) and can affect downstream analyses, including variant calling, genome assembly, and gene expression quantification. Understanding the causes of these errors and applying quality control (QC) steps is crucial to improving data reliability.

Types of Sequencing Errors

Substitution Errors
- Occur when an incorrect base is called at a position in the read (e.g., a true "A" is reported as a "G"). This usually results from signal misinterpretation and is most frequent at positions with low base-call quality.
Insertion and Deletion (Indel) Errors
- These errors occur when a base is incorrectly added (insertion) or omitted (deletion) in the read. Indel errors are more common in long-read sequencing platforms like PacBio and Nanopore, where signal interpretation over longer reads can become less precise.
Homopolymer Errors
- These errors happen in runs of a single repeated base, such as "AAAA" or "GGGG," where sequencing technologies may struggle to determine the exact length of the run. This issue is especially prominent in technologies like Ion Torrent.
Phasing and Pre-phasing Errors (Illumina)
- Phasing errors occur when nucleotides in a cluster are out of sync during sequencing by synthesis (Illumina). When a strand lags (phasing) or moves ahead (pre-phasing) of the rest of the cluster, it can result in miscalled bases.
GC-content Bias
- Regions of the genome with very high or very low GC content tend to be covered unevenly and are more likely to contain errors. For example, regions with high GC content may be difficult to amplify, leading to low or incomplete coverage, while low-GC (AT-rich) regions may hybridize or amplify less efficiently.
Context-specific Errors
- Certain sequence contexts, such as repetitive regions or highly structured sequences (like hairpins), can lead to sequencing artifacts, especially in short-read platforms that rely on amplification.
Chimeric Reads
- Sometimes, fragments from different parts of the genome are accidentally joined during library preparation, producing reads that do not correspond to any actual sequence in the sample. These are known as chimeric reads and can lead to misalignment and false variant calls.
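
Two of the error-prone contexts above, homopolymer runs and extreme GC content, are easy to flag directly from sequence. The following is a minimal sketch in plain Python; the function names and the `min_len=4` threshold are illustrative assumptions, not part of any particular tool.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of one base with length >= min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def gc_fraction(seq):
    """Fraction of G/C among unambiguous A/C/G/T bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


read = "ACGTAAAAACGGGGGTTCA"
print(homopolymer_runs(read))  # runs of A and G, each 5 bases long
print(round(gc_fraction(read), 2))
```

Reads or reference windows flagged this way are reasonable candidates for closer inspection when indels or coverage dips appear nearby.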

Causes of Sequencing Errors

Platform-specific Limitations
- Each sequencing technology has inherent strengths and weaknesses. For example:
  - Illumina: Generally provides high accuracy but can suffer from phasing/pre-phasing errors toward the end of longer reads and from some difficulty with long homopolymers.
  - Nanopore/PacBio: These long-read technologies provide greater read length but tend to have higher error rates in raw reads, especially in the form of insertions and deletions.
Sample Quality
- Poor sample quality (e.g., degraded or fragmented DNA/RNA) can lead to incomplete or low-quality reads. Contaminants in the sample can also interfere with the sequencing chemistry, contributing to errors.
Library Preparation Errors
- Mistakes during fragmentation, adapter ligation, or PCR amplification can introduce errors. PCR amplification, in particular, can introduce bias or errors due to over-amplification of certain regions.
Cluster Formation (Illumina)
- Uneven cluster formation on the flow cell can lead to clusters that are too close together, which may cause misinterpretation of signals from adjacent clusters.
Signal Decay and Saturation
- In sequencing-by-synthesis platforms like Illumina, the fluorescent signal can decay over successive cycles, making it harder to accurately call bases at the end of longer reads. Conversely, signal saturation can occur when the fluorescent signal is too strong, leading to incorrect base calling.
Instrument Errors
- Mechanical or calibration errors in the sequencing machine can affect the accuracy of the raw reads, especially over long runs. Variability in flow cells or reagents can also lead to fluctuations in data quality.
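
To see why phasing-related errors grow with read length, consider a deliberately simplified model in which every strand in a cluster has a fixed, independent chance of lagging or running ahead at each cycle. The per-cycle rates below are hypothetical placeholders, not measured values for any instrument.

```python
def never_slipped_fraction(cycle, phasing=0.002, prephasing=0.002):
    """Fraction of strands that have not lagged or jumped ahead after `cycle` cycles.

    Assumes a constant, independent per-cycle probability of phasing and
    pre-phasing (a simplification; real chemistry is more complex).
    """
    return (1.0 - phasing - prephasing) ** cycle


for cycle in (50, 100, 150, 250):
    print(cycle, round(never_slipped_fraction(cycle), 3))
```

Under this model the in-sync fraction shrinks geometrically with cycle number, which is consistent with base-call quality dropping toward the 3' end of reads.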

Quality Control (QC) Measures

Quality control is essential for identifying and mitigating sequencing errors before downstream analysis. Key QC steps include:

Raw Read Quality Assessment
- Tools like FastQC are commonly used to evaluate the overall quality of raw sequencing reads, and MultiQC can aggregate the results from many samples into a single report. These reports include:
  - Per-base quality scores: These indicate the confidence in the base call at each position of the read. Lower quality at the ends of reads is typical, but consistently low scores may indicate issues.
  - GC-content distribution: Helps identify bias in the sequencing process.
  - Adapter contamination: Detects whether adapter sequences are still present in the reads, which can lead to errors in alignment.
  - Per-sequence quality: Shows the distribution of quality scores across all sequences to highlight variability.
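
The per-base quality scores mentioned above are Phred scores, defined as Q = -10 * log10(p), where p is the estimated probability that the base call is wrong. The sketch below decodes FASTQ quality strings, assuming the common Phred+33 ASCII encoding; the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score to the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)


def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position; reads may differ in length."""
    totals, counts = [], []
    for qs in quality_strings:
        for i, q in enumerate(phred_scores(qs, offset)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


print(phred_scores("I5!"))        # [40, 20, 0]
print(error_probability(30))      # about 1 error in 1000 calls
print(mean_quality_per_position(["IIII", "II55", "I5!"]))
```

A Q20 base has an estimated 1% chance of being wrong and a Q30 base 0.1%, which is why thresholds in that range are common starting points for trimming.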
Trimming and Filtering
- Adapter Trimming: Removing adapter sequences left over from library preparation using tools like Trimmomatic or Cutadapt ensures that downstream alignment and assembly are not affected by non-biological sequences.
- Quality-based Trimming: Bases with low quality scores can be trimmed from the ends of reads to ensure that only high-confidence bases are used for further analysis.
- Length Filtering: Reads that are too short after trimming are removed, because very short sequences tend to align ambiguously and can bias downstream analysis.
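
Quality-based trimming and length filtering can be sketched in a few lines. This example uses the simplest possible rule, dropping trailing bases whose quality falls below a threshold and then discarding short reads; dedicated tools such as Trimmomatic or Cutadapt offer more robust algorithms. The thresholds (`min_q=20`, `min_len=36`) and the `(name, seq, qual)` record layout are assumptions for illustration, and Phred+33 encoding is assumed.

```python
def trim_trailing(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_and_filter(records, min_q=20, min_len=36, offset=33):
    """Yield trimmed (name, seq, qual) records that are still at least min_len long."""
    for name, seq, qual in records:
        seq, qual = trim_trailing(seq, qual, min_q, offset)
        if len(seq) >= min_len:
            yield name, seq, qual


reads = [
    ("read1", "ACGTACGT", "IIIIII#!"),  # last two bases are low quality
    ("read2", "ACGTACGT", "I#######"),  # almost entirely low quality
]
for record in trim_and_filter(reads, min_len=4):
    print(record)
```

Here `read1` is kept with its last two bases removed, while `read2` is trimmed to a single base and dropped by the length filter.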
De-multiplexing
- In cases where multiple samples are pooled and sequenced together, barcodes or indices are used to identify each sample. Errors in de-multiplexing can lead to sample cross-contamination, so proper barcode detection and removal are critical.
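
Barcode assignment with mismatch tolerance can be illustrated as follows. This is a minimal sketch, assuming fixed-length barcodes and a hypothetical sample sheet given as a dictionary; it assigns a read only when a single barcode is the unique closest match within the allowed number of mismatches.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, barcodes, max_mismatches=1):
    """Return the sample whose barcode is the unique best match, else None.

    barcodes maps sample name -> barcode sequence (hypothetical sample sheet).
    """
    distances = {
        sample: hamming(index_read, bc)
        for sample, bc in barcodes.items()
        if len(bc) == len(index_read)
    }
    if not distances:
        return None
    best = min(distances.values())
    winners = [s for s, d in distances.items() if d == best]
    if best > max_mismatches or len(winners) != 1:
        return None  # unmatched or ambiguous: safer to leave unassigned
    return winners[0]


sample_sheet = {"sampleA": "ACGTAC", "sampleB": "TTGCAA"}
print(assign_sample("ACGTAC", sample_sheet))  # exact match
print(assign_sample("ACGTAA", sample_sheet))  # one mismatch, still assigned
print(assign_sample("GGGGGG", sample_sheet))  # no barcode close enough
```

Leaving ambiguous reads unassigned trades a small loss of data for a lower risk of cross-contamination between samples.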
Error Correction Algorithms
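- Error correction can be applied at different stages. For example, the SPAdes assembler includes a read error-correction step, while Pilon polishes a draft assembly using aligned reads. Both approaches use statistical models or combine information from multiple overlapping reads to resolve ambiguities.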
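
One common idea behind read-level error correction is the k-mer spectrum: with sufficient coverage, a true k-mer appears in many reads, while a k-mer containing a sequencing error is rare. The sketch below only flags suspicious k-mers and does not correct them, and it is not the algorithm used by any specific tool; `k` and `min_count` are illustrative assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(read, counts, k, min_count=2):
    """Start positions of k-mers in `read` seen fewer than min_count times overall."""
    return [
        i for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]


# Three identical reads plus one read with a single substitution (A -> T at index 4).
reads = ["ACGTACGT"] * 3 + ["ACGTTCGT"]
counts = kmer_counts(reads, k=4)
print(suspicious_positions("ACGTACGT", counts, k=4))  # []
print(suspicious_positions("ACGTTCGT", counts, k=4))  # [1, 2, 3, 4]
```

Every k-mer overlapping the erroneous base is rare, so the flagged positions cluster around the error; a corrector would then look for the single-base change that turns them into frequent k-mers.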
Read Alignment QC
- After mapping the reads to a reference genome, QC tools like Picard or Samtools can check for alignment statistics, such as:
  - Percentage of reads mapped: A low percentage may indicate errors in sequencing or contamination.
  - Coverage uniformity: Unusual spikes or dips in coverage can indicate sequencing bias or errors.
  - Duplicate reads: These can arise from over-amplification during PCR, leading to biased variant calls.
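
The mapped and duplicate percentages above can be derived from the FLAG field of SAM records. The following sketch parses SAM text directly to keep it dependency-free; in practice these numbers usually come from Samtools or Picard. It assumes duplicates have already been marked by an upstream tool and counts only primary alignments.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def alignment_summary(sam_lines):
    """Summarize mapping and duplication from SAM text lines (primary records only)."""
    total = mapped = duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # avoid counting the same read more than once
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
        if flag & DUPLICATE:
            duplicates += 1
    return {
        "total": total,
        "mapped": mapped,
        "duplicates": duplicates,
        "pct_mapped": 100.0 * mapped / total if total else 0.0,
        "pct_duplicate": 100.0 * duplicates / total if total else 0.0,
    }


example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(alignment_summary(example))
```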
Downstream Data Filtering
- During variant calling, filters can be applied to exclude variants that are likely due to sequencing errors, such as those with low read depth or low-quality scores.
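
A hard filter of this kind might look like the sketch below. The record layout (dictionaries with `depth` and `qual` keys) and the thresholds are hypothetical; real pipelines typically read these values from VCF fields and tune the cutoffs per platform and experiment.

```python
def filter_variants(variants, min_depth=10, min_qual=30.0):
    """Split variants into kept and rejected, recording why each was rejected."""
    kept, rejected = [], []
    for variant in variants:
        reasons = []
        if variant["depth"] < min_depth:
            reasons.append("low_depth")
        if variant["qual"] < min_qual:
            reasons.append("low_qual")
        if reasons:
            rejected.append((variant, reasons))
        else:
            kept.append(variant)
    return kept, rejected


calls = [
    {"pos": 101, "depth": 45, "qual": 60.0},
    {"pos": 202, "depth": 4, "qual": 55.0},
    {"pos": 303, "depth": 3, "qual": 12.0},
]
kept, rejected = filter_variants(calls)
print([v["pos"] for v in kept])
print([(v["pos"], why) for v, why in rejected])
```

Recording the reason for each rejection, rather than silently dropping calls, makes it easier to revisit thresholds later.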

Error Mitigation

- Increasing Read Depth: Higher coverage reduces the impact of random errors, as true biological signals are more likely to be supported by multiple reads.
- Replication: Running sequencing experiments in replicates helps identify and filter out sequencing artifacts, ensuring that observed patterns are reproducible.
- Platform Integration: Combining short- and long-read sequencing technologies can help mitigate errors unique to each platform. For example, PacBio or Nanopore long reads can resolve complex regions that are misrepresented in Illumina short reads.
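
The effect of read depth on random errors can be made concrete with a binomial calculation. Assuming errors are independent between reads and that a substitution is equally likely to produce any of the three wrong bases (both simplifications; systematic errors violate the independence assumption), the probability that at least k of n reads show the same wrong base at a site is a binomial tail. The error rate and thresholds below are illustrative.

```python
from math import comb


def prob_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


per_base_error = 0.01                 # hypothetical raw error rate
p_same_wrong_base = per_base_error / 3

for depth in (10, 30, 100):
    # Chance that 3 or more reads agree on one specific erroneous base by chance.
    print(depth, prob_at_least(depth, 3, p_same_wrong_base))
```

Requiring several supporting reads makes it very unlikely that independent random errors alone produce a false variant, but this reasoning does not protect against systematic, context-specific errors, which is where replication and platform integration help.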