Theoretical Background on Variant Calling


There are currently no established benchmarks for translating WES/WGS results into clinical knowledge, because different disorders require multiple approaches to basic genomic variant analysis as well as supplemental analyses such as disease-specific interpretation and variant prioritization. Most labs employ workflows with several steps that gradually filter and prioritize variants for phenotype cross-correlation, maximizing analysis efficiency. The remaining variants are then prioritized on additional characteristics, such as the variant's functional impact (Hedge et al., 2017). Several bioinformatics tools have advanced in their ability to prioritize candidate disease genes within disease gene loci.

The NGS bioinformatics pipeline is the set of bioinformatics algorithms used to process NGS data. It is typically a series of transformations that ingest and process massive sequencing datasets and their associated metadata using multiple software components, databases, and operating environments (hardware and operating systems). Developing a bioinformatics pipeline involves specifying a human reference genome, acknowledging the limitations of predicting copy-number and structural variation, establishing algorithms for characterizing genetic variants, evaluating publicly accessible annotation resources, and developing filtering metrics for disease-causing variants (SoRelle et al., 2020).

A comprehensive pipeline applicable to WES/WGS data analysis consists of the following steps:

  1. Preprocessing of Sequencing Data

    • Quality Control (QC):

      Assess the quality of raw sequencing reads to identify and address issues such as low-quality reads, adapter contamination, and sequencing errors. Tools like FastQC are commonly used.

    • Trimming and Filtering:

      Remove low-quality bases and adapter sequences from reads using tools such as Trimmomatic or Cutadapt.

    • Read Alignment:

      Align the cleaned reads to a reference genome using aligners like BWA (Burrows-Wheeler Aligner) or Bowtie2 (or, for RNA-seq data, a splice-aware aligner such as STAR). The aligner's SAM output is typically converted to its compressed binary form, BAM, which contains the mapped reads.

  2. Post-Alignment Processing

    • Sorting:

      Organize the aligned reads in the BAM file by their position on the reference genome. This step is typically performed using tools like SAMtools or Picard.

    • Marking Duplicates:

      Identify and mark duplicate reads that arise from PCR amplification during library preparation, which can lead to biases in variant calling. Tools like Picard’s MarkDuplicates are used.

    • Base Quality Score Recalibration (BQSR):

      Adjust the base quality scores of reads to correct systematic errors using tools like GATK (Genome Analysis Toolkit).

  3. Variant Calling

    • Call Variants:

      Detect variants (SNPs and indels) from the processed BAM file. Popular variant callers include GATK HaplotypeCaller, SAMtools/BCFtools, FreeBayes, and VarScan. This step produces a Variant Call Format (VCF) file containing the detected variants.

    • Variant Filtering:

      Apply filters to the called variants to remove false positives and retain high-confidence variants. Filtering criteria may include read depth, variant allele frequency, and quality metrics. Tools like GATK’s VariantFiltration can be used.

  4. Variant Annotation and Interpretation

    • Annotate Variants:

      Enrich the VCF file with functional information about the variants, such as their impact on genes, potential pathogenicity, and clinical relevance. Tools like SnpEff, VEP (Variant Effect Predictor), and ANNOVAR are commonly used for this purpose.

    • Interpret Variants:

      Evaluate the biological significance of the variants in the context of the disease or phenotype being studied. This involves correlating variants with known disease associations, assessing their functional impact, and integrating other data such as family history or phenotypic information.

  5. Validation and Reporting

    • Validate Variants:

      Confirm the presence of variants using additional techniques such as Sanger sequencing, especially for critical findings or those with high clinical relevance.

    • Generate Reports:

      Create comprehensive reports detailing the identified variants, their potential impact, and recommendations based on the analysis. This often involves summarizing findings in a way that is understandable and actionable for clinical or research purposes.
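The variant filtering described in step 3 can be sketched as a simple hard filter over VCF records. This is a minimal illustration, not a GATK implementation: the QUAL and DP thresholds below are arbitrary assumptions chosen for the example, and real pipelines tune such cutoffs per assay.

```python
# Minimal sketch of hard-filtering VCF data lines (illustrative thresholds).
# Assumes a simple VCF whose INFO column carries a DP (read depth) key.

def parse_info(info_field):
    """Parse a VCF INFO string like 'DP=34;MQ=60.0' into a dict."""
    out = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            out[key] = value
        else:
            out[entry] = True  # flag-type INFO key with no value
    return out

def hard_filter(vcf_lines, min_qual=30.0, min_depth=10):
    """Yield header lines unchanged and data lines passing the
    (assumed) QUAL and DP thresholds."""
    for line in vcf_lines:
        if line.startswith("#"):  # header lines pass through untouched
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual, _filt, info = fields[:8]
        depth = int(parse_info(info).get("DP", 0))
        if float(qual) >= min_qual and depth >= min_depth:
            yield line

records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t55.0\t.\tDP=34",   # passes both thresholds
    "chr1\t67890\t.\tC\tT\t12.0\t.\tDP=8",    # fails QUAL and DP
]
passed = [r for r in hard_filter(records) if not r.startswith("#")]
```

In production, tools such as GATK VariantFiltration apply expressions of this kind over many more annotations (mapping quality, strand bias, etc.) rather than QUAL and DP alone.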

Because a typical clinical bioinformatics pipeline is largely automated, adequate quality control (QC) is necessary to ensure that the resulting data are reliable, accurate, reproducible, and traceable. The computing resources required to sequence, process, store, and interpret such massive amounts of data are determined by the tools and pipelines used to evaluate them. Each diagnostic laboratory performing WES and WGS maintains its own analytical pipeline, which demonstrates the size and scope of the clinical tooling available; these pipelines frequently combine open-source, proprietary, and commercial software. The bottleneck in WGS/WES is not the sequencing itself but data management and the computational analysis of raw data. Each phase of the analysis workflow must be considered carefully to produce meaningful results, which requires careful tool selection. Most tools, however, focus on a single element of the process rather than delivering an automated pipeline that guides the researcher from start to finish. And while sequencing technology and software are essential, so are the internal parameters used by each algorithm, especially the filtering options of variant callers, which have been shown to affect overall variant call quality.
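The QC that this automation depends on ultimately rests on per-base Phred quality scores encoded in the FASTQ input. As a minimal sketch, assuming the standard Phred+33 (Sanger/Illumina 1.8+) encoding, decoding and summarizing a quality string looks like:

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded FASTQ quality string into Phred scores.
    Phred+33 (Sanger/Illumina 1.8+) encoding is assumed."""
    return [ord(char) - offset for char in quality_string]

def mean_quality(quality_string):
    """Mean Phred score of a read, as a crude per-read QC metric."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores)

# Under Phred+33, 'I' encodes Phred 40 and '#' encodes Phred 2,
# so "II##" is half high-quality, half effectively unusable bases.
q = phred_scores("II##")
```

Dedicated tools like FastQC compute far richer summaries (per-position quality distributions, adapter content, GC bias), but they all start from this same decoding.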

Laboratories analyze their data using either in-house developed or commercially available software and pipelines. Although these pipelines vary by laboratory, a reasonable first step is to filter out common variants using population databases. Each phase of the data analysis pipeline, from initial raw-data processing to downstream variant filtering, shapes the final results: poorly managed data or inappropriate tool usage can lead to erroneous interpretations or missed variants.
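The common-variant filter mentioned above can be sketched as a threshold on a population allele-frequency annotation. The `AF` INFO key and the 1% cutoff below are illustrative assumptions: in practice the frequency would come from a population database such as gnomAD, written into the VCF by an annotation tool, possibly under a different key.

```python
def is_rare(info_field, max_af=0.01):
    """Return True if the variant's annotated population allele
    frequency (assumed to sit under an 'AF' INFO key) is below
    max_af. Variants with no frequency annotation are conservatively
    retained for manual review."""
    for entry in info_field.split(";"):
        if entry.startswith("AF="):
            return float(entry.split("=", 1)[1]) < max_af
    return True  # no population frequency known: keep the variant

# AF=0.25  -> common polymorphism, filtered out
# AF=0.0004 -> rare, retained as a candidate
```

The conservative default for unannotated variants reflects the usual clinical bias: it is safer to carry an unknown variant forward for review than to silently discard it.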