How Variant Annotation is Done


Variant annotation applies computational tools and databases to systematically interpret genetic variants discovered through next-generation sequencing (NGS). The process breaks down as follows:

1. Variant Identification

Before annotation, variants (single nucleotide polymorphisms [SNPs], insertions, deletions, structural variants) must be identified by a variant calling pipeline. Raw sequencing reads (FASTQ files) are first aligned to a reference genome, producing BAM or CRAM files. Variant callers such as GATK (Genome Analysis Toolkit), FreeBayes, or DeepVariant then analyze the aligned reads and write the calls in Variant Call Format (VCF).
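
To make the VCF format concrete, the sketch below parses one VCF data line into one record per alternate allele. It is a minimal illustration in plain Python rather than a substitute for a real parser: `Variant` and `parse_vcf_line` are hypothetical names, the example coordinates are made up, and sample columns and per-allele INFO values are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Variant:
    chrom: str
    pos: int                           # 1-based position, as in VCF
    ref: str
    alt: str
    qual: Optional[float]              # None when QUAL is "."
    info: Dict[str, Union[str, bool]]  # raw INFO key/value pairs


def parse_vcf_line(line: str) -> List[Variant]:
    """Split one tab-delimited VCF data line into one Variant per ALT allele."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alts, qual, _filter, info_field = fields[:8]

    info: Dict[str, Union[str, bool]] = {}
    if info_field != ".":
        for item in info_field.split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # flags carry no "=value"

    quality = None if qual == "." else float(qual)
    return [Variant(chrom, int(pos), ref, alt, quality, dict(info))
            for alt in alts.split(",")]


# Made-up example line with two alternate alleles.
example = "chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=30;DB"
for variant in parse_vcf_line(example):
    print(variant.chrom, variant.pos, variant.ref, variant.alt, variant.info)
```

Splitting multi-allelic lines into one record per alternate allele keeps the later lookup steps simple, because databases are typically keyed on a single reference/alternate pair.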

2. Annotation Tools

There are several well-established tools used to annotate variants. These tools incorporate reference genomes, gene annotations, population data, and other functional information. Examples include:

  • ANNOVAR: Annotates variants with information from multiple databases (e.g., RefSeq, dbSNP, 1000 Genomes).
  • VEP (Variant Effect Predictor): Developed by Ensembl, it predicts the functional effects of genetic variants.
  • SnpEff/SnpSift: SnpEff annotates variants and predicts their functional impact; its companion tool SnpSift filters and manipulates the annotated VCF files.

These tools automate annotation by attaching biological context to each variant call.

3. Databases Used in Annotation

Annotation tools rely on diverse data sources, such as:

  • Population Databases: Databases like gnomAD, 1000 Genomes, and ExAC (whose data has since been incorporated into gnomAD) provide allele frequencies across global populations. This helps determine whether a variant is rare or common.
  • Disease Databases: ClinVar, OMIM, HGMD provide curated information on clinically relevant variants associated with diseases.
  • Gene Annotations: Databases like Ensembl, UCSC Genome Browser, and RefSeq provide detailed information on gene structure, transcripts, and regulatory elements.
  • Functional Prediction Tools: SIFT and PolyPhen-2 predict the impact of amino acid substitutions (e.g., whether a missense mutation is likely to disrupt protein function), while CADD scores both coding and non-coding variants.

4. Variant Annotation Workflow

The process of annotating variants generally follows these steps:

Step 1: Preprocessing

  • Quality Control: Ensure high-quality variant calls (e.g., using filters like depth of coverage, Phred scores, variant quality).
  • Format Conversion: Convert the VCF or other formats into a form compatible with the annotation tools (some tools accept only specific formats).
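
The quality-control step can be sketched as a simple filter. The thresholds below (`min_qual=30`, `min_depth=10`) and the dictionary keys are assumptions chosen for illustration; appropriate cutoffs depend on the sequencing platform, coverage, and variant caller.

```python
def phred_to_error_probability(q: float) -> float:
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_qc(call: dict, min_qual: float = 30.0, min_depth: int = 10) -> bool:
    """Keep a call only if it passed the caller's filters and clears both thresholds."""
    if call.get("filter") not in ("PASS", "."):
        return False
    qual, depth = call.get("qual"), call.get("depth")
    if qual is None or depth is None:
        return False  # treat missing evidence as a failure
    return qual >= min_qual and depth >= min_depth


# Made-up calls for illustration.
calls = [
    {"id": "v1", "filter": "PASS", "qual": 50.0, "depth": 35},
    {"id": "v2", "filter": "PASS", "qual": 12.0, "depth": 40},   # low quality
    {"id": "v3", "filter": "PASS", "qual": 60.0, "depth": 4},    # low depth
    {"id": "v4", "filter": "LowQual", "qual": 45.0, "depth": 30},
]
kept = [c["id"] for c in calls if passes_qc(c)]
print(kept)
print(phred_to_error_probability(30))
```

A Phred score of 30 corresponds to an estimated error probability of about 1 in 1,000, which is why it is a frequent starting point for a quality cutoff.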

Step 2: Gene Mapping

  • Map Variants to Genes: Each variant's genomic coordinates are compared against gene annotations. If the variant falls within a gene, tools like ANNOVAR and VEP assign it to the appropriate gene and to the exon, intron, or regulatory region it overlaps.
    • Coding Region Variants: If the variant affects coding sequences, it is annotated with changes to amino acid sequences (missense, nonsense, synonymous).
    • Non-Coding Region Variants: Variants in introns, promoters, or UTRs are annotated with their potential impact on gene regulation or splicing.
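
The two ideas in this step, locating a variant relative to a gene model and classifying a coding change, can be sketched as follows. This is a simplified illustration, not how ANNOVAR or VEP are implemented: the `Gene` structure and `GENE_A` coordinates are hypothetical, and strand, multiple transcripts, UTRs, and splice sites are ignored. The codon table is the standard genetic code.

```python
from typing import List, NamedTuple, Optional, Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int                     # 1-based, inclusive
    end: int                       # 1-based, inclusive
    exons: List[Tuple[int, int]]   # (start, end) pairs, inclusive


def locate(chrom: str, pos: int, genes: List[Gene]) -> Tuple[Optional[str], str]:
    """Return (gene name, region) for a position: exonic, intronic, or intergenic."""
    for gene in genes:
        if gene.chrom != chrom or not (gene.start <= pos <= gene.end):
            continue
        if any(start <= pos <= end for start, end in gene.exons):
            return gene.name, "exonic"
        return gene.name, "intronic"
    return None, "intergenic"


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Compare the amino acids encoded by the reference and altered codons."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


# Hypothetical gene model for illustration.
genes = [Gene("GENE_A", "chr1", 1000, 5000, [(1000, 1200), (3000, 3300)])]
print(locate("chr1", 3100, genes))
print(locate("chr1", 2000, genes))
print(classify_codon_change("GAG", "GTG"))
```

Production annotators use indexed interval structures rather than a linear scan, and they report one consequence per overlapping transcript.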

Step 3: Effect Prediction

  • Functional Impact: Functional tools predict the potential consequence of variants:
    • SIFT: Predicts whether an amino acid substitution will affect protein function based on evolutionary conservation.
    • PolyPhen-2: Assesses the possible impact of amino acid changes on the structure and function of a protein.
    • CADD: Scores the deleteriousness of both coding and non-coding variants.
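
As a rough illustration of how these scores are read together, the sketch below turns each score into a damaging/not-damaging call. The cutoffs are assumptions based on commonly cited conventions (SIFT at or below 0.05, a PHRED-scaled CADD score of 20 or more), and the PolyPhen-2 cutoff in particular varies with the model used. `summarize_predictions` is a hypothetical helper, not part of any of these tools.

```python
from typing import Dict, Optional


def cadd_phred_to_top_fraction(phred: float) -> float:
    """A PHRED-scaled CADD score of 10 means top 10% of scored variants, 20 means top 1%."""
    return 10 ** (-phred / 10)


def summarize_predictions(
    sift: Optional[float] = None,
    polyphen: Optional[float] = None,
    cadd_phred: Optional[float] = None,
    sift_cutoff: float = 0.05,      # assumed cutoff; LOW SIFT scores suggest damage
    polyphen_cutoff: float = 0.85,  # assumed cutoff; HIGH PolyPhen-2 scores suggest damage
    cadd_cutoff: float = 20.0,      # assumed cutoff on the PHRED-scaled score
) -> Dict[str, bool]:
    """Return a damaging (True) or not-damaging (False) call per available tool."""
    calls: Dict[str, bool] = {}
    if sift is not None:
        calls["SIFT"] = sift <= sift_cutoff
    if polyphen is not None:
        calls["PolyPhen-2"] = polyphen >= polyphen_cutoff
    if cadd_phred is not None:
        calls["CADD"] = cadd_phred >= cadd_cutoff
    return calls


# Made-up scores for a missense variant.
calls = summarize_predictions(sift=0.01, polyphen=0.97, cadd_phred=26.3)
print(calls)
print(sum(calls.values()), "of", len(calls), "tools predict a damaging effect")
```

Note that the score directions differ: low SIFT scores and high PolyPhen-2 or CADD scores point toward a damaging effect. Tools with no score for a variant are left out of the result rather than counted as benign.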

Step 4: Clinical Relevance

  • Compare with Known Variants: Annotators check databases like ClinVar for any known associations between a given variant and diseases.
    • If a variant matches a database entry, it carries that entry's classification: pathogenic, likely pathogenic, variant of uncertain significance (VUS), likely benign, or benign.
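
Conceptually, this comparison is a lookup keyed on chromosome, position, reference allele, and alternate allele. The sketch below uses a small in-memory dictionary as a stand-in for a local ClinVar copy; the entries are made up and `lookup_clinical_significance` is a hypothetical helper.

```python
from typing import Dict, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

# Made-up entries for illustration only; these are not real ClinVar records.
KNOWN_VARIANTS: Dict[VariantKey, str] = {
    ("chr1", 12345, "A", "G"): "pathogenic",
    ("chr2", 67890, "C", "T"): "benign",
}


def lookup_clinical_significance(
    chrom: str, pos: int, ref: str, alt: str,
    known: Dict[VariantKey, str] = KNOWN_VARIANTS,
) -> str:
    """Return the recorded classification, or 'not reported' if there is no entry."""
    return known.get((chrom, pos, ref, alt), "not reported")


print(lookup_clinical_significance("chr1", 12345, "A", "G"))
print(lookup_clinical_significance("chr1", 12345, "A", "T"))
```

An exact-match lookup like this only works when the query and the database use the same reference genome build, the same chromosome naming, and the same normalized allele representation. A missing entry means the variant has not been reported, not that it is benign.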

Step 5: Population Frequencies

  • Check Population Frequencies: Rare variants, commonly defined as those with an allele frequency below 1% in global databases like gnomAD or 1000 Genomes, are more likely candidates for causing Mendelian disease. Common variants are usually benign with respect to rare disease, although they can contribute to complex traits.
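
The frequency check reduces to a small calculation: allele frequency is the allele count (AC) divided by the total number of alleles genotyped at that site (AN). The sketch below applies the 1% threshold mentioned above. Treating a variant that is absent from the database as rare is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import Optional


def allele_frequency(allele_count: int, allele_number: int) -> Optional[float]:
    """AF = AC / AN; returns None when no alleles were genotyped at the site."""
    if allele_number == 0:
        return None
    return allele_count / allele_number


def is_rare(af: Optional[float], threshold: float = 0.01) -> bool:
    """Flag variants below the threshold; unobserved variants count as rare."""
    return af is None or af < threshold


# Made-up counts for illustration.
for ac, an in [(5, 10_000), (300, 10_000), (0, 0)]:
    af = allele_frequency(ac, an)
    print(ac, an, af, is_rare(af))
```

The threshold is a parameter because the appropriate cutoff depends on the disease model; dominant disorders with high penetrance usually call for a much stricter cutoff than 1%.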

Step 6: Variant Prioritization

  • Prioritization Criteria: Based on clinical significance, population rarity, and predicted functional impact, variants are prioritized for further investigation.
    • Tier 1: Variants already known to be pathogenic (e.g., in ClinVar).
    • Tier 2: Predicted deleterious variants in disease-related genes.
    • Tier 3: Rare variants with uncertain significance that may warrant further functional validation.
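
The three tiers can be expressed as an ordered set of rules. The sketch below is one possible reading of the criteria above, not a standard algorithm: the dictionary keys, the `disease_genes` set, and the gene names are hypothetical, and the 1% rarity threshold is carried over from the previous step.

```python
from typing import Optional, Set


def assign_tier(variant: dict, disease_genes: Set[str],
                rare_threshold: float = 0.01) -> Optional[int]:
    """Return 1, 2, or 3 following the tiers above, or None if no tier applies."""
    significance = variant.get("clinical_significance")
    af = variant.get("allele_frequency")
    rare = af is None or af < rare_threshold

    if significance == "pathogenic":
        return 1  # already known to be pathogenic
    if variant.get("predicted_deleterious") and variant.get("gene") in disease_genes:
        return 2  # predicted deleterious in a disease-related gene
    if rare and significance in (None, "uncertain significance"):
        return 3  # rare, significance unknown
    return None


# Made-up variants and gene list for illustration.
disease_genes = {"GENE_A"}
variants = [
    {"id": "v1", "gene": "GENE_B", "clinical_significance": "pathogenic"},
    {"id": "v2", "gene": "GENE_A", "predicted_deleterious": True, "allele_frequency": 0.0002},
    {"id": "v3", "gene": "GENE_C", "clinical_significance": "uncertain significance",
     "allele_frequency": 0.001},
    {"id": "v4", "gene": "GENE_C", "clinical_significance": "benign", "allele_frequency": 0.2},
]
for v in variants:
    print(v["id"], assign_tier(v, disease_genes))
```

The rules are checked in priority order, so a variant that satisfies several criteria is reported under the highest applicable tier.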

5. Output of Variant Annotation

The final result is typically a table or report that includes the following key elements for each variant:

  • Chromosome and Position: The genomic location.
  • Reference and Alternative Alleles: The nucleotide change.
  • Gene Name: The gene impacted by the variant.
  • Variant Effect: The type of mutation (e.g., synonymous, missense).
  • Functional Predictions: Scores from SIFT, PolyPhen-2, CADD, etc.
  • Population Frequency: Allele frequencies from databases like gnomAD.
  • Disease Association: Any associated disease or trait.
  • Clinical Significance: Whether the variant is pathogenic, benign, or VUS.
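
A report with these elements is often written as a tab-separated table. The sketch below uses Python's standard `csv` module; the column names are assumptions chosen to mirror the list above and do not follow the output format of any particular annotation tool. The example row is made up.

```python
import csv
import io

# Hypothetical column names mirroring the elements listed above.
COLUMNS = [
    "chrom", "pos", "ref", "alt", "gene", "effect",
    "sift", "polyphen2", "cadd_phred", "gnomad_af",
    "disease", "clinical_significance",
]


def write_report(rows, handle) -> None:
    """Write annotated variants as a tab-separated table with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        restval=".",            # placeholder for fields a row does not provide
        extrasaction="ignore",  # silently drop keys that are not report columns
    )
    writer.writeheader()
    writer.writerows(rows)


# Made-up annotated variant for illustration.
rows = [{
    "chrom": "chr1", "pos": 12345, "ref": "A", "alt": "G",
    "gene": "GENE_A", "effect": "missense",
    "sift": 0.01, "polyphen2": 0.97, "cadd_phred": 26.3,
    "gnomad_af": 0.0002, "clinical_significance": "uncertain significance",
}]
buffer = io.StringIO()
write_report(rows, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `StringIO` buffer writes the same table to disk.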

6. Challenges and Considerations

  • VUS (Variant of Uncertain Significance): A large proportion of rare variants have unknown clinical significance, which makes their interpretation challenging.
  • Interpretation of Non-Coding Variants: Non-coding variants often lack clear functional annotation, though emerging tools are improving prediction in these regions.
  • Data Integration: Combining population, clinical, and functional data into a coherent interpretation requires expert knowledge and sometimes manual curation.

7. Tools for Visualization

Some variant annotation workflows include visual tools that help researchers explore and interpret data. Tools like IGV (Integrative Genomics Viewer) provide a graphical representation of variants in the context of their genomic locations.