Predicting the Impact of Variants on Gene Function (Pathogenicity Scoring)


Understanding how genetic variants impact gene function is essential, especially in fields like clinical genomics, where it helps us determine whether a variant is likely to cause disease. Over time, scientists have developed several computational tools to predict the functional and clinical significance of variants. These tools analyze features such as evolutionary conservation, protein structure, and how common a variant is in the population to estimate its potential impact on gene function and its likelihood of causing disease.

Key Approaches to Predicting Pathogenicity

Pathogenicity scoring tools can generally be grouped into two main types:

  1. Functional Impact Prediction: These tools try to predict whether a variant will disrupt protein function or gene expression.
  2. Clinical Pathogenicity Scoring: These tools focus on assessing the chance that a variant is associated with a disease.

1. SIFT (Sorting Intolerant From Tolerant)

  • What it does: SIFT predicts whether an amino acid change will affect protein function by looking at how much that part of the protein has been conserved across different species.
  • How it works: If a position is highly conserved across related sequences, a change there is more likely to be harmful. SIFT gives scores between 0 and 1. A score ≤0.05 is predicted to be deleterious, while scores above 0.05 are predicted to be tolerated.
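
The SIFT cutoff described above can be applied in a few lines. This is only a minimal sketch: the function name `classify_sift` is hypothetical, and the labels are a convenience for illustration rather than SIFT's own output format.

```python
SIFT_CUTOFF = 0.05  # threshold stated in the text above


def classify_sift(score: float, cutoff: float = SIFT_CUTOFF) -> str:
    """Map a SIFT score in [0, 1] to a coarse label (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT scores lie between 0 and 1, got {score}")
    # Low scores mean the substitution is poorly tolerated.
    return "deleterious" if score <= cutoff else "tolerated"


for s in (0.00, 0.05, 0.32):
    print(f"{s:.2f} -> {classify_sift(s)}")
```

Note that the direction is the opposite of most other tools in this list: for SIFT, lower means more likely harmful.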

2. PolyPhen-2 (Polymorphism Phenotyping v2)

  • What it does: PolyPhen-2 predicts whether an amino acid change will impact the structure or function of a protein.
  • How it works: It uses information like how conserved the sequence is and how the structure of the protein might change. It categorizes variants as probably damaging, possibly damaging, or benign.
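
As an illustration of the three categories, the sketch below bins a PolyPhen-2 score (a value between 0 and 1) into a label. The default cutoffs are an assumption: they are the values commonly quoted for the HumDiv model in annotation resources, and they should be checked against whatever source supplies your scores. The function name is hypothetical.

```python
def classify_polyphen2(score: float,
                       probably_cutoff: float = 0.957,   # assumed HumDiv cutoff
                       possibly_cutoff: float = 0.453) -> str:  # assumed HumDiv cutoff
    """Bin a PolyPhen-2 score into one of the three categories (sketch only)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"PolyPhen-2 scores lie between 0 and 1, got {score}")
    if score >= probably_cutoff:
        return "probably damaging"
    if score >= possibly_cutoff:
        return "possibly damaging"
    return "benign"


for s in (0.99, 0.60, 0.10):
    print(f"{s:.2f} -> {classify_polyphen2(s)}")
```

Keeping the cutoffs as parameters makes it easy to swap in the values for a different PolyPhen-2 model.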

3. CADD (Combined Annotation Dependent Depletion)

  • What it does: CADD scores both coding and non-coding variants for their potential deleterious effects.
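  • How it works: It uses a combination of many features, including conservation and regulatory annotations, to come up with a score. The commonly reported scaled (PHRED-like) score is rank-based: a score of 20 or higher places a variant among the 1% most deleterious of all possible substitutions in the genome, and 30 or higher among the top 0.1%.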
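
The relationship between a PHRED-like scaled score and the "top X%" reading is a simple logarithm. The sketch below shows the arithmetic only; the function names are hypothetical and it does not reproduce how CADD computes its raw scores.

```python
import math


def phred_to_top_fraction(phred: float) -> float:
    """Fraction of substitutions ranked at least this deleterious."""
    return 10 ** (-phred / 10)


def rank_to_phred(rank: int, total: int) -> float:
    """PHRED-like score for a variant at position `rank` (1 = most deleterious)."""
    return -10 * math.log10(rank / total)


for phred in (10, 20, 30):
    print(f"scaled score {phred}: top {phred_to_top_fraction(phred):.1%}")
```

Each increase of 10 in the scaled score corresponds to a tenfold smaller fraction of variants.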

4. MutationTaster

  • What it does: MutationTaster predicts the disease-causing potential of various types of variants, including amino acid changes and small insertions or deletions.
  • How it works: It looks at evolutionary conservation, splicing effects, and other gene-related information to classify a variant as either disease-causing or polymorphism.

5. FATHMM (Functional Analysis through Hidden Markov Models)

  • What it does: FATHMM predicts the functional impact of coding and non-coding variants.
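  • How it works: It uses evolutionary conservation and sequence similarity across species to assess the likelihood of a variant being harmful. In the original FATHMM, the lower the score, the more likely the variant is harmful. Later versions such as FATHMM-MKL, which also cover non-coding variants, report a score between 0 and 1 where higher values indicate a more harmful variant, so check which version produced a score before interpreting it.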

6. REVEL (Rare Exome Variant Ensemble Learner)

  • What it does: REVEL focuses on missense variants and uses machine learning to predict their pathogenicity.
  • How it works: It trains a random forest on predictions from other tools (like SIFT, PolyPhen-2, etc.) and on conservation scores to produce a single ensemble score between 0 and 1. Higher scores mean a higher likelihood of pathogenicity.

7. PrimateAI

  • What it does: PrimateAI uses deep learning to predict the likelihood that a missense variant will be pathogenic.
  • How it works: It learns from common missense variants observed in humans and other primate species, which are presumed to be largely benign, together with sequence conservation across species. Scores range from 0 to 1, with higher scores indicating a greater chance the variant is harmful.

8. DANN (Deleterious Annotation of Genetic Variants using Neural Networks)

  • What it does: DANN predicts the pathogenicity of both coding and non-coding variants using deep learning.
  • How it works: It analyzes genomic features and annotations and assigns a score between 0 and 1, with higher scores indicating a greater likelihood of being harmful.

9. ClinPred

  • What it does: ClinPred predicts the clinical significance of a variant.
  • How it works: It integrates features like evolutionary conservation and allele frequencies to estimate whether a variant is pathogenic or benign.

10. Eigen/Eigen-PC

  • What it does: Eigen scores the deleteriousness of both coding and non-coding variants using functional genomic data.
  • How it works: It combines many genomic annotations with an unsupervised spectral approach, so it does not need a labeled set of pathogenic and benign variants for training. Eigen-PC is a related score based on the first principal component of the annotation covariance matrix.
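
To give a feel for scoring variants by the first principal component of their annotations, here is a generic PCA sketch in plain Python. It is not the published Eigen or Eigen-PC implementation: the annotation values are made up, the function names are hypothetical, and the power iteration assumes the annotations are all positively correlated so the component weights stay positive.

```python
import math


def standardize(rows):
    """Center each annotation column and scale it to unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    k = len(matrix)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def first_pc_scores(rows):
    """Project each variant onto the first principal component."""
    z = standardize(rows)
    n, k = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(k)]
           for i in range(k)]
    weights = leading_eigenvector(cov)
    return [sum(w * x for w, x in zip(weights, r)) for r in z]


# Made-up annotations per variant: (conservation, regulatory signal, protein impact)
annotations = [
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.3],
    [0.5, 0.6, 0.4],
    [0.1, 0.3, 0.2],
]
scores = first_pc_scores(annotations)
for row, score in zip(annotations, scores):
    print(row, round(score, 3))
```

The appeal of this kind of approach is that the weights come from the correlation structure of the annotations themselves, not from a curated list of known disease variants.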

11. M-CAP (Mendelian Clinically Applicable Pathogenicity)

  • What it does: M-CAP improves the classification of rare missense variants, especially in Mendelian diseases.
  • How it works: It combines existing functional prediction scores (like SIFT, PolyPhen-2, CADD) with conservation measures in a gradient boosting tree classifier to generate a final score.

12. VEST (Variant Effect Scoring Tool)

  • What it does: VEST predicts the pathogenicity of non-synonymous variants, with an emphasis on prioritizing rare missense variants that may underlie disease.
  • How it works: It uses a machine learning classifier (a random forest) to analyze multiple types of data and produce a score. Higher scores indicate a higher likelihood of pathogenicity.

13. MutPred

  • What it does: MutPred predicts how amino acid substitutions affect the molecular function of proteins.
  • How it works: It uses machine learning to analyze the molecular changes that might be caused by a variant, giving a probability score indicating the likelihood of pathogenicity.

Using Pathogenicity Scoring in Clinical Contexts

In clinical genomics, these scoring tools are often combined to make more informed predictions. The American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP) guidelines call for integrating multiple lines of evidence when classifying variants. These guidelines help clinicians place a variant in one of five categories (pathogenic, likely pathogenic, variant of uncertain significance (VUS), likely benign, or benign) by considering factors like:

  • Population data,
  • Computational predictions,
  • Functional studies,
  • Family history, and
  • Clinical observations.

Example Workflow for Predicting Pathogenicity:

  1. Identify variants: Use sequencing data to find variants.
  2. Run predictions: Apply algorithms like SIFT, PolyPhen-2, REVEL, and CADD to assess each variant.
  3. Cross-reference with databases: Compare the results with known data from resources like ClinVar and OMIM.
  4. Expert review: Have experts manually review the data to ensure consistency with clinical findings and the latest research.
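
Steps 2 to 4 of this workflow can be sketched as a small triage script. Everything here is illustrative: the variant IDs, scores, and the `KNOWN_LABELS` dictionary (a stand-in for a ClinVar or OMIM lookup) are made up, the function names are hypothetical, and the REVEL cutoff is an assumption chosen only for the example. The SIFT and CADD cutoffs are the ones given earlier in the text.

```python
# Per-tool rule: (cutoff, direction that counts as "damaging").
# SIFT and CADD cutoffs come from the text; the REVEL cutoff is an
# illustrative assumption, not a recommended value.
CUTOFFS = {
    "SIFT": (0.05, "low"),
    "CADD": (20.0, "high"),
    "REVEL": (0.5, "high"),
}

# Hypothetical stand-in for a database lookup (made-up IDs and labels).
KNOWN_LABELS = {"var_001": "pathogenic"}


def damaging_calls(scores):
    """Return the tools whose score crosses their 'damaging' cutoff."""
    calls = []
    for tool, score in scores.items():
        if tool not in CUTOFFS:
            continue  # ignore tools without a configured cutoff
        cutoff, direction = CUTOFFS[tool]
        flagged = score <= cutoff if direction == "low" else score >= cutoff
        if flagged:
            calls.append(tool)
    return sorted(calls)


def triage(variant_id, scores, known_labels=KNOWN_LABELS):
    """Summarize predictions and database status to help order expert review."""
    calls = damaging_calls(scores)
    label = known_labels.get(variant_id, "not found")
    high = len(calls) >= 2 or label == "pathogenic"
    return {
        "variant": variant_id,
        "flagged_by": calls,
        "database_label": label,
        "review_priority": "high" if high else "routine",
    }


variants = {
    "var_001": {"SIFT": 0.01, "CADD": 28.4, "REVEL": 0.81},
    "var_002": {"SIFT": 0.40, "CADD": 5.2, "REVEL": 0.10},
}
for vid, scores in variants.items():
    print(triage(vid, scores))
```

The script only orders variants for review; it does not classify them. The final call still rests with the expert review in step 4.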

Pathogenicity scoring tools provide a starting point for evaluating the potential impact of variants, using a range of data sources like evolutionary conservation and protein function. Since no single tool is perfect, it’s common practice to combine the results from multiple algorithms to make more reliable predictions. Keep in mind that many of these tools draw on the same underlying features, so agreement between them is not fully independent confirmation. In clinical settings, computational predictions serve as one supporting line of evidence, alongside population, functional, and clinical data, for deciding whether a genetic variant is likely to be disease-causing, helping clinicians make informed decisions for patients.

These algorithms have revolutionized how we analyze genetic data, especially in the context of diagnosis and personalized medicine.