
Developing an algorithm to identify cryptic splice variants in whole exome datasets

Jodie England, Dr James Poulter, Prof Colin Johnson - University of Leeds. This work was supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Health” theme within that grant & The Alan Turing.

A new tool for splice variant identification was applied to rare variant datasets from families with a history of neurodevelopmental disorders, and potential splice-altering variant candidates were recommended for further study.

Project overview
mRNA maturation includes splicing, a process that is not fully understood and is implicated in various diseases, including neurodevelopmental disorders. Because splicing is highly complex, splice variants are difficult to identify with current bioinformatic methods. This project aims to develop a pipeline for identifying pathogenic cryptic splice variants in exome datasets.

Data and methods
Three variant datasets were sourced from DNA sequencing of individuals diagnosed with a neurodevelopmental disorder and their families, and were filtered for rare variants.

SpliceAI, a deep learning-based tool that identifies splice variants in genomic data, was used to estimate the probability that each variant position in a pre-mRNA transcript acts as a splice donor, a splice acceptor, or neither. Prospective splice site variants were input into SpliceAI, using the GRCh37 human genome annotation as the reference.
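
As an illustration of what the raw output looks like, SpliceAI annotates each variant in the output VCF with a pipe-delimited INFO entry holding four delta scores (acceptor gain, acceptor loss, donor gain, donor loss) and four matching delta positions. The following is a minimal Python sketch of parsing one such entry. It is not the code used in this project (the processing here was done in R); the function name `parse_spliceai_annotation` is hypothetical, and the handling of `.` as a missing value is an assumption.

```python
from typing import Dict, Optional, Union

# Field order of the SpliceAI INFO annotation, as described in the SpliceAI README:
# ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL
SPLICEAI_FIELDS = [
    "ALLELE", "SYMBOL",
    "DS_AG", "DS_AL", "DS_DG", "DS_DL",   # delta scores (0 to 1)
    "DP_AG", "DP_AL", "DP_DG", "DP_DL",   # delta positions (bp relative to variant)
]

Value = Optional[Union[str, float, int]]


def parse_spliceai_annotation(annotation: str) -> Dict[str, Value]:
    """Parse one pipe-delimited SpliceAI annotation into a typed dictionary."""
    values = annotation.split("|")
    if len(values) != len(SPLICEAI_FIELDS):
        raise ValueError(
            f"expected {len(SPLICEAI_FIELDS)} fields, got {len(values)}"
        )
    record: Dict[str, Value] = {}
    for name, value in zip(SPLICEAI_FIELDS, values):
        if value == ".":
            # Assumption: "." marks a missing value, as elsewhere in VCF files.
            record[name] = None
        elif name.startswith("DS_"):
            record[name] = float(value)
        elif name.startswith("DP_"):
            record[name] = int(value)
        else:
            record[name] = value
    return record


# Example annotation in the format shown in the SpliceAI README.
example = parse_spliceai_annotation("T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31")
print(example["SYMBOL"], example["DS_DG"], example["DP_DG"])
```

Here the largest delta score is for donor gain (`DS_DG`), and the matching delta position (`DP_DG`) gives the offset of the predicted change from the variant.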

An R script was written to process the raw SpliceAI results and present the 10 variants with the highest delta scores (the predicted probability of splice alteration), alongside accompanying information such as the gene each variant lies within and the position of the predicted splice change relative to the variant.
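
The ranking step described above can be sketched as follows. This is a Python illustration rather than the project's R script, which is not shown in this summary. It assumes each variant has already been parsed into a dictionary with the SpliceAI delta score (`DS_*`) and delta position (`DP_*`) fields; the function names, the `variant` key, and the toy records (`GENE_A`, `GENE_B`, `GENE_C`) are all hypothetical.

```python
from typing import Dict, List

# Map each SpliceAI delta score field to a readable description of the effect.
DELTA_SCORE_LABELS = {
    "DS_AG": "acceptor gain",
    "DS_AL": "acceptor loss",
    "DS_DG": "donor gain",
    "DS_DL": "donor loss",
}


def summarise_variant(record: Dict) -> Dict:
    """Keep the largest delta score, its effect type, and its relative position."""
    score_field = max(DELTA_SCORE_LABELS, key=lambda field: record[field])
    position_field = score_field.replace("DS_", "DP_")
    return {
        "variant": record["variant"],
        "gene": record["SYMBOL"],
        "max_delta_score": record[score_field],
        "effect": DELTA_SCORE_LABELS[score_field],
        # Offset in bp from the variant to the predicted splice change.
        "relative_position": record[position_field],
    }


def top_variants(records: List[Dict], n: int = 10) -> List[Dict]:
    """Return the n variants with the highest maximum delta score."""
    summaries = [summarise_variant(record) for record in records]
    return sorted(summaries, key=lambda s: s["max_delta_score"], reverse=True)[:n]


# Toy records with made-up values, for illustration only.
toy_records = [
    {"variant": "var1", "SYMBOL": "GENE_A", "DS_AG": 0.02, "DS_AL": 0.00,
     "DS_DG": 0.45, "DS_DL": 0.01, "DP_AG": 12, "DP_AL": -3, "DP_DG": -7, "DP_DL": 30},
    {"variant": "var2", "SYMBOL": "GENE_B", "DS_AG": 0.30, "DS_AL": 0.00,
     "DS_DG": 0.00, "DS_DL": 0.00, "DP_AG": 5, "DP_AL": 1, "DP_DG": 2, "DP_DL": -9},
    {"variant": "var3", "SYMBOL": "GENE_C", "DS_AG": 0.01, "DS_AL": 0.04,
     "DS_DG": 0.00, "DS_DL": 0.02, "DP_AG": 8, "DP_AL": -2, "DP_DG": 15, "DP_DL": 4},
]

ranked = top_variants(toy_records, n=2)
for row in ranked:
    print(row)
```

Ranking on the maximum of the four delta scores is one common way to collapse them into a single value per variant; whether the original script did exactly this is not stated.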

To assess the accuracy of SpliceAI, known splice-altering variants were run through the program in a blind test alongside an equal number of non-splice-altering variants.
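
One way to quantify a blind test of this kind is to pick a delta score cut-off and compute sensitivity and specificity against the known labels. The Python sketch below shows the idea under stated assumptions: the summary does not say which cut-off or metrics were used for the blind test, so the 0.3 threshold is borrowed from the "highly likely" definition in the key findings, and the function name and the scores are hypothetical toy values, not project results.

```python
from typing import Dict, List, Tuple


def evaluate_blind_test(
    scored_variants: List[Tuple[float, bool]], threshold: float
) -> Dict[str, float]:
    """Compute sensitivity and specificity for (max_delta_score, is_splice_altering) pairs.

    A variant is called splice-altering when its score is at or above the threshold.
    """
    tp = fp = tn = fn = 0
    for score, is_splice_altering in scored_variants:
        predicted = score >= threshold
        if predicted and is_splice_altering:
            tp += 1
        elif predicted and not is_splice_altering:
            fp += 1
        elif not predicted and is_splice_altering:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Made-up scores: three known splice-altering and three non-splice-altering variants.
toy_blind_test = [
    (0.91, True), (0.62, True), (0.15, True),
    (0.03, False), (0.00, False), (0.25, False),
]

metrics = evaluate_blind_test(toy_blind_test, threshold=0.3)
print(metrics)
```

Using equal numbers of positive and negative variants, as in the blind test described above, keeps the two metrics comparable.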

Key findings
SpliceAI generated results for all three variant datasets; however, delta scores higher than 0.1 were found in only one family dataset. After processing in R, the top variants considered highly likely to cause pathogenic splice alteration (defined as a delta score of 0.3 or higher) fell within three genes: TRBV7-3, TAS2R19 and TAS2R46.
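
The two cut-offs mentioned above (higher than 0.1, and 0.3 or higher) can be applied per dataset with a short helper. The Python sketch below is illustrative only: the function name, the dataset labels, and the scores are hypothetical, and the input is assumed to be one maximum delta score per variant.

```python
from typing import Dict, List


def summarise_thresholds(
    datasets: Dict[str, List[float]], low: float = 0.1, high: float = 0.3
) -> Dict[str, Dict[str, int]]:
    """Count, per dataset, the variants scoring above `low` and at or above `high`."""
    summary = {}
    for name, scores in datasets.items():
        summary[name] = {
            "above_low": sum(1 for score in scores if score > low),
            "at_or_above_high": sum(1 for score in scores if score >= high),
        }
    return summary


# Made-up maximum delta scores for two hypothetical family datasets.
toy_datasets = {
    "family_1": [0.02, 0.12, 0.35],
    "family_2": [0.00, 0.05],
}

threshold_summary = summarise_thresholds(toy_datasets)
print(threshold_summary)
```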

TRBV7-3 encodes T Cell Receptor Beta Variable 7-3, which is involved in T cell function. This gene has no known link to neurodevelopment, but a variant in it generated the highest delta score (0.53), predicting the formation of a splice donor site. TAS2R19 and TAS2R46 encode taste receptor type 2 member proteins, which also have no known link to neurodevelopment but are involved in sensory processing; sensory processing difficulties are a common symptom of neurodevelopmental disorders. Variants in these genes generated delta scores between 0.35 and 0.39, with some increasing the likelihood of forming a splice acceptor site downstream of the variant and others decreasing it.

Running known splice-altering variants through SpliceAI confirmed that the program could accurately identify variants with splice-altering properties: high delta scores were generated for the known splice-altering variants but not for variants with no role in splice alteration. This supports the use of SpliceAI for this project.

Value of the research
This research demonstrates that SpliceAI is a viable tool for identifying splice-altering variants. The cryptic splice variants identified in this project can be studied further in a wet-lab setting to determine whether they have a role in neurodevelopmental disorders. A better understanding of how neurodevelopmental disorders arise will support the development of better therapeutic approaches in the future.

Insights

  • SpliceAI is a deep learning-based tool that identifies splice variants in genomic data and was utilised to identify splice-altering variants in datasets from families with neurodevelopmental disorders.
  • After data processing in R, candidate variants with a high probability of being splice-altering were identified and recommended for further study using wet-lab methodologies.

Research theme
Health Informatics

This project was undertaken as part of the LIDA Data Scientist Internship Programme.