Stelios Theophanous, Dr Lucy Stead, Prof David Westhead – The University of Leeds. This work was supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Health” theme within that grant & The Alan Turing Institute

Project overview
Glioblastoma (GBM) is the most common, most deadly adult brain cancer. Treatment is standardised and aggressive but almost 100% of tumours recur. Previous research has found that primary GBM tumours that undergo standard treatment exhibit a universal dysregulation in expression of genes associated with a specific chromatin remodelling protein. However, the direction of this dysregulation is patient-specific. In two thirds of cases, the genes are upregulated and in a third of cases, they are downregulated. Our hypothesis is that this epigenetic switch facilitates the inevitable treatment resistance of GBM tumours. The aim of this project is to develop a machine learning classifier that predicts the direction of the epigenetic switch based on the gene expression profile of the primary tumour.

Data and methods
Prior to developing the classifier, a list of features (informative genes) that can effectively differentiate between upregulated and downregulated samples had to be identified. Several methods of achieving this were explored, including differential gene expression analysis (DGE), principal component analysis (PCA) and recursive feature elimination (RFE). The DGE analysis was carried out using a dataset containing the raw RNA-seq counts for more than 22,000 genes across 66 paired samples of primary and recurrent GBM tumours. DGE proved more robust than PCA and RFE and was therefore used as the default feature selection method. The features identified by DGE were subsequently used to train the classifier. To identify the most robust algorithm for the classifier, the classification performance of 15 different algorithms was evaluated using a normalised version of the RNA-seq dataset. The final model was constructed using the algorithm that exhibited the best performance in terms of accuracy, sensitivity and specificity. In addition, a dataset of the patients’ clinical data was used to carry out survival analysis and to investigate whether the survival of patients with upregulated recurrent tumours differs significantly from that of patients with downregulated recurrent tumours.
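The three metrics used to rank the algorithms can be computed from the counts of correct and incorrect predictions. The following is a minimal sketch, not the project's code: the function name, the "up"/"down" labels and the toy predictions are hypothetical, and treating "up" as the positive class is an assumption.

```python
def classification_metrics(y_true, y_pred, positive="up"):
    """Return accuracy, sensitivity and specificity for binary labels.

    Assumption: `positive` marks the upregulated class; every other
    label is treated as the negative (downregulated) class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        # Fraction of all samples classified correctly.
        "accuracy": (tp + tn) / len(pairs),
        # Fraction of positive samples that were recovered.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Fraction of negative samples that were recovered.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy labels for illustration only; these are not the study's data.
y_true = ["up", "up", "up", "up", "down", "down"]
y_pred = ["up", "up", "up", "down", "down", "up"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters here because the classes are imbalanced (roughly two thirds upregulated), so a model that always predicts "up" would still reach a high accuracy.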

Key findings
The DGE analysis was carried out using the edgeR and DESeq2 approaches, each run both in R and through the online tool GlioVis. Intersecting the lists of differentially expressed genes identified by these four approaches gave a final list of 21 genes, which were used to train the classifier (Image 1).
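The intersection step amounts to keeping only the genes reported by every approach. A minimal sketch follows, assuming each approach yields a set of gene identifiers; the dictionary keys and the placeholder gene names are hypothetical and do not reflect the 21 genes found in the project.

```python
# Placeholder gene sets, one per DGE approach (illustrative only).
gene_lists = {
    "edgeR_R": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "DESeq2_R": {"GENE_A", "GENE_B", "GENE_D", "GENE_E"},
    "edgeR_GlioVis": {"GENE_A", "GENE_B", "GENE_D", "GENE_F"},
    "DESeq2_GlioVis": {"GENE_A", "GENE_D", "GENE_G"},
}

# Keep only genes called differentially expressed by all four approaches.
consensus = sorted(set.intersection(*gene_lists.values()))
print(consensus)
```

Taking the intersection is a conservative choice: a gene missed by any single approach is dropped, which trades a shorter feature list for higher confidence in each retained gene.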

The final classification was carried out using a single-hidden-layer neural network, which exhibited the best performance among all algorithms tested. The classification accuracy was 95% on the validation set but only 71% on the test set. This gap suggests that the model is overfitting: it performs well on previously encountered samples but less well on new ones. It was therefore concluded that a larger number of samples is required to develop a robust classifier.

The results of the survival analysis indicate that the median overall survival of patients is 18 months. Approximately 80% of patients survive 1 year after diagnosis; however, fewer than 25% survive 2.5 years after diagnosis. There is no significant difference in survival between patients with upregulated tumours and patients with downregulated tumours (Image 2).
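Survival curves and median survival of this kind are commonly estimated with the Kaplan-Meier method. The source does not state which estimator was used, so the sketch below is an assumption, and the follow-up times and event flags are toy values rather than the study's clinical data.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times:  follow-up time per patient (months)
    events: 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) at each death time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Patients who died or were censored at t leave the risk set.
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # Median not reached within follow-up.


# Toy follow-up data for illustration only.
times = [3, 5, 8, 12, 12, 15, 20, 24]
events = [1, 1, 0, 1, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
print(curve, median_survival(curve))
```

Comparing the upregulated and downregulated groups would additionally require a two-sample test on the curves, such as the log-rank test, which is not shown here.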

Value of the research
Stratifying patients according to the direction in which their tumour will switch may ultimately help determine a treatment course that targets and kills the tumour more effectively.


  • A larger number of samples is required to develop a robust classifier.
  • The survival rate of patients 1 year after diagnosis is approximately 80% and the survival rate 2.5 years after diagnosis is less than 25%.
  • There is no significant difference in survival between patients with upregulated tumours and patients with downregulated tumours.

Research theme

  • Neuro-oncology
  • Bioinformatics
  • Machine learning

Image 1. Venn diagram depicting the intersection of the lists of differentially expressed genes identified through 4 different methods. A total of 21 genes were identified as differentially expressed by all methods and were used for further analysis.

Image 2. Survival curves for overall survival using all samples separated by responder subtype.

This project was undertaken as part of the LIDA Data Scientist Internship Programme.