Thoughts on the LIDA Internship Programme – Vijay Kumar

Date: Friday 21 February 2020

Do you have a research or industry figure role model, and what about them inspires you?

A role model of mine would be my previous supervisor during my last year of university, Ben Varcoe. He is a professor at the University of Leeds, and chair of Experiment Quantum Information Science. I admire Professor Varcoe’s ability to apply complex theory and models in unintuitive situations to find innovative solutions to problems. This is shown in his development of a medical device that revolutionises the way magnetocardiography is used. This ability to problem solve in the most innovative of ways is something I wish to hone in my own career.

How would you describe your project in a couple of sentences to someone who has no background in data science or analytics?

Pollution has been linked to many health ailments in recent years, yet analysing its effects on a large population remains difficult because data collection is irregular and often poor. My project aims to use new statistical models to better interpret this data and uncover the underlying truths that link pollution, exercise habits and Chronic Obstructive Pulmonary Disease (COPD) hospital admissions in Leeds.

What do you see as being the real-world impact of your project research, and what are its applications?

The implications of this project are twofold. Firstly, we aim to uncover information about the underlying public health risks that pollution poses locally. This means identifying vulnerable people who may be more susceptible to pollution-related illness, uncovering the true scale of the effects pollution has on our health, seeing how other habits such as exercise can offset these effects, and being able to predict future COPD hospital admissions. Such information could be used to advise policy and initiatives in and around public health, whilst also being used in a predictive way to better plan the distribution of health resources. Secondly, the models that have been designed for this data can be used in many other areas of statistics. The models and software created are designed to make better use of mismatched sparse and dense, temporal and spatial data. The underlying methods could potentially be used in many geographical disciplines and stretch beyond health.

What do you see as being the most pressing challenge facing data scientists in the modern age and how will the LIDA Internship Programme help you personally with this challenge?

I believe that one of the most challenging parts of being a data scientist is keeping up with a rapidly changing and constantly evolving field. In this world, techniques are constantly being refined and bettered, and there is always a new and interesting method to try out. However, this can be somewhat daunting for a young data scientist as you begin to realise that in the world of data things are a lot more dynamic than you’ve been used to in your studies. One thing that has helped me keep up with the pace of data science is the great community LIDA has created. There are always opportunities to present your work and receive feedback from experts. If you are confused by something, there will be someone close by who has done something similar and who will be happy to sit down with you and discuss the methods. This shared community of knowledge is a great asset to a young data scientist.

What new software/programming language has your LIDA internship introduced you to and how has this been a benefit?

In my previous experience I have mainly used Python to do all my work. However, in this internship I have been able to explore other languages such as R. I have enjoyed using R as my main language for statistical analysis. The R community is great and full of statistical experts. Because of this, there is a vast number of R packages available that cover nearly everything a data scientist needs. Also, many academic papers release their methods as R packages, so if you find a paper interesting you can download the package and try it out straight away.

What challenges have you encountered with data and how have you overcome them?

One challenge I have had with data sets is dealing with multiple spatial data sets with different geographies. For example, exercise data in Leeds is collected by postcode district, whereas population data is collected by LSOA (Lower Layer Super Output Area, a separate geography defined by the Office for National Statistics). Matching these two can be difficult as there are overlapping sections. The best method of dealing with this is to try to understand the data as best you can, map it, see where both geographies lie, and let this inform your solution. Sometimes a solution is obvious. In this case, many overlapping segments had extremely similar demographics and were split pretty evenly, so the best course of action was to split the overlaps between the two districts.
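
As a rough illustration of the even-split idea described above, here is a minimal Python sketch. It is not the code used in the project: the area names, populations and the `allocate_population` helper are all made up for the example, and it assumes that an LSOA overlapping several postcode districts can simply share its population equally between them.

```python
from collections import defaultdict


def allocate_population(lsoa_population, lsoa_to_districts):
    """Share each LSOA's population equally among the districts it overlaps.

    Both arguments are hypothetical inputs for this sketch:
    - lsoa_population: dict mapping an LSOA name to its population
    - lsoa_to_districts: dict mapping an LSOA name to the list of
      postcode districts it overlaps
    """
    totals = defaultdict(float)
    for lsoa, population in lsoa_population.items():
        districts = lsoa_to_districts[lsoa]
        # Equal split: assumes the overlap is roughly even and the
        # demographics are similar on both sides of the boundary.
        share = population / len(districts)
        for district in districts:
            totals[district] += share
    return dict(totals)


# Made-up example data: LSOA_B straddles two postcode districts.
lsoa_population = {"LSOA_A": 1500, "LSOA_B": 1600, "LSOA_C": 1400}
lsoa_to_districts = {
    "LSOA_A": ["LS1"],
    "LSOA_B": ["LS1", "LS2"],
    "LSOA_C": ["LS2"],
}

district_population = allocate_population(lsoa_population, lsoa_to_districts)
print(district_population)
```

An equal split keeps the total population unchanged, which is an easy sanity check. Where overlaps are uneven, weighting each share by overlapping area or by address counts would be a natural refinement.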

Are there any fields of research or industry that you would like to apply and develop your skills in, after your LIDA internship finishes?

An interesting field that I may want to try my hand at is natural language processing (NLP). It would be interesting to learn the ways in which documents can be converted into vector spaces for analysis. Talking to experts in this field, you quickly realise the many quirks of human language and the challenge of producing a program that understands things such as sarcasm. There are many different applications of this, for example processing medical documents from patients to group symptoms that could indicate a risk of a specific disease.
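
One of the simplest ways of converting documents into vectors is a bag-of-words count, where each document becomes a list of word counts over a shared vocabulary. The sketch below is only an illustration under that assumption: the `bag_of_words` helper and the example sentences are made up, and real NLP pipelines use far more careful tokenisation and weighting.

```python
from collections import Counter


def bag_of_words(documents):
    """Turn each document into a vector of word counts.

    Tokenisation here is a naive lowercase split on whitespace,
    which is an assumption made to keep the sketch short.
    """
    tokenised = [doc.lower().split() for doc in documents]
    # The vocabulary fixes the meaning of each vector position.
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Made-up example documents.
documents = ["chest pain and cough", "cough and fever"]
vocabulary, vectors = bag_of_words(documents)
print(vocabulary)
print(vectors)
```

Once documents are vectors of the same length, they can be compared or clustered with standard numerical methods, which is what makes grouping similar symptom descriptions possible.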

What advice would you give to someone who is thinking of applying to the LIDA Internship Programme?

Something I wish I had done before starting this internship was to brush up on my advanced statistics. Coming from a physics background, I had been exposed to a high level of statistics, but the notation was different and some of the rigour was ignored. Statistics is at the core of all data science, so having a good understanding of the mathematics that underpins it will help you realise what is and is not possible with the data you have. Furthermore, it will aid in implementing new techniques and interpreting results.

Find out more about the LIDA Data Scientist Internship Programme here.