In September 2016, Leeds Institute for Data Analytics launched the Data Scientist Internship Programme as part of our commitment to developing data science capability and driving multidisciplinary data analytics research.
Alex Coleman is one of 11 interns currently part-way through their internships. We’ve asked him to write about his experience so far as a LIDA intern.
I applied to the Leeds Institute for Data Analytics (LIDA) Intern programme whilst I was finishing my PhD in Molecular Virology at the University of Leeds. My PhD project had been completely lab-based, looking into how a cancer-causing virus remodelled structures within our cells to aid its infection. It was a fascinating project, and it was great to work in a cutting-edge laboratory, but it highlighted to me the enormous volume of data that labs generate. As my PhD progressed, I felt that much of this data was never quite tapped into, and I became interested in big data and machine learning approaches that might extract more insight from this information.
I used online resources and books to learn R and Python and developed a rather rusty, cobbled-together understanding of data science. I took this a step further in my spare time by building an app in R (using the Shiny package) to help mobilise activists in local elections across Leeds (www.mynearestleedsmarginal.com). It had some modest success, and it impressed upon me the importance of developing tools that anyone can use.
Despite my biological background, I was (gratefully) thrown straight out of my comfort zone by getting a project that sought to apply computational methods to criminological problems. This project was based on the premise that whenever a crime is recorded, police officers also provide a narrative description of the event. This description often contains rich contextual information that, due to its unstructured nature, is typically underutilised by police and their crime reduction partners. Run in collaboration with Safer Leeds, my project aimed to explore how text mining and natural language processing techniques like topic modelling might identify different offence modus operandi from this narrative crime report data. With limited experience of both natural language processing and of policing and criminology, I dived wholeheartedly into the project.
I spent a long time reading up on natural language processing approaches in Python and explored the more cutting-edge topic modelling approaches that utilise graph theory. It’s involved a lot of experimentation in Jupyter Notebooks to identify suitable approaches, and some time waiting for the data to arrive, but I’m confident from our preliminary results that the approaches employed could be really useful to police forces and criminologists.
The approaches I’m developing have the potential to help police and government analysts better classify the means by which offenders commit commonly occurring crimes like burglary and vehicle theft, which in turn could inform real-world crime reduction efforts. Moving forwards, we’d like to build early warning systems for detecting newly emerging methods of offending, like mobile phone thefts by offenders travelling on mopeds. Without a doubt, it’s been a fantastic project to sink my teeth into, with tangible real-world impacts, and it has switched me on to the power of natural language techniques.
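To give a flavour of how a graph-based view of text can surface groups of related terms, here is a toy sketch in Python. It is not the method used in the project: the narratives are invented for illustration (they are not real police data), and the names NARRATIVES, STOPWORDS, cooccurrence_graph and connected_components are hypothetical. It simply links words that appear together in at least two narratives and treats each connected group of words as a crude "topic".

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented example narratives for illustration only (NOT real police data).
NARRATIVES = [
    "offender forced rear window and stole jewellery",
    "rear window forced jewellery taken from bedroom",
    "offender on moped snatched phone from victim",
    "phone snatched by moped rider",
]

# A tiny, hand-picked stopword list (an assumption for this toy example).
STOPWORDS = {"and", "from", "on", "by", "the", "a", "of"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def cooccurrence_graph(docs, min_count=2):
    """Link two words if they appear together in at least min_count documents."""
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(tokenize(doc)))
        pair_counts.update(combinations(words, 2))
    graph = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return each connected group of words as a sorted list."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


topics = connected_components(cooccurrence_graph(NARRATIVES))
print(topics)
```

On these four invented narratives, the words fall into one burglary-style group (forced, jewellery, rear, window) and one moped-snatch group (moped, phone, snatched). Real topic models are far more sophisticated, but the intuition of grouping terms that tend to occur together is similar.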
At the start of my internship, I said I hoped to improve my coding skills, my understanding of machine learning techniques, and my ability to take a machine learning model and develop it into a minimum viable product. I certainly still hope to get those things out of my internship (and have definitely made progress on all counts), but I also now hope to expand my knowledge base in different subjects (like criminology) and contribute to the growing environment of LIDA. I’ve seen from my short time here the fantastic work that goes on in LIDA in all sorts of disciplines, and I honestly believe that it has the potential to rival the Turing Institute as the best place to do data science and analytics in the UK.
What’s more, there’s a great community here of PhDs, interns, academics, post-docs, and support staff. We aren’t just performing cutting-edge research but also building a broad-based community that talks about data science in the real world and how it can improve people’s lives, whether that’s in the office whilst we’re making our tea or coffee, in a booth during a meeting, or even at the pub during the weekly LIDA Pub Thursday events. So, I hope that, as well as getting things out of my internship, I’m able to give back and help contribute to the institute.
I couldn’t do a blog post about my internship without talking about the intern team, because seriously, it’s a super team. We’ve got a fantastic array of talent, including people with master’s degrees in GIS, a PhD in mathematics, and much more. We almost all (sorry Ivy!) sit together in the office, so we share the ups and downs of the week.
There’s always someone at hand who knows how to configure your VirtualBox, simplify your shapefile or point out your missing bracket when your code won’t run. It’s been great working side by side with people from such different disciplines, tapping into their experience and knowledge, and learning about the projects they’re working on. We keep things fun too with a weekly cake club, LIDA Pub Thursdays and lunch in the LIDA kitchen, where amusement is never in short supply. It’s a great working environment to be part of and I’ll be sorry to say goodbye when the year is over.
I’m unsure of where I’ll go next. I have really enjoyed my work on the computational criminology project and am keen for my next project to be within the same discipline. After my internship, I’d be interested in continuing to do research (particularly in the field of crime science) utilising these computational techniques, but I would also look for any opportunities to apply what I’ve learned on a project that looks to make a big real-world impact.