Data Study Group in Partnership with Newcastle University and The Alan Turing Institute
- Date
- Monday 13 March – Friday 24 March 2023
A co-hosted Turing event between Leeds Institute for Data Analytics (LIDA) and Newcastle University
An intensive ‘collaborative hackathon’, this event will bring together organisations from industry, government and the third sector with talented multidisciplinary researchers from academia. Organisations act as Data Study Group ‘Challenge Owners’, providing real-world problems and datasets to be tackled by small groups of highly talented, carefully selected researchers. Researchers brainstorm and engineer data science solutions, presenting their work at the end of the week.
Participants are typically PhD students and early-career academics from statistics, computer science, engineering, mathematics and computational social science, as well as wider disciplines where data science and AI skills are increasingly relevant. You will need to be able to commit to an intensive five days at The Catalyst in Newcastle city centre.
Closing date: 17 February
Confirmation of place: 24 February
Week 1: The precursor stage (part-time / online)
- The precursor stage takes place online during the week before the event stage (13th – 17th March 2023)
- Online workshops, presentations and team-building to prepare for the ‘event stage’.
Week 2: The event stage (full-time / in person)
This in-person event will take place at The Catalyst, 3 Science Square, Newcastle Helix, Newcastle, NE4 5TG
- The ‘event stage’ will run over one week (20th March – 24th March)
- Participants are expected to spend around 9 hours per day on Tuesday, Wednesday and Thursday working on the challenges – please note that it is not uncommon for participants to work 12-hour days during the week should they wish.
- The event will finish by 3pm on Friday 24th March.
The Challenges
Volcano Deformation from Space (LIDA)
It is estimated that one in ten people on Earth live within 100 km of a volcano that has the potential to erupt. This creates a pressing requirement for systematic monitoring of subaerial volcanoes, yet the majority remain unmonitored. With the advent of the European Space Agency’s Sentinel-1 constellation, however, free and open synthetic aperture radar (SAR) data are now acquired for most of these volcanoes, providing time series of ground-deformation measurements that can be used for volcano monitoring. With ~1,500 volcanoes with the potential to erupt, two satellite look angles for each volcano, and time series updated every six or twelve days, the search for deformation signals requires automation.
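To illustrate the kind of automation involved (this is a toy sketch, not part of the challenge materials – the 12-day sampling, 10 mm/yr threshold and all data are invented for illustration), a deformation signal in a Sentinel-1-like time series can be flagged by fitting a linear rate:

```python
import numpy as np

def deformation_rate(days, displacement_mm):
    """Least-squares linear rate (mm/yr) fitted to an InSAR time series."""
    slope, _intercept = np.polyfit(days, displacement_mm, 1)  # mm per day
    return slope * 365.25

# Synthetic 12-day time series: steady subsidence of ~20 mm/yr plus noise
rng = np.random.default_rng(0)
days = np.arange(0, 720, 12)
displacement = -20.0 / 365.25 * days + rng.normal(0, 1.5, days.size)

rate = deformation_rate(days, displacement)
flagged = abs(rate) > 10.0  # flag volcanoes deforming faster than 10 mm/yr
print(f"rate = {rate:.1f} mm/yr, flagged = {flagged}")
```

Real monitoring pipelines must cope with atmospheric noise, seasonal signals and non-linear deformation, which is precisely where the challenge’s machine learning comes in.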
Detecting and Locating Earthquakes with Machine Learning (LIDA)
Detecting the occurrence of earthquakes and finding out where they happened is needed in all sorts of settings, ranging from eruption monitoring at volcanoes to tracking induced earthquakes below geothermal power plants. In many cases where monitoring of seismicity (occurrence of earthquakes) is mandated by regulation, such as where hydraulic fracturing for shale gas takes place, millions of small-magnitude earthquakes may occur over a few hours.
Existing techniques to automatically detect and locate earthquakes are still relatively computationally expensive, require manual intervention, or crucially rely on uncertain prior information about the structure of the ground through which the seismic waves pass. Recently, several algorithms to automatically classify seismic recordings of earthquakes have been developed; these are then combined with traditional methods that locate the events essentially by triangulation of the wave arrival times, and sometimes the earthquake–station distance. However, these approaches make use of just a single recording at a time, and do not use the relative positions of the recordings at all. They also still rely on being able to accurately predict how long energy takes to travel from the earthquake to each station, which requires accurate knowledge of the structure of the ground.
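The dependence on the assumed ground structure can be made concrete with a minimal sketch of classical arrival-time location (all station positions, the event and the velocities are invented; real methods use 3-D velocity models and probabilistic inversion rather than this brute-force grid search):

```python
import numpy as np

# Hypothetical station positions (km) and a "true" event for illustration
stations = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_event, v_true = np.array([3.0, 7.0]), 5.0  # epicentre (km), velocity (km/s)
arrivals = np.linalg.norm(stations - true_event, axis=1) / v_true  # seconds

def locate(arrivals, stations, v):
    """Grid-search the epicentre minimising arrival-time misfit.

    Predicted travel time = distance / v, so the answer depends directly
    on the assumed seismic velocity v; the unknown origin time is removed
    by demeaning the residuals."""
    best, best_misfit = None, np.inf
    for x in np.linspace(0, 10, 201):
        for y in np.linspace(0, 10, 201):
            t_pred = np.linalg.norm(stations - [x, y], axis=1) / v
            res = arrivals - t_pred
            misfit = np.sum((res - res.mean()) ** 2)
            if misfit < best_misfit:
                best, best_misfit = (x, y), misfit
    return np.array(best)

loc_good = locate(arrivals, stations, v=5.0)  # correct velocity model
loc_bad = locate(arrivals, stations, v=6.0)   # 20% velocity error biases the location
```

With the correct velocity the grid search recovers the true epicentre; with a 20% velocity error the located event is visibly displaced, illustrating why a learned, data-driven approach that avoids explicit travel-time prediction is attractive.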
The challenge in this hackathon is to obtain the occurrence and location of earthquakes directly from the data, without manual steps or the use of separate earthquake location algorithms. Meeting this challenge would mark a huge improvement in our ability to monitor earthquakes and manage the risks they pose.
Exploring Multimorbidity and Patterns of Long-term Conditions in England using Open Prescription and Primary Care Data
The prevalence of multimorbidity (the presence of ≥ 2 long-term health conditions) is increasing because of population ageing and improved medical care. Multimorbidity is associated with a range of adverse health states and outcomes, including polypharmacy, frailty, dependency, poor quality of life, and premature mortality. As such, developing a greater understanding of multimorbidity and how it may be treated is a major strategic priority. Although no concrete definition of polypharmacy exists, it is commonly operationalised as “the routine use of five or more medications”. Polypharmacy and multimorbidity are overlapping concepts; the greater the number of diagnoses a person receives, the more medications they are likely to be prescribed. As such, the prescribing patterns at the population level can tell us a great deal about the multimorbidity state of the underlying population.
Using open prescribing data available at the General Practice (GP) level, this Data Study Group (DSG) challenge aims to use the prescribing patterns of common medications used in the treatment of long-term conditions (LTCs) as a proxy to explore multimorbidity in England. The specific objectives of this challenge include:
- To use the English Prescribing Dataset (EPD) to identify the prescription rates of key medications used in the treatment of LTCs, and model these as a proxy for multimorbidity
- To use these data to map multimorbidity in England at a suitable area level (e.g. wards, local authorities)
- To explore whether changes in prescribing patterns may be used to track trends in multimorbidity, at the population level
- To explore determinants of multimorbidity in England, based on linked GP-level and area-based data (e.g. Quality Outcome Framework [QoF], Public Health England [PHE], Office for National Statistics [ONS] data)
- To identify medication class combinations that co-occur more often than would be expected by chance (suggesting the enrichment of a specific multimorbidity phenotype in a region).
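As a toy illustration of the final objective (all counts are invented), enrichment of a pair of medication classes can be screened by comparing the observed co-prescription count with what statistical independence would predict:

```python
# Invented counts for one area: patients prescribed each medication class
n_patients = 10_000
n_class_a = 2_000  # e.g. antihypertensives
n_class_b = 1_500  # e.g. statins
n_both = 600       # patients prescribed both classes

# Under independence, P(both) = P(a) * P(b)
expected_both = n_class_a * n_class_b / n_patients
enrichment = n_both / expected_both

print(f"expected {expected_both:.0f}, observed {n_both}, ratio {enrichment:.2f}")
```

A ratio well above 1 (here 2.0) would be a candidate multimorbidity phenotype, though a real analysis would add a formal test (e.g. Fisher’s exact test) and correction for multiple comparisons across many class pairs.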
Applying Graph Neural Nets in the freshwater environment (The Rivers Trust)
Freshwater environments are among the most diverse and most threatened systems on the planet. A major challenge in the management of freshwater environments is their highly interconnected and dynamic nature. Individual pressures can have wide-ranging impacts at multiple scales within the river environment, both at the point of impact and downstream. To ensure the sustainable management of these environments, a clear statistical understanding of the spatial extent of pressures is required.
This requires developing models that embrace the interconnected nature of rivers. Graph Neural Nets (GNNs) have become a popular avenue of research in machine learning, capable of handling complex spatial structures and uneven interactions between nodes, and can be extended to investigate how these dynamics vary over time. The recent development of high-density sensor networks in freshwater ecosystems allows for the exploration of these techniques in the freshwater environment.
This challenge will investigate the applicability of GNNs in the freshwater environment, investigating limiting factors such as the impact of dams and weirs on model accuracy, and the optimum number and distribution of sensors needed to generate an interpolated baseline for a catchment.
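To make the idea concrete, the aggregation step that GNN layers build on can be sketched in plain NumPy on a toy river network (the network, sensor readings and uniform averaging weights are all invented for illustration; a real model would learn weights with a library such as PyTorch Geometric):

```python
import numpy as np

# Toy river network: edges point downstream, node values are sensor
# readings (e.g. nitrate). Node 3 is unmonitored, so its value starts at 0.
#   0 ─┐
#       ├─> 2 ──> 3
#   1 ─┘
edges = [(0, 2), (1, 2), (2, 3)]
x = np.array([1.0, 3.0, 2.0, 0.0])  # invented readings

# Adjacency with self-loops, row-normalised: each node averages itself
# and its upstream neighbours — one message-passing (aggregation) step.
n = len(x)
A = np.eye(n)
for src, dst in edges:
    A[dst, src] = 1.0
A /= A.sum(axis=1, keepdims=True)

h = A @ x  # after one step, node 3 borrows information from node 2
print(h)
```

Stacking such steps lets information flow further along the network, which is what would allow an interpolated baseline at unmonitored points of a catchment.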
Exploring uncertainty in optimal operations of electrical systems (Northern Powergrid)
As electricity networks evolve to respond to the needs of the UK’s net zero commitment by 2050, new methods of managing and expanding our electricity system at the pace, cost and system efficiency required are being deployed throughout the country. Novel electrical engineering systems such as smart grids, economic ‘nudges’ such as time-of-use charging, environmental impact tracking such as carbon monitoring, and governance systems such as distributed system operation are being created to achieve these net zero ambitions.
Generative AI for Biofilm Analysis
Biofouling is the unwanted colonisation of immersed substrates, including ships’ hulls, by marine and aquatic organisms. Biofouling on ships’ hulls increases hull surface roughness, which in turn increases frictional resistance and ultimately the fuel consumption and total emissions of a ship. A biofilm is a type of biofouling: a slimy layer made of living microorganisms embedded in an extracellular polymeric matrix. Biofilms on ships are known to increase the drag penalty by up to 40%.
Surveys of ships around the world show that, even visually, not all marine biofilms are the same. Variance in biofilms, attributed to differences in both the composition and community, results in differences to the frictional drag.
Historically, measuring the surfaces of biofilms was a real challenge, but more recently biofilm imaging has opened up with the adoption of a technology called optical coherence tomography (OCT). OCT lets us image down through living biofilms over a surface area approximately the size of a penny. We have been using OCT imaging to get a closer look at the biofilms that grow on different coatings, and have developed methods to image biofilms in conjunction with measuring the drag penalty.
There are complexities in imaging biological samples, and thorough imaging can be quite time-consuming. We believe generative AI could help us build an image dataset covering a range of different biofilms that we would not otherwise be able to collect by imaging real-world biofilms alone.
As generative AI is a relatively new technology, we have not yet attempted to use it for this application, and we want to explore its potential.
Applications will be accepted through Flexigrant.