QuantiCode was funded under the EPSRC Making Sense of Data call to develop data mining and visualization techniques that transform people’s ability to analyse quantitative and coded longitudinal data. Such data are common in many sectors. For example, health data are classified using a hierarchy of hundreds of thousands of Read Codes (a thesaurus of clinical terms), with analysts needing to provide business intelligence for clinical commissioning decisions, and researchers tackling challenges such as modelling disease risk stratification. Retailers such as Sainsbury’s sell 50,000+ types of products, and want to combine data from purchasing, demographic and other sources to understand behavioural phenomena such as the convenience culture, to guide investment and reduce waste.
We adopted a co-design approach to our research, with external partners that included Leeds City Council, NHS Digital, Sainsbury’s, Bradford Institute of Health Research, and Consumerdata Ltd. They provided datasets that were central to the design and evaluation of QuantiCode’s novel data mining and visualization techniques, contributed domain expertise, and worked with us in more than 50 meetings and workshops over the project’s duration to define exemplar application scenarios, provide feedback and help to refine the techniques that we developed.
The project was highly multi-disciplinary, making contributions in diverse areas as summarised below.
Temporal pattern mining
The widespread collection of data in social, health and consumer contexts has contributed to the availability of complex temporal datasets. Data instances collected in these datasets are characterized by a variable number of irregularly spaced timestamped events that include information about continuous activities and instantaneous events (e.g. Electronic Health Records, machine log files, credit/debit card use). Because the records vary in length, such datasets cannot be analysed using classical statistical, machine learning and data mining methods. Temporal pattern mining is a popular but resource-intensive method for extracting information from such datasets. In many datasets, particularly those in which information is input manually, timestamps often contain errors (for example, a patient wrongly recorded as having a post-surgery check before the actual operation), which invalidates analysis based on standard temporal mining methods. QuantiCode’s researchers designed fast parallelisable algorithms that are robust to timestamp recording errors6,7 and developed their multi-core implementations5,6.
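The idea of tolerating timestamp errors when matching a temporal pattern can be illustrated with a minimal sketch. It is not the QuantiCode algorithm described in the cited papers: the record format, the event names and the `slack` parameter are assumptions made for illustration, and the search is a simple backtracking one that only suits short records.

```python
from typing import List, Sequence, Tuple

# Hypothetical record format: a list of (event code, timestamp in days).
Event = Tuple[str, float]


def supports(record: Sequence[Event], pattern: Sequence[str], slack: float = 0.0) -> bool:
    """True if the pattern's events can be matched in order within the record.

    A later pattern event may be recorded up to `slack` time units before
    the previous one, which models timestamp recording errors.
    """
    def extend(step: int, prev_time: float, used: frozenset) -> bool:
        if step == len(pattern):
            return True
        for i, (code, t) in enumerate(record):
            if i in used or code != pattern[step]:
                continue
            if t >= prev_time - slack and extend(step + 1, t, used | {i}):
                return True
        return False

    return extend(0, float("-inf"), frozenset())


def support(records: List[Sequence[Event]], pattern: Sequence[str], slack: float = 0.0) -> float:
    """Fraction of records that support the pattern."""
    if not records:
        return 0.0
    return sum(supports(r, pattern, slack) for r in records) / len(records)


records = [
    [("surgery", 10.0), ("check", 12.0)],  # recorded in the expected order
    [("surgery", 10.0), ("check", 9.0)],   # check wrongly recorded a day early
    [("check", 3.0)],                      # no surgery at all
]
pattern = ["surgery", "check"]

strict = support(records, pattern, slack=0.0)
tolerant = support(records, pattern, slack=2.0)
print(strict, tolerant)
```

With no slack only the first record supports the pattern; allowing two days of slack also recovers the second record, whose check-up was recorded before the operation.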
QuantiCode’s tool has been evaluated on publicly available datasets5,6 and on an adult social care dataset6,7. Interestingly, the presence of timestamps was identified as a threat to the anonymisation process, as knowledge of only a few timestamped events was shown to be sufficient to uniquely identify individuals7. This can be alleviated by artificially introducing noise to timestamps before sharing the datasets. Information can then be extracted using our robust temporal pattern mining tool.
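As a rough illustration of introducing noise before sharing, the sketch below shifts each timestamp by a random amount. The uniform noise distribution, the `max_shift` parameter and the event names are assumptions; the source does not specify how the noise should be generated, and noise of this kind does not by itself guarantee anonymity.

```python
import random
from typing import List, Tuple

# Hypothetical record format: a list of (event code, timestamp in days).
Event = Tuple[str, float]


def jitter_timestamps(record: List[Event], max_shift: float, rng: random.Random) -> List[Event]:
    """Return a copy of the record with each timestamp shifted by a random
    amount drawn uniformly from [-max_shift, +max_shift]."""
    return [(code, t + rng.uniform(-max_shift, max_shift)) for code, t in record]


rng = random.Random(42)  # seeded only to make the example repeatable
original = [("assessment", 100.0), ("home_care_start", 130.0), ("review", 220.0)]
shared = jitter_timestamps(original, max_shift=3.0, rng=rng)
print(shared)
```

Events that lie closer together than the chosen shift may swap order after jittering, which is why the downstream mining method needs to be robust to timestamp errors.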
Automated machine learning
Training of statistical models depends on a large number of metaparameters, which determine not only the accuracy of the resulting model but also the running time. The parameters in question could be the maximum depth of a tree in a random forest or the size of a training sample drawn from a large dataset. QuantiCode’s research focussed on finding an optimal choice of metaparameters for a given dataset and modelling objective, given that the relationship between the metaparameters and the accuracy and running time is not known in advance but must be learnt. This learning and metaparameter selection must be accomplished in a very limited time, a requirement imposed by the demands of researchers’ interactive work with the data. QuantiCode’s researchers developed theoretical foundations and a Python tool that can be applied for metaparameter selection in any user-specified modelling framework.
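To make the setting concrete, here is a minimal sketch of choosing metaparameters under a time budget. It uses a plain budgeted search over a fixed list of candidates, not the learning-based method or the Python tool developed by QuantiCode. The function names, the candidate grid and the toy scoring function are all hypothetical; in practice `evaluate` would train and validate a real model.

```python
import itertools
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

Config = Dict[str, int]


def select_metaparameters(
    candidates: Iterable[Config],
    evaluate: Callable[[Config], float],
    budget_seconds: float,
) -> Tuple[Optional[Config], float]:
    """Evaluate candidates in order until the time budget is used up and
    return the best configuration seen, together with its score."""
    start = time.perf_counter()
    best_config, best_score = None, float("-inf")
    for config in candidates:
        if time.perf_counter() - start >= budget_seconds:
            break  # out of time: keep the best result found so far
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_score(config: Config) -> float:
    """Stand-in for model accuracy (an assumption): best at depth 5,
    improving with larger training samples."""
    return 1.0 - abs(config["max_depth"] - 5) / 10 - config["sample_size"] ** -0.5


grid = [
    {"max_depth": depth, "sample_size": size}
    for depth, size in itertools.product([2, 5, 10], [100, 1000])
]
best, score = select_metaparameters(grid, toy_score, budget_seconds=5.0)
print(best, round(score, 3))
```

A fixed search order like this wastes budget on poor candidates; learning how accuracy and running time depend on the metaparameters, as the project did, is what allows the budget to be spent on promising configurations.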
QuantiCode’s visualization research has focussed on techniques for investigating data quality, which is very under-researched. Our contributions lie in three areas.
First, we developed a highly-scalable set visualization tool for investigating patterns of missing values. The tool has been evaluated with large electronic health record (EHR) datasets (for details, see “Medical informatics” below).
Second, we demonstrated the manifold benefits of combining sparklines (a matrix of hundreds of miniature visualizations) with perceptual discontinuity (e.g., giving bars a minimum height) and semantic encoding (using multiple mark types to explicitly encode the differences between certain important values). Examples from an evaluation with primary and secondary care datasets1 included the discovery of:
- Padding of fields to a fixed length, which obscured wide variation between data tables in the coding precision of diagnoses.
- Unexpected longitudinal patterns of missing values in hospital data.
- Inconsistent encryption of patient IDs, which prevented records from one data extract from being linked with others.
- Widespread use of invalid characters in the values for patients’ diagnosis codes.
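The two design ideas named above, perceptual discontinuity and semantic encoding, can be sketched in a few lines. This is an illustrative text rendering only, not the project's visualization tool: the function names, the choice of marks ("·" for zero, "?" for missing) and the eight-level bar scale are assumptions.

```python
from typing import List, Optional

BLOCKS = "▁▂▃▄▅▆▇█"  # bar glyphs for heights 1..8


def bar_height(value: float, max_value: float, max_height: int = 8, min_height: int = 1) -> int:
    """Scale a non-negative value to a bar height. Any non-zero value gets
    at least `min_height`, so small counts never vanish (perceptual
    discontinuity between "none" and "a few")."""
    if value <= 0:
        return 0
    scaled = round(value / max_value * max_height)
    return max(min_height, min(max_height, scaled))


def sparkline(values: List[Optional[float]]) -> str:
    """Render values as a text sparkline. Zero and missing values get their
    own mark types (semantic encoding) instead of an empty bar."""
    present = [v for v in values if v is not None]
    if any(v < 0 for v in present):
        raise ValueError("this sketch assumes non-negative values such as counts")
    top = max(present, default=0)
    marks = []
    for v in values:
        if v is None:
            marks.append("?")  # value not available
        elif v == 0:
            marks.append("·")  # explicit zero
        else:
            marks.append(BLOCKS[bar_height(v, top) - 1])
    return "".join(marks)


line = sparkline([0, 3, 500_000, 1_000_000, None])
print(line)
```

Without the minimum height, the count of 3 would round to an empty bar next to a count of one million and be indistinguishable from zero.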
Third, we have produced a 22-minute film about Visualizing the Quality of Data. The film is freely available on YouTube (https://tinyurl.com/VizDataQuality) and uses case studies from healthcare and retail to explain: (a) data completeness and correctness, (b) how we can visualize them, and (c) why they matter.
Visual analytics combines the approaches of automated data mining (including machine learning) and interactive visualization to address problems that cannot be solved by either approach on its own. Visual analytics achieves this by leveraging human beings’ unparalleled ability for detecting and reasoning about patterns in conjunction with the computational power of high-performance computing and data mining.
QuantiCode’s contributions are in two areas. First, we have investigated how multiple event mining methods may be combined to produce a step-change in the complexity of data that may be analysed. That research has involved application domains that range from cybersecurity to retail analytics2.
Second, we have developed a proof-of-concept tool for designing random forest models. This has been applied to: (a) predicting adult social care needs, and (b) understanding losses in diagnostic persistence for diseases such as autism and dementia from EHRs.
QuantiCode’s initial work on the ethics of data science began with introductory training workshops, developed for delivery to data scientists, that set out some initial principles for the ethics and governance of data science research.
This enabled the inter-disciplinary team to work together to produce a 25-page QuantiCode Data Governance and Ethics report and an associated presentation. The report presented a framework to enable users to understand the ethical issues involved in this and similar projects. While ethical issues surround the collection, processing, storage and use of data, the nature of QuantiCode was such that, at this stage, we focused primarily on processing and use, because QuantiCode tools would analyse data that had already been collected and stored.
To share our innovations in ethics and governance, in June 2017 we held an inter-disciplinary and multi-disciplinary two-day Ethics and Governance workshop. This brought together leaders in the field such as Onora O’Neill (Philosophy) and Andrew Dyson (Global lead partner in data law and governance at DLA Piper), the Information Commissioner’s Office, members of the QuantiCode team (Prof Justin Keen, Dr Kevin Macnish, Dr Anna , Prof Roy Ruddle), and industry partners (NHS England, Sainsbury’s, Born in Bradford).
In due course this led to a multi-authored review paper, “Machine learning and cybernetic governance: a health and social care case study” (lead author: Prof Justin Keen).
In stage two of the work programme, QuantiCode’s next contribution was to develop two suites of training materials: one on ethics for data scientists and one on ethics for data protection officers (DPOs). These materials can be used in face-to-face and online teaching and training.
Medical informatics
NHS Digital has a mature, well-documented process for cleaning EHR data. The process includes duplicate removal, the application of correction and validation rules, and the provision of feedback to hospitals and other healthcare providers. Our collaboration with NHS Digital focused on admitted patient care (APC) data, which is a type of EHR. NHS patients generate tens of millions of APC records every year, and each record contains several hundred fields. Missing data is automatically validated for approximately half of the fields, and replaced by special codes (e.g., 01/01/1800 for dates). The quantity of missing values is checked for a handful of the fields (typically fewer than 10), and a data quality issue is opened with a provider if the percentage of missing values exceeds a threshold (e.g., 30%).
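The threshold check described above can be sketched as follows. This is an illustration only and not NHS Digital's process: the field names and sample records are hypothetical, and only the 01/01/1800 date code and the 30% threshold come from the text.

```python
from typing import Dict, Iterable, List, Optional, Set

Record = Dict[str, Optional[str]]

# Special code from the text that marks a missing date.
MISSING_CODES: Set[str] = {"01/01/1800"}


def missing_percentages(records: List[Record], fields: Iterable[str]) -> Dict[str, float]:
    """Percentage of records in which each field is absent or holds a
    special 'missing' code."""
    total = len(records)
    result = {}
    for field in fields:
        missing = sum(
            1 for r in records
            if r.get(field) is None or r.get(field) in MISSING_CODES
        )
        result[field] = 100.0 * missing / total if total else 0.0
    return result


def fields_over_threshold(percentages: Dict[str, float], threshold: float = 30.0) -> List[str]:
    """Fields whose missing-value percentage exceeds the threshold."""
    return sorted(f for f, p in percentages.items() if p > threshold)


# Hypothetical field names and values, for illustration only.
sample = [
    {"admission_date": "03/02/2017", "discharge_date": "05/02/2017"},
    {"admission_date": "04/02/2017", "discharge_date": "01/01/1800"},
    {"admission_date": "01/01/1800", "discharge_date": "01/01/1800"},
    {"admission_date": "10/02/2017", "discharge_date": None},
]
percentages = missing_percentages(sample, ["admission_date", "discharge_date"])
flagged = fields_over_threshold(percentages)
print(percentages, flagged)
```

A per-field percentage like this shows how much is missing but not which fields are missing together, which is the gap that set visualization addresses.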
QuantiCode developed a novel and highly-scalable set visualization tool for investigating patterns of missing values. It has been applied to APC datasets that each contained 15 – 20 million records and approximately 1 billion missing values, to reveal several important and previously unknown patterns. Those patterns include:
- Gaps in the 24 fields that record a patient’s operations, which have implications for the NHS Payment by Results system.
- Inconsistencies between operation fields and the corresponding 24 date fields, which affected millions of records.
- Gaps in the 20 fields that record a patient’s diagnoses, which affect the data cleaning methods used by epidemiologists. The origin of these gaps was traced to a particular unit in a specific healthcare provider, allowing feedback to be given so that the problem could be rectified.
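Patterns such as those listed above are combinations of fields that are missing together. A minimal sketch of the underlying counting step is shown below. It assumes that a pattern is the exact set of fields missing in a record; the field names and sample data are hypothetical, and the project's tool adds scalable interactive visualization on top of counts of this kind.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Optional

Record = Dict[str, Optional[str]]


def missing_patterns(records: Iterable[Record], fields: List[str]) -> Counter:
    """Count records by the exact set of fields that are missing in them."""
    return Counter(
        frozenset(f for f in fields if r.get(f) is None)
        for r in records
    )


# Hypothetical field names: two operation codes and their date fields.
fields = ["op_1", "op_date_1", "op_2", "op_date_2"]
sample = [
    {"op_1": "A01", "op_date_1": "03/02/2017", "op_2": "B02", "op_date_2": "04/02/2017"},
    {"op_1": "A01", "op_date_1": "03/02/2017", "op_2": None, "op_date_2": None},
    {"op_1": "C03", "op_date_1": "09/02/2017", "op_2": None, "op_date_2": None},
    {"op_1": "A01", "op_date_1": None, "op_2": None, "op_date_2": None},
]
patterns = missing_patterns(sample, fields)
for pattern, count in patterns.most_common():
    print(sorted(pattern), count)
```

In this toy data the last record has an operation code without its date, the kind of inconsistency between operation and date fields that a per-field summary would not reveal.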
The project also developed a simple method for investigating integrity errors in maternity records about the location of a baby’s delivery, allowing feedback to be given to specific healthcare providers. Finally, through the above research we identified an error in the APC Data Dictionary for OPERTN_nn fields.
In retailing, a customer mission is the customer’s purpose for making a given set of purchases. Surveys show that most retail transactions fall into one of a small number of missions, with examples from food retailers being “eat now” and “food for a couple of days.” Retailers typically classify missions by using statistical models that are based on clustering and include factors such as the number and type of products, time of day and cost. However, the models fall well short of 100% accuracy.
QuantiCode investigated how set-based visualization techniques may be used to analyse customer missions3, and developed that into a visual analytics workflow that combined visualization with exclusive set intersection and high utility itemset mining techniques4. We have also produced a detailed report about productionising the workflow.
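High utility itemset mining looks for combinations of products whose combined value across all transactions reaches a threshold, rather than combinations that are merely frequent. The sketch below uses naive enumeration to show the definition. It is not the workflow's implementation, practical miners prune the search space far more aggressively, and the products, profits (in pence) and threshold are invented for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Transaction = Dict[str, int]  # product -> quantity purchased


def high_utility_itemsets(
    transactions: List[Transaction],
    unit_profit: Dict[str, int],
    min_utility: int,
    max_size: int = 3,
) -> Dict[Tuple[str, ...], int]:
    """Return every itemset (up to max_size products) whose total utility
    is at least min_utility.

    An itemset's utility in one transaction is the sum of quantity times
    unit profit over its products, counted only if the transaction
    contains all of them.
    """
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            total = sum(
                sum(t[i] * unit_profit[i] for i in itemset)
                for t in transactions
                if all(i in t for i in itemset)
            )
            if total >= min_utility:
                result[itemset] = total
    return result


# Invented baskets and profits, for illustration only.
baskets = [
    {"sandwich": 1, "drink": 1},
    {"sandwich": 1, "drink": 1, "crisps": 1},
    {"milk": 2, "bread": 1},
    {"drink": 2},
]
profit = {"sandwich": 150, "drink": 50, "crisps": 25, "milk": 25, "bread": 50}
found = high_utility_itemsets(baskets, profit, min_utility=250)
print(found)
```

Here the sandwich-and-drink pair has a higher utility than either product alone, even though drinks appear in more baskets than sandwiches.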
An ageing population, often living with complex conditions, means that social and health care needs have risen in the past few decades. Although health care has taken a prominent position in research and popular media, adult social care is equally important to ensure the wellbeing and continued independence of people and to reduce costly NHS provision.
QuantiCode investigated how robust temporal mining, model interpretation and visualization techniques may be used to identify people at risk of losing independence and moving to residential care8. The results sparked interest from councils and led to a further 12-month impact project that aims to perfect those methods and extend them to recommend effective individual pathways of care.
Prof Roy Ruddle (PI; School of Computing), Prof Mark Birkin (School of Geography), Dr Jan Palczewski & Dr Georgios Aivaliotis (School of Maths), Prof Sir Alex Markham (Leeds Institute of Biomedical and Clinical Sciences), Prof Justin Keen (Leeds Institute of Health Sciences), and Prof Chris Megone and Dr Kevin Macnish (Inter-Disciplinary Ethics Applied Centre).
The QuantiCode project (2016 – 2020) was funded by the EPSRC (EP/N013980/1; EP/K503836/1) and supported by the MRC (MR/L01629X/1) and the ESRC (ES/L011891/1).
For any enquiries, please contact: Prof. Roy Ruddle (PI), firstname.lastname@example.org
1 Ruddle, R.A. and Hall, M.S., 2018, November. Using Miniature Visualizations of Descriptive Statistics to Investigate the Quality of Electronic Health Records. In Proceedings of the 12th International Conference on Health Informatics. SciTePress.
2 Adnan, M., Nguyen, P.H., Ruddle, R.A. and Turkay, C., 2019. Visual analytics of event data using multiple mining methods. In EuroVis Workshop on Visual Analytics (EuroVA) 2019 (pp. 61-65). The Eurographics Association.
3 Adnan, M. and Ruddle, R.A., 2018. A set-based visual analytics approach to analyze retail data. In Proceedings of the EuroVis Workshop on Visual Analytics (EuroVA18). The Eurographics Association.
4 Kocanova, I., Adnan, M., Aivaliotis, G. and Ruddle, R.A., 2019. Visual analytics workflow for investigating customers’ transactions in convenience stores. Poster at IEEE Visualization Conference (VIS 2019).
5 Titarenko, S., Titarenko, V., Aivaliotis, G. and Palczewski, J., 2019. A constraint-based frequent pattern mining algorithm and its optimisation for multicore systems. In Proceedings of the 2019 Emerging Technology Conference (pp. 58-61). ISBN 978-0-9933426-4-6.
6 Titarenko, S., Titarenko, V., Aivaliotis, G. and Palczewski, J., 2019. Fast implementation of pattern mining algorithms with time stamp uncertainties and temporal constraints. Journal of Big Data, 6:37. DOI: 10.1186/s40537-019-0200-9.
7 Palczewska, A., Palczewski, J., Kowalik, L. and Aivaliotis, G., 2017. RobustSPAM for inference from noisy longitudinal data and preservation of privacy. In IEEE 16th International Conference on Machine Learning and Applications.
8 Palczewska, A., Palczewski, 2019. Risk stratification for ASC services. Report for Leeds City Council.