Missing data is the most common data quality issue in electronic health records (EHRs). Missing data checks implemented in common analytical software are typically limited to counting the number of missing values in individual fields, but researchers and organisations also need to understand multifield missing data patterns to better inform advanced missing data strategies for which counts or numerical summaries are poorly suited. This study shows how set-based visualisation enables multifield missing data patterns to be discovered and investigated.

The research was conducted by a cross-disciplinary team of computer scientists and health epidemiologists from the Leeds Institute for Data Analytics (LIDA) at the University of Leeds, with help from staff at NHS Digital. The team was led by LIDA Director of Research Technology and Alan Turing Institute fellow, Professor Roy Ruddle.

The researchers developed novel set visualisation software called ACE, which combines interactive visualisation and data mining to enable users to find and explain patterns of missing values, irrespective of whether those patterns are widespread or rare.

ACE was applied to an anonymised admitted patient care dataset of 16 million records from NHS hospitals. According to a state-of-the-art review published this year[1], that dataset is larger than any used in previous research that aimed to develop novel health record visualisation methods.

At present, researchers and data scientists typically only check individual fields for the number of missing values, and then remove records with missing values, drop variables or impute values. ACE’s bar charts, histograms and heatmaps transform our capability to investigate multifield missing data patterns. The importance of that capability is shown by the large number of fields (up to 75) that were often involved in the patterns.
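The difference between per-field counts and multifield patterns can be illustrated with a minimal Python sketch. It is not ACE's implementation (ACE is Java software); the field names and toy records are hypothetical, and missing values are assumed to be represented as None.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"diag_01": "I21", "diag_02": None, "oper_01": None},
    {"diag_01": "I21", "diag_02": "E11", "oper_01": "K40"},
    {"diag_01": "J18", "diag_02": None, "oper_01": None},
    {"diag_01": None, "diag_02": "E11", "oper_01": "K40"},
]

def missing_per_field(rows):
    """The usual check: how many values are missing in each individual field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return counts

def missing_patterns(rows):
    """The set-based view: how often each combination of fields is missing together."""
    return Counter(
        frozenset(field for field, value in row.items() if value is None)
        for row in rows
    )

per_field = missing_per_field(records)
patterns = missing_patterns(records)

for pattern, count in patterns.most_common():
    print(sorted(pattern), count)
```

The per-field counts say only that diag_02 and oper_01 are each missing twice. The pattern counts add that those two fields are always missing in the same records, which is the kind of information a set intersection view is designed to expose.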

An example of ACE’s power is as follows. An overview visualisation (a bar chart showing the number of missing values in each field) shows the expected structure for patients’ diagnoses (a staircase or “monotone” pattern, which occurs because most patients have only one condition but a few have complex health conditions). However, when a heatmap was used, other rare and completely unexpected multifield patterns immediately “popped” out. Those rare patterns shared the structural characteristic of having “gaps”, and ACE’s interactive data mining allowed the researchers to pinpoint that the majority of those gaps originated from one particular part of a specific hospital. The researchers found similar structures in the fields that stored data about patients’ operations, which potentially affects NHS healthcare resource groups and payments to hospitals.
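The distinction between a monotone pattern and a pattern with gaps can be made concrete with a short Python sketch. It assumes that a record's diagnosis fields are supplied in their sequential order and that None marks a missing value; the function name and sample values are hypothetical and are not taken from ACE.

```python
def has_gap(values):
    """Return True if a missing value (None) is followed by a present value.

    A monotone record fills fields sequentially, so once one field is
    missing every later field is missing too. A gap breaks that rule.
    """
    seen_missing = False
    for value in values:
        if value is None:
            seen_missing = True
        elif seen_missing:
            return True
    return False

# Hypothetical diagnosis fields for two records, in field order.
monotone_record = ["I21", "E11", None, None]  # expected staircase shape
gapped_record = ["I21", None, "E11", None]    # second field skipped

print(has_gap(monotone_record))
print(has_gap(gapped_record))
```

Flagging records this way gives a simple yes/no label per record, which can then be compared against other fields to look for where the gaps come from.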

The team used anonymised admitted patient care health records for National Health Service (NHS) hospitals and independent sector providers in England. The visualisation and data mining software was run over 16 million records and 86 fields in the dataset.

The results showed that the dataset contained 960 million missing values. Set visualisation bar charts showed how those values were distributed across the fields, including several fields that, unexpectedly, were not complete. Set intersection heatmaps revealed unexpected gaps in diagnosis, operation and date fields because diagnosis and operation fields were not filled up sequentially and some operations did not have corresponding dates. Information gain ratio and entropy calculations allowed us to identify the origin of each unexpected pattern, in terms of the values of other fields.
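As an illustration of how entropy and information gain ratio can point to the origin of a pattern, the Python sketch below uses the standard textbook definitions (information gain divided by the entropy of the candidate field). Whether ACE computes the measures in exactly this way is an assumption, and the field names and values are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(attribute, target):
    """Information gain of `attribute` about `target`, divided by the
    entropy of `attribute` itself (the split information)."""
    total = len(target)
    groups = {}
    for a, t in zip(attribute, target):
        groups.setdefault(a, []).append(t)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(target) - conditional
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical data: does each record show the unexpected pattern?
shows_pattern = [True, True, False, False, False, False]
# Hypothetical candidate explanatory fields for the same six records.
provider = ["A", "A", "B", "B", "C", "C"]
admission_method = ["x", "y", "x", "y", "x", "y"]

provider_score = gain_ratio(provider, shows_pattern)
method_score = gain_ratio(admission_method, shows_pattern)
print(round(provider_score, 3), round(method_score, 3))
```

In this toy data the pattern occurs only for provider A, so the provider field scores much higher than the admission method field, which carries no information about the pattern. Ranking candidate fields by such a score is one way to narrow down where a pattern originates.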

Our findings show how set visualisation reveals important insights about multifield missing data patterns in large EHR datasets. The study revealed both rare and widespread data quality issues that were previously unknown, and allowed a particular part of a specific hospital to be pinpointed as the origin of rare issues that NHS Digital did not know existed.

Citation and funding:

Ruddle RA, Adnan M, Hall M. Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data. BMJ Open 2022;12:e064887. doi: 10.1136/bmjopen-2022-064887 (https://bmjopen.bmj.com/content/12/11/e064887)

Funding: Engineering and Physical Sciences Research Council grant numbers EP/N013980/1 and EP/K503836/1, the British Heart Foundation grant number PG/13/81/30474 and the Alan Turing Institute.

Software

ACE (which stands for Analysis of Combinations of Events) is Java software that is freely available from https://doi.org/10.5518/1150. ACE can be applied to investigate both missing data and generic set-typed data (e.g., to understand customer transactions in shops; Adnan & Ruddle, Proceedings of EuroVA 2018, https://diglib.eg.org/bitstream/handle/10.2312/eurova20181110/037-041.pdf).

Further Information

Contact Professor Roy Ruddle. Tel: 0113 343 1711. Email: r.a.ruddle@leeds.ac.uk