Skip to main content

Scalable Visualisation and Explainability of Synthetic Datasets

Date

Introduction

Synthetic data holds the promise of enabling research and innovation without exposing sensitive information, but only if its quality and privacy can be rigorously demonstrated. The tool developed enables researchers and domain experts to easily compare real and synthetic tabular datasets, combining interactive visualisation, statistical validation and anomaly detection, privacy assessment and AI powered data quality assessment so that organisations can trust synthetic data in high‑stakes contexts. 

Project overview

Generating synthetic data is increasingly popular for privacy preservation and data augmentation, yet deployment remains hampered by the lack of transparent validation tools. Many existing approaches reduce quality to a single score or rely on ad‑hoc scripts, leaving non‑statisticians uncertain about reliability. This project, a collaboration between the Leeds Institute for Data Analytics, the Schools of Mathematics and Computer Science at the University of Leeds, and 4‑Xtra Technologies Ltd, was built to close this gap. The project is part of the broader MAVIS programme (EPSRC grant EP/X029689/1) focused on scalable visualisation for explaining machine‑learning models. 

The team set out to provide a unified, production‑ready workflow that: 

  1. Lets users explore univariate, bivariate and multivariate fidelity through interactive visualisations including anomalies in the data with a downloadable csv file which contains the identified anomalous points. 
  1. Implements well‑established statistical tests for distributional similarity and structural preservation, returning raw statistics rather than opaque composites, also privacy metrics and integrates AI agentic expert system engineered for interpretation of these results with a downloadable pdf report generated to enable the domain expert or researcher make sense of the results without being an expert. 
  1. Packages the solution with a modern React/D3 front end and a FastAPI backend with containerised deployment. By surfacing per‑test evidence and contextual explanations, it empowers practitioners to apply domain knowledge and governance rules rather than deferring to a “black‑box” score. 

Video 1: Demonstration of tool 

Data and methods

Inputs and preprocessing 

The tool operates on two CSV or excel files sharing the same schema: one containing real data and the other containing synthetic data. Typical demonstration datasets include mixed numeric and categorical columns. During ingestion the backend performs type checks, converts numeric strings, extracts ranges and categories and stores the datasets using SQLAlchemy ORM with SQLite database for reproducibility. 

Software Architecture 

Figure 1: Software architecture of the tool

Figure 2: Data upload and configuration interface

The frontend is built with React/D3. The backend, built with FastAPI, generates interactive documentation. UMAP and OpenTSNE are the machine learning algorithms used for dimensionality reduction. Also, a custom algorithm was developed for anomaly detection. The tool is packaged with Docker, so it runs the same on any machine and has an option to use a GPU to speed up UMAP and fits into automated testing and release workflows. Comprehensive documentation supports long-term use sustainability, transparency and adoption. 

Statistical tests 

To evaluate how closely synthetic data matches its real counterpart, the tool employs established tests from SciPy[5] and statsmodels[4]. SciPy extends NumPy with domain‑specific algorithms for optimisation, integration and other computations, providing a robust foundation for hypothesis tests. Univariate similarity is measured using the Kolmogorov–Smirnov (KS) test for continuous distributions, Welch’s t‑test for comparing means, and the chi square test for categorical frequencies. Correlation structures are assessed via element‑wise Pearson differences and the Jennrich test[2] for equality of correlation matrices. For multivariate behaviour, it computes energy distance, total variation distance and Kullback–Leibler divergence. Outliers are flagged using interquartile range (IQR) thresholds, and multiple comparisons are controlled using the Benjamini–Hochberg false discovery rate procedure[1]. Privacy is quantified via nearest‑neighbour–based metrics. It computes three privacy indicators: 

  1. Nearest‑to‑second‑nearest distance ratio (NNDR) – the ratio of the distance to a record’s nearest neighbour versus its second nearest neighbour in the real dataset. Low ratios indicate synthetic records that are closer to some real record than that record is to its own neighbours. 
  1. Nearest‑neighbour distance statistics – summarise the distribution of distances from each synthetic record to its nearest real neighbour. Smaller distances suggest potential leakage. 
  1. Exact‑match rate – the proportion of synthetic records that are verbatim copies of real records (row‑wise collision detection). 

These metrics run quickly on large tables, require no external privacy packages and provide intuitive indicators of proximity between synthetic and real records. The tool encodes categorical variables using one‑hot encoding before computation. 

Visualisation and dimensionality reduction 

Recognising that statistics alone can obscure nuance, The tool provides interactive visual comparisons. Histograms, bar charts and violin plots allow users to inspect univariate and bivariate fidelity. To explore multivariate structure, the platform employs non‑linear techniques, openTSNE and UMAP. openTSNE preserves local pairwise similarities, producing clear clusters but potentially distorting global geometry. UMAP constructs fuzzy simplicial sets to preserve both local and some global structure while offering faster computation. By offering multiple dimensionality‑reduction methods with configurable parameters, the tool enables users to check whether synthetic data falls within the same clusters or manifolds as the real data. The front end displays these embeddings as scatter plots with tooltips and lasso selection, and a sidebar updates statistical plots for the selected points. Users can then run the custom Anomaly detection algorithm and explore the anomalous areas visually. 

Figure 3: Anomaly detection and distribution comparison

 

Figure 4: Distribution comparison histogram for the charges variable

AI‑driven statistical results interpretation 

After running the tests, the structured results are passed to an AI agent that has been primed with descriptive prompts. The agent generates human‑readable summaries explaining which variables exhibit significant differences, where correlations drift, and whether privacy scores are acceptable and rates the overall risk and practical use of the synthetic data. The combination of quantitative output and narrative explanation makes it easier for non‑experts to interpret the results. 

Figure 5: AI expert analysis report screenshot

Key outcomes

Clear visuals for complex data 

The tool turns very high-dimensional data into easy-to-read maps using UMAP and openTSNE. The models are also saved and reused so results stay consistent across runs. The pipeline handles both numbers and categories and can use a GPU for these reductions when available. 

Thorough, reliable data checks 

The system checks whether synthetic data behaves like real data across seven areas: sensible ranges, distribution shapes, relationships between variables, privacy tests (to ensure no synthetic record is too close to a real one), outliers, overall quality, and multivariate patterns. We use standard statistics (for example, Kolmogorov–Smirnov and energy distance) and adjust for multiple tests to avoid false alarms. The result is a balanced view of quality and risk that teams can trust. 

Custom anomaly detection 

The tool includes a proprietary, explainable anomaly detector that looks for localised issues. It compares small slices of real and synthetic data using proportion-based tests, then highlights where the synthetic data deviates from what you’d expect. Findings are shown in a clear, color-coded view and backed by summary numbers (e.g., anomaly rates and affected regions), so you can see what’s wrong and why. The implementation is designed for production use and can process large datasets efficiently while maintaining statistical rigor. 

AI-powered reports 

An AI assistant turns raw results into clear summaries and practical recommendations. Reports include a short executive overview plus optional technical detail and can be exported to JSON and PDF. We use prompt engineering—structured prompts and templates—to keep tone, structure, and conclusions consistent and reliable. 

Simple, reliable product setup 

The tool is a modern web app with a React front end and a FastAPI back end. It’s packaged to run the same way in different environments, and it produces clean, structured logs that make it easy to operate and troubleshoot. 

Findings

A case study was done using a financial dataset(insurance) demonstrates how the tool’s embedding view and adaptive grid reveal anomalies that would be difficult to spot by looking at one column at a time. In this example, 1,000 real records and 1,000 synthetic records were sampled from a larger dataset (1,338 real and 100,000 synthetic rows). The synthetic data was generated by a proprietary 4‑Xtra model and then additional anomalies were manually added along the CHARGES and CHILDREN columns by inserting unrealistic values. 

On the UMAP projection, real (red) and synthetic (blue) points trace a winding manifold that captures high‑dimensional relationships among age, BMI, number of children, smoking status and charges. It overlays an adaptive histogram‑based grid on this projection and performs a binomial test in each cell to determine whether synthetic or real points are over‑represented relative to the global proportion. Cells where real points significantly outnumber synthetic points are shaded red, and cells where synthetic points dominate are shaded blue. In this run, the algorithm identified seven anomalous regions (shown by coloured cells), with the summary pane reporting the number of “nominators” (points falling into any coloured cell) and the count of anomalies in each class as shown in Figure 3. 

One of the blue cells in the lower left corresponds to a cluster where synthetic data heavily outweighs real data. Clicking this cell populates the distributions panel with a histogram of CHARGES for the selected points. The histogram shows that the synthetic records have far more entries in certain charge ranges than the real data; the synthetic bars (blue) dominate and extend into very high or unusual charge values, while the real bars (red) are sparse. This over‑representation is a direct consequence of the manual corruption introduced into the synthetic generator extreme or negative charges and mismatched numbers of children inflate the synthetic population in this part of the space. Conversely, several red cells along the manifold reveal areas where the real data has clusters that the synthetic model failed to reproduce; these correspond to realistic combinations of features that the synthetic model under‑sampled or omitted. 

The advantage of this approach is that it surfaces both univariate anomalies (e.g., unrealistic charge values) and multivariate structure (e.g., the joint distribution of children and charges) in a single interactive view. Traditional approaches such as scanning separate histograms for each column – might detect that charges have an unusual range, but they would not show that the anomalies occur in specific combinations of features or highlight where the synthetic model has missed real patterns. The tools adaptive grid uses false‑discovery‑rate‑controlled binomial tests to flag only statistically significant deviations, preventing visual clutter and false alarms. Analysts can quickly hover over coloured cells to see detailed statistics or download a CSV of the selected points. In practice, this allows data scientists and governance teams to identify and correct issues such as negative charges, implausibly high charges, or unrealistic correlations between the number of children and medical costs much more efficiently than by manually inspecting individual histograms. 

Overall, the findings from this insurance dataset show that, although the synthetic data preserves many categorical distributions (e.g., SEX, SMOKER, REGION) and some numeric ranges, it diverges substantially in CHARGES and CHILDREN. The embedding view makes these divergences obvious: synthetic data clusters in regions with extreme or impossible charges, while missing some real‑world patterns. Such insights underscore the importance of integrated, multi‑dimensional validation tools like this for detecting subtle anomalies and guiding targeted improvements to synthetic data generators. 

The tool offers a pragmatic bridge between synthetic‑data research and real‑world deployment. For researchers, it eliminates the need to cobble together separate scripts for statistics, plotting and privacy; instead, a unified workflow produces comprehensive evidence and commentary. For organisations, the system builds trust: by exposing per‑test statistics, p‑values and effect sizes, it respects domain‑specific thresholds and regulatory scrutiny. Legal and compliance teams can examine each measure and apply their own risk frameworks rather than relying on an opaque score. For the wider research community, it demonstrates how to operationalise academic methods:  It leverages established libraries (SciPy, statsmodels, scikit-learn[3], openTSNE) and implements fast custom privacy checks (NNDR, nearest-neighbor distance, exact match), uses modern web technologies for interactive visualisation, and packages everything in reproducible containers. As a result, the tool accelerates adoption of synthetic data by making quality assessment transparent, repeatable and scalable. 

Importantly, the project contributes to responsible AI practices. It underscores that synthetic data should not be blindly trusted; rigorous validation is essential to avoid encoding biases or unrealistic artefacts. By integrating privacy metrics alongside utility tests, it emphasises balanced evaluation. Its AI‑driven interpretation shows how language models can assist domain experts by translating statistics into narrative insights rather than replacing expert judgement. Finally, interactive visualisations foster understanding among non‑technical stakeholders, promoting a culture of data literacy and evidence‑based decision making. 

Quote from partner

“The goal of this collaboration was to build an effective, visual-first workflow for comparing real datasets with synthetic counterparts produced by generative AI or synthetic data generator models. This need arises during the training and evaluation loop of such models, which is currently underserved in today’s GenAI ecosystem. 

The output of this project addresses that gap by closing the loop for developers. It enables AI engineers and Data Scientists to pinpoint problematic slices of synthetic data in minutes, channel targeted feedback into model training, and iterate with confidence. As GenAI and synthetic data reshape entire markets, transparent, repeatable validation is what turns promising research into deployable, trustworthy products.”  

– Lukas Čironis, Principal Data Scientist, 4-Xtra Technologies Ltd. 

Insights

  • Unified workflow: The tool brings ingestion, validation, visualisation, privacy assessment and reporting into a single process triggered by a single button. 
  • Transparency: Rather than producing a single “quality score,” the platform returns raw statistics, p‑values, effect sizes and false‑discovery‑rate flags, allowing experts to apply their own thresholds. 
  • Interactive exploration: Users can hover over points, select subsets and see updated histograms and correlations, making it easier to pinpoint where synthetic data diverges from the real. 
  • Privacy awareness: Integration privacy metrics ensures that privacy risks are assessed alongside utility. 
  • Operational readiness: Modern frameworks (React, D3, FastAPI), optional GPU acceleration, asynchronous job processing and Dockerised deployment make the tool suitable for research and industry. 
  • Extensibility: Built on open‑source libraries, the platform can be expanded to new data types or integrated with synthetic‑data generators for closed‑loop improvement. 

Figure 6: Embedding history page

Research theme

This case study aligns with multiple Leeds Institute for Data Analytics themes: Health and Societies, given the use of healthcare and socio‑economic datasets; Visualisation/Extended Reality because of its interactive visual analytics; Data Science Infrastructures due to its scalable backend; and Artificial Intelligence/The Science of Data Science via its AI‑driven interpretation and methodological contributions. 

Programme theme

Within the Data Scientist Development Programme, the project primarily advances the Visualisation/Extended Reality theme through its exploratory front end and the Data Science Infrastructures theme via its containerised backend. The integration of AI for report generation also relates to the Artificial Intelligence theme. 

Team

  • Netochukwu Onyiaji – Data Scientist, LIDA, University of Leeds. 
  • Dr Leonid Bogachev – Reader in Statistics, University of Leeds. 
  • Prof. Roy Ruddle – Professor of Computer Science, University of Leeds. 
  • Dr Liqun Liu – Research Fellow in Computer Science, University of Leeds. 
  • Dr Lukas Čironis – Principal Data Scientist, 4‑Xtra Technologies Ltd. 

Partners

This project was undertaken jointly by the Leeds Institute for Data Analytics, the Schools of Mathematics and Computer Science at the University of Leeds, and 4‑Xtra Technologies Ltd. 

Funder acknowledgement

This work forms part of the Making Visualisation Scalable (MAVIS) for Explaining Machine Learning Models project funded by the Engineering and Physical Sciences Research Council (EPSRC grant EP/X029689/1) and is supported by the Leeds Institute for Data Analytics (LIDA) Data Scientist Development Programme, which provides early‑career researchers with opportunities to deliver data‑driven impact for public good. 

References

  1. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. 
  2. Jennrich, R. I. (1970). An asymptotic chi‑square test for the equality of two correlation matrices. Journal of the American Statistical Association, 65(330), 904–912. 
  3. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … & Duchesnay, É. (2011). Scikit‑learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  4. Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modelling with Python. Proceedings of the 9th Python in Science Conference (SciPy2010), 57–61. 
  5. Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., … & SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3), 261–272.