Causal Inference Frameworks for Individual Based Models

Date: Monday 28 April 2025

This project is building towards a framework that integrates DAG-based causal inference with agent-based modelling. We aim to provide a novel approach for simulating complex systems by combining causal structure with individual-level interactions.

Project overview

Most human beings have an intuitive understanding of causation, but its complexity remains largely unexplained. To improve public health outcomes, we need a causal understanding of how individual behaviours influence population health trends.

For decades, epidemiology has focused on isolating single causes of diseases through analytical methods. However, understanding health and disease requires recognising how biological, behavioural, and societal factors interact dynamically over time.1 People move through space and time, constantly interacting with others, shaped by economic, social, geographic and political factors. To improve population health trends, we need to grasp how individual behaviour and decision-making connect within broader patterns and understand the spatial and temporal scales at which these operate.

To address this, many methods have been developed across different fields. In epidemiology, the main approach to causal analyses relies on statistical regression models guided by graphical causal models such as the Directed Acyclic Graph (DAG). These models help evaluate counterfactual contrasts, making it possible to identify and estimate causal relationships accurately.2 In recent years, agent-based models (ABMs) and microsimulation models (MSMs) have emerged as promising tools for causal inference evaluations within complex systems. Agent Based Modelling (ABM) is an effective tool for analysing exposures in a virtual environment, but their reliability in estimating robust causal effects needs exploration. DAG-based regression modelling mostly focuses on capturing fixed effects by modelling mean structures, whereas simulation approaches can lean into complexity by prioritising random structures.2

MSMs and ABMs are effective tools for analysing multiple measures of an exposure in an “in silico” (digital) environment. However, the conditions under which they reliably estimate causal effects need to be better understood. This project builds on previous research to integrate a form causal framework with the ABM/MSM tools for simulating longitudinal data with a priori causal structures.

Through this project we aim to better understand how synthetic datasets might be better simulated to ensure that appropriate causal structure is embedded, to identify the essential components in a synthetic dataset that are needed for building causally informed ABMs.

Data and methods

In this project, we developed a causally informed Agent-Based Model (ABM) designed to simulate the spread of an infectious disease. Our approach involved building a model for a hypothetical scenario that integrated causal relationships to better capture the complexity of disease transmission. We generated synthetic data informed by these causal structures, allowing us to explore how various factors interact and influence outcomes at different scales. By grounding our model in principles of causal inference, we aimed to enhance the robustness and accuracy of the simulation for better understanding disease transmission and dynamics within a population.

ABMs usually incorporate random interactions based on naïve assumptions; this project integrates a causal structure informed by a DAG to direct agent behaviour.

Figure1: The Directed Acyclic Graph (DAG) generated for this project. This DAG represents the exposures and outcomes in the hypothetical scenario and the variables affecting these. The subscript 0 represents time (t0) which is the point of initiation, subscript 1 represents the first time point (t1), subscript F represents (tF) the final time-period when the model stabilized. The double pointed arrow notation represents that those variables have a causal impact on all subsequent variables.

First, a DAG was created (Fig. 1) to represent the hypothetical scenario of infection spread within a population. A DAG is a graphical representation of causal relationships, specifying exposures and outcomes, and identifies the variables influencing these factors. The variables, exposures, and outcomes included in this DAG were selected based on domain-specific knowledge and background research, particularly focusing on their role in affecting individual susceptibility to infections. To maintain simplicity and clarity, only the most critical variables were chosen. The initial set of variables considered included age, sex, and ethnicity. These demographic factors have an impact on all subsequent variables and directly influence comorbidities. Together, these four factors determine the initial number of infection cases at the start of the simulation, defined as time t0. The cases at the start (point of initiation t0) define the number of recovered and infected individuals at the next time-period (t1). These variables then define the immunity of an individual to getting reinfected.

These time periods will keep repeating iteratively until the simulation reaches a stable state (tF). All the variables included in this graph collectively influence the number of recovered individuals and the time taken for recovery when the simulation reaches a stable state. For simplicity, the number of deaths has not been considered as an outcome. We have assumed that the number of deaths would be minimal or negligible relative to the entire population. Since DAGs are not well-suited for modelling agent-to-agent interactions, we have intentionally minimized complexity to reduce discrepancies between the DAG and the agent-based model.

Next, synthetic data (n = 1500) were generated using a data-generating mechanism informed by the causal relationships outlined in the DAG. The DAG was constructed in R using the dagitty R package, specifying plausible causal paths and defining path-coefficients between variables.3 The dagitty R package was used for obtaining the correlation matrix, which provides the statistical relationships between variables as implied by the underlying causal structure. The simulation of synthetic data was conducted using the GenData package.4 The variables considered for each individual in the synthetic dataset were age, sex, ethnicity, and comorbidities. These variables informed the susceptibility of the individuals to infection and subsequent recovery. They determined the initial case numbers and influenced immunity levels, infection, and recovery counts in subsequent periods. Empirical correlation matrices were generated by simulating data directly from this DAG, which allowed us to establish a target correlation structure reflecting the intended relationships.

The raw simulated data were then transformed to ensure practical utility within the ABM. For setting up the ABM environment, NetLogo software was used due to its flexibility and ease of use for simulating individual agent behaviours.5 The basic framework for this model was adapted from the epiDEM (Epidemiology: Understanding Disease Dynamics and Emergence through Modelling) example available in NetLogo's model library.6 Within NetLogo, agents (representing individual people) were created based on the transformed synthetic dataset. Initial infection statuses of agents were not randomly assigned but determined explicitly using a logistic regression model. This logistic regression model was computed in an R environment within NetLogo, using the variables age, sex, ethnicity, and comorbidities as predictors for infection probability. The Simple R extension for NetLogo was used to integrate the R environment for the logistic regression calculations.7

At the start (tick 0 in NetLogo, which represents the point of initiation t0) and for subsequent ticks (tn), each agent's probability of infection was recalculated using updated data from the ongoing simulation. The process involved collecting the current state of all agents into a data frame, passing this data to R, running the logistic regression model, and retrieving the newly predicted infection probabilities back into NetLogo. These probabilities were then directly used within NetLogo to update the infection statuses of the agents, maintaining causal structure as informed by the DAG.

The simulation proceeds iteratively, with each tick representing a discrete time point (in weeks). At every tick, agents moved randomly and interacted with neighbouring agents. New infections and recoveries were determined using the previously calculated probabilities, updating agent states accordingly. Agents' status of susceptible, infected, or recovered was visually represented by colour-coding with red representing infected individual, white representing those uninfected and green representing those recovered.

Figure 2: The NetLogo model at setup. This represents the point of initiation (t0). individuals that are infected are shown as “red” and un-infected individuals as “white”.

This causally informed approach ensures the model's outputs reflect realistic epidemiological dynamics influenced by the defined factors of age, sex, ethnicity, comorbidities, and immunity. This simulation thus provides a causally structured representation of the hypothetical scenario.

For comparison purposes, the DAG-informed ABM was evaluated alongside two other models: a naïve ABM and a Microsimulation Model (MSM). The naïve ABM was constructed using the same synthetically generated data; however, agent interactions were based on the assumptions typically input rather than informed by the DAG. Specifically, agents were assumed to be susceptible to infection based on demographic criteria. For example, individuals younger than 18 years or older than 65 years were considered inherently more vulnerable. This assumption was derived from general background research rather than causal relationships explicitly informed by the DAG.

The MSM was created to specifically remove the spatial constraints inherent in the ABM. In the DAG-informed ABM, uninfected agents become infected only if they come into contact with an infected agent within a defined radius of one patch (where a patch represents the basic unit of distance measurement in NetLogo). In contrast, the MSM eliminates this spatial restriction, allowing interactions to occur without the one-patch radius limitation. Further details and the NetLogo code are added to the project GitHub repository.

Key findings

Our findings indicate that integrating a DAG-informed causal framework into the agent-based model enhances its ability to capture the complexity of disease dynamics. By generating synthetic data that adheres to a target correlation structure defined by key epidemiological factors, it is ensured that the ABM accurately reflects the intended causal relationships. Compared with both the naïve ABM and the MSM, the DAG-informed approach produced results that more closely aligned with our a priori assumptions, offering a robust representation of disease-transmission dynamics over discrete time points.

Figure 3: NetLogo simulation of the DAG-informed ABM. The ABM shows white coloured agents representing individuals who remain uninfected for the duration of the simulation.

A particularly interesting observation was that in the DAG-informed model, some individuals remained uninfected throughout the simulation. This result mirrors real-world patterns where not everyone contracts a disease. In contrast, the naïve model demonstrates complete population infection, with every agent eventually becoming infected, suggesting a lack of inherent protective factors or heterogeneity in agent characteristics. This underscores the limitations of simplistic models that rely purely on random interactions without accounting for underlying causal relationships.

Table 1: Showing the result metrics for each model.

Metric	Naive ABM	DAG-informed ABM	MSM
Initial Infection Rate (%)	5.1	51.2	52.2
Infection Rate (%)	91.1	46.4	63.0
Never-Infected count	0.0	75.0	2.0
Recovery Rate (%)	100.0	95.0	99.8
Simulation Runtime (seconds)	60.3	29.1	24.8

Figure 4: Results charts for each model.

In the DAG-informed model, the infection rate peaks early before gradually declining as recovery rates surpass new infections, eventually reaching a stable state. This model captures the natural progression of a disease outbreak; one possible explanation is that accounting for immunity within the population contributes to slowing transmission. Meanwhile, the MSM produces smoother curves indicating a more generalized spread due to the removed spatial constraints.

The recovery curve in the naïve model follows a normal distribution, as explicitly defined in the model’s code, rather than being influenced by individual agent characteristics and interactions. This approach simplifies recovery dynamics by assuming that recovery follows a predetermined statistical pattern, independent of agents’ demographic attributes or their interactions with others. Consequently, this model overlooks potential heterogeneity in recovery rates arising from factors such as age, sex, comorbidities, etc.

The DAG-informed ABM and MSMs start with a higher initial infection rate due to the stronger variable path coefficients chosen during data generation. This approach was intentional, as it allowed for shorter run times to ensure efficient execution in NetLogo on a local system. The 50% initial infection rate was selected to achieve model stability efficiently, this does not affect the broader conclusions drawn from comparing different approaches. The models can be directly compared despite differences in time to reach stability, as these variations are primarily due to initial interactions and assumptions, not differences in temporal granularity. Since this is a proof-of-concept model, the path coefficients were adjusted for practicality, but they can be modified as needed to create more refined and situation-specific models.

One limitation of this model is that deaths have not been accounted for, which could impact the accuracy of the simulation, particularly if mortality is a sizable risk. Additionally, DAGs are inherently better suited for understanding causal relationships at a population level rather than capturing detailed agent-to-agent interactions. In contrast, while DAGs excel at providing a structured overview of how various factors influence outcomes across a population, they are less effective at accurately representing individual-level interactions that drive disease spread. Integrating DAG-informed thinking with ABM strategies brings the benefits of both approaches. Overall, the findings highlight the strengths and limitations of integrating causal structures within simulation models, underscoring the need for further research to explore this approach more comprehensively.

Value of the research

The value of this research is its ability to make epidemiological simulations more reliable by adding causal structure to ABMs. By developing a DAG-informed ABM, this study demonstrates how causal inference methods can be applied to improve the robustness of disease transmission models in capturing multifaceted interactions that may not be captured in naïve models. This approach is especially valuable for simulating infectious diseases where individual-level heterogeneity and agent-to-agent interactions significantly influence outcomes.

This work serves as a first step in building a framework that could be applied to real-world public health problems, such as optimising vaccination strategies or understanding the spread of emerging diseases. By bridging the gap between statistical causal models and simulation-based approaches, this research contributes to a broader effort to build more reliable tools for public health decision-making. However, this model is a proof of concept, and further work is needed to refine and expand this approach. Future directions for this work could include incorporating real-world geographic structures and multilevel hierarchical data to enhance model accuracy and applicability.

Insights

Novelty: This study is among the first attempts to integrate DAGs with ABMs to enhance the robustness of disease transmission simulations. It serves as a building block for combining causal inference techniques with ABMs.
Value: This study establishes a framework integrating DAGs and ABMs, allowing for more robust modelling of disease spread which accounts for complex interactions between demographic factors and individual-level behaviours.
Future Directions: Incorporating real-world geographic structures and multilevel hierarchical data could enhance the model’s accuracy and applicability. This research can also provide a useful framework for studying causal inference in broader contexts, such as modelling human mobility patterns more generally.

Research theme:

Identify which LIDA research theme(s) this sits under:

Health
Societies

Programme theme:

The Science of Data Science
Data Science Infrastructures

Team:

Lynette Linzbuoy, Data Scientist, Leeds Institute for Data Analytics
Dr Jiaqi Ge, School of Geography, University of Leeds
Prof Mark S Gilthorpe, Professor of Statistical Epidemiology, Obesity Institute, Leeds Beckett University
Prof Alison J Heppenstall, Professor of Geocomputation, School of Political and Social Sciences, University of Glasgow

Partner:

Geolytix

Funder:

This project was funded by the Leeds Social Sciences Institute ESRC Impact Acceleration Account funding ring-fenced for the Data Scientist Development Programme 2024-25.

This work has been facilitated by the Leeds Institute for Data Analytics (LIDA) Data Scientist Development Programme, which employs early-career data scientists to deliver real-world data-driven impact in the interests of the public good.

Conflicts of interest:

Prof Mark S Gilthorpe is a director of Causal Insights Solutions Ltd, which provides causal inference training and may benefit from any study that demonstrates the value of causal inference methods. All other authors have no conflicts of interest to declare.

References

Galea S, Riddle M, Kaplan GA. Causal thinking and complex system approaches in epidemiology. Int J Epidemiol. 2010;39(1):97-106. doi:10.1093/ije/dyp296
Arnold KF, Harrison WJ, Heppenstall AJ, Gilthorpe MS. DAG-informed regression modelling, agent-based modelling and microsimulation modelling: a critical comparison of methods for causal inference. International Journal of Epidemiology. 2019;48(1):243-253. doi:10.1093/ije/dyy260
Textor J, van der Zander B, Gilthorpe MS, Liśkiewicz M, Ellison GT. Robust causal inference using directed acyclic graphs: the R package ‘dagitty.’ International Journal of Epidemiology. 2016;45(6):1887-1894. doi:10.1093/ije/dyw341
Huang F. gendata: Generate and Modify Synthetic Datasets. Published online May 9, 2022. https://cran.r-project.org/web/packages/gendata/index.html
Wilensky U. NetLogo. 1999. https://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.
Yang C, Wilensky U. NetLogo epiDEM Basic model. 2011. http://ccl.northwestern.edu/netlogo/models/epiDEMBasic. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.
Hovet J, Head B, Wilensky U. Simple R NetLogo extension. GitHub. 2022. https://github.com/NetLogo/SimpleR-Extension. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.