
Building and analysing cohorts within the OMOP common data model


Making clinical studies easier for the analyst, and faster for the clinician. 


Project overview

One of the most challenging aspects of conducting medical studies is obtaining the data: privacy must be protected, large numbers of patients are often required, and results frequently need to be drawn from many disparate data sources and then compared and contrasted. Extracting data for analysis traditionally requires strict data use agreements. A common data model (CDM) alleviates this need by removing the extraction step. However, retrieving cohorts from a CDM can still be challenging, and this process can be made easier.

The focus of this project was to make this process easier for the analyst by creating a cohort creation library in R, a document detailing the cohort creation process, and analysis scripts, thereby speeding up the turnaround between the clinician and the analyst.

Data and methods  

The data used in this project was a random 50k subset of the connected data Yorkshire Observational Medical Outcomes Partnership (OMOP) common data model, in which all individuals were from the Bradford area.

Using this data, a variety of cohorts were created to develop the necessary functions: patients with asthma, patients with type 2 diabetes, patients with pre-diabetes, patients with childhood autism, the whole 50k cohort, and patients with Chronic Obstructive Pulmonary Disease (COPD) who had an exacerbation in the last 12 months.

Using SQL and working through these cohort examples, three functions were made to retrieve cohorts from the OMOP database. One finds patients suffering from a given condition, and one finds patients taking given prescriptions. The third is a general-purpose function that makes it easier to run SQL queries on the OMOP CDM. Analysis scripts were also built for the cohorts that had sufficient data.
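To illustrate the kind of query these functions wrap, here is a minimal sketch. It is not the project's library, which was written in R; it uses Python with an in-memory SQLite database so that it runs on its own. The table and column names (condition_occurrence, drug_exposure, concept_ancestor) follow the standard OMOP CDM, but the function names, the toy data, and the concept IDs 1001, 1002, 2001 and 9999 are hypothetical placeholders, not real OMOP concepts.

```python
import sqlite3

# OMOP CDM tables and columns used by each (hypothetical) cohort function.
_DOMAINS = {
    "condition": ("condition_occurrence", "condition_concept_id", "condition_start_date"),
    "drug": ("drug_exposure", "drug_concept_id", "drug_exposure_start_date"),
}


def run_query(conn, sql, params=()):
    """General helper: run a SQL query and return the rows as a list of dicts."""
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def _cohort(conn, domain, concept_id):
    """Patients with any record for the concept or one of its descendants."""
    table, concept_col, date_col = _DOMAINS[domain]  # fixed names, never user input
    sql = f"""
        SELECT t.person_id, MIN(t.{date_col}) AS first_date
        FROM {table} AS t
        JOIN concept_ancestor AS ca
          ON ca.descendant_concept_id = t.{concept_col}
        WHERE ca.ancestor_concept_id = ?
        GROUP BY t.person_id
        ORDER BY t.person_id
    """
    return run_query(conn, sql, (concept_id,))


def condition_cohort(conn, concept_id):
    """Patients diagnosed with a condition, with their first diagnosis date."""
    return _cohort(conn, "condition", concept_id)


def drug_cohort(conn, concept_id):
    """Patients prescribed a drug, with their first exposure date."""
    return _cohort(conn, "drug", concept_id)


def build_toy_cdm():
    """Create a tiny in-memory database with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE condition_occurrence (
            person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);
        CREATE TABLE drug_exposure (
            person_id INTEGER, drug_concept_id INTEGER, drug_exposure_start_date TEXT);
        CREATE TABLE concept_ancestor (
            ancestor_concept_id INTEGER, descendant_concept_id INTEGER);
        -- Each concept is listed as its own ancestor; 1002 is a child of 1001.
        INSERT INTO concept_ancestor VALUES (1001, 1001), (1001, 1002), (2001, 2001);
        INSERT INTO condition_occurrence VALUES
            (1, 1001, '2015-03-01'), (1, 1002, '2012-06-15'),
            (2, 1002, '2018-01-20'), (3, 9999, '2019-05-05');
        INSERT INTO drug_exposure VALUES
            (1, 2001, '2015-03-02'), (3, 2001, '2019-05-06');
    """)
    return conn


conn = build_toy_cdm()
print(condition_cohort(conn, 1001))  # patients with concept 1001 or its descendants
print(drug_cohort(conn, 2001))       # patients with an exposure to concept 2001
```

Joining through concept_ancestor means a single parent concept picks up all of its more specific child codes, which is one common way to define a condition cohort in OMOP. Whether the project's functions take this approach is not stated in this summary.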

Finally, a document was written that records the process of cohort creation within the OMOP CDM. This will make analysis quicker and minimise the risk of miscommunication between the clinician and the analyst. The document gives an overview of the OMOP CDM and describes in detail how to build cohorts using both the functions made and standard SQL queries.

Key findings 

The findings of the project were varied. The analysis scripts mostly output standard analytical graphs. For example, asthma diagnoses were found to be increasing over the years, and the prescribing of preventer inhalers was increasing much faster than the prescribing of steroid inhalers. The main finding was that overall cohort building was much faster once the functions had been built.

Value of the research 

The main value of this research is the improvement in the overall speed of cohort building and analysis. The document will reduce miscommunication between the clinician and the analyst, as it explains the key fundamentals of the OMOP CDM for building cohorts. Because it is specific to cohort building, it also removes the need for a full understanding of the OMOP CDM.

Quote from project partner

“Creation of cohorts in the Connected Bradford dataset is a vital part of much of the work we are doing. Particularly in our work with the NIHR Patient Recruitment Centre it will help to improve access to research studies for all citizens of Bradford.”

Dr Tom Lawton (Bradford Institute for Health Research)

“The guidance is a valuable resource to enable researchers and analysts to access, develop and analyse patient cohorts.” 

Kuldeep Sohal (Bradford Institute for Health Research)

Insights (from analysis scripts)

  • The majority of asthma patients were diagnosed at an early age.
  • Rates of asthma diagnosis have been increasing year on year.
  • Patients diagnosed with diabetes also often suffer from angina pectoris (chest pains), asthma and heart disease.

Research theme

  • Healthcare Research
  • Healthcare Informatics


James Lazarus (LIDA Data Scientist Intern)

Dr Tom Lawton (Bradford Institute for Health Research)

Kuldeep Sohal (Bradford Institute for Health Research)


Bradford Institute for Health Research