Rizwana Uddin, Dr Nik Lomax, Dr Nick Hood – University of Leeds. This work was supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Digital Twins: Urban Analytics” theme within that grant, and The Alan Turing Institute.

In 2017, GP registers for Leeds counted 60,000 more people than the Office for National Statistics estimated to be living in the city. This project assessed where these discrepancies occur and provided further understanding of why they are occurring.

Project overview
Since the 2011 Census, population estimates for Leeds based on counts of people registered with GPs have differed substantially from the Mid-Year Estimates (MYEs) published by the Office for National Statistics (ONS). Large discrepancies across particular areas of Leeds have implications for many aspects of city planning, such as health planning, transport planning, and election preparation. A single agreed version is highly desirable. This project used Geographical Information Systems (GIS) and classification methods to assess where the discrepancies exist within Leeds and provided further understanding of why they are occurring, indicating potential recent changes in the population composition of Leeds which are unaccounted for by the MYEs.

Data and methods
A classification of the UK was conducted, using variables derived from the 2011 Census outputs, to identify demographic patterns across the UK and how these influence the disparity between population estimates from MYEs and GP registers.

Variables were selected to reflect themes including Age, Ethnic Group, UK Migration and Social Grade. Highly correlated variables were removed to reduce multicollinearity. Counts were converted to percentages for each Lower Super Output Area (LSOA).
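
The two preparation steps described above can be sketched as follows. This is a minimal illustration only, not the project's own code: the variable names, the example values, the 0.8 correlation threshold and the greedy keep-first rule are all assumptions made for the sketch.

```python
from math import sqrt


def to_percentages(counts, total):
    """Convert raw counts for one LSOA into percentages of its total population."""
    return {name: 100.0 * value / total for name, value in counts.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=0.8):
    """Greedily keep each variable unless it is strongly correlated
    (|r| above the assumed threshold) with a variable already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical percentages for four LSOAs (illustrative values only).
example_columns = {
    "age_65_plus": [10, 20, 30, 40],
    "retired": [11, 21, 29, 41],   # nearly duplicates age_65_plus
    "students": [30, 5, 25, 10],
}
selected = drop_correlated(example_columns)
```

Here `selected` retains `age_65_plus` and `students`, because `retired` carries almost the same information as `age_65_plus`.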

K-means clustering was performed to produce 7 clusters of the data using the 10 selected variables as inputs. The optimal number of clusters for the final clustering was determined using the elbow method. The city of London was not included in the final classification model because attributes that occur only in London were influencing other clusters.
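
To illustrate how k-means and the elbow method fit together, the sketch below implements Lloyd's algorithm in plain Python and records the within-cluster sum of squares for several values of k. It is an assumption-laden illustration rather than the project's implementation: the toy points, the seed and the function names are hypothetical, and in practice a library routine would normally be used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids, inertia), where inertia
    is the within-cluster sum of squared distances."""
    rng = random.Random(seed)
    # Initialise centroids at k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


def elbow_curve(points, k_values, seed=0):
    """Inertia for each candidate k; the 'elbow' is where the curve flattens."""
    return {k: kmeans(points, k, seed=seed)[2] for k in k_values}


# Hypothetical two-variable data forming three loose groups.
toy_points = [(1, 1), (1, 2), (8, 8), (9, 8), (1, 9), (2, 9)]
curve = elbow_curve(toy_points, range(1, 5))
```

Plotting `curve` against k and looking for the point where further clusters stop reducing the inertia substantially is the usual way to read the elbow; the choice remains a judgement rather than a fixed rule.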

Geographical Information Systems were used to map cluster locations across the UK at LSOA level. GIS was also used to analyse patterns in the percentage difference between the two sets of population estimates across Leeds.
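
As a small worked example of the quantity being mapped, the sketch below computes a percentage difference for each LSOA. The source does not state which estimate is used as the denominator, so taking the MYE as the baseline is an assumption, and the LSOA codes and counts are hypothetical.

```python
def percentage_difference(gp_count, mye_count):
    """Percentage by which the GP register count exceeds the MYE.
    Assumption: the MYE is the baseline; positive means more GP registrations."""
    return 100.0 * (gp_count - mye_count) / mye_count


# Hypothetical LSOA codes mapped to (GP count, MYE) pairs, for illustration only.
example_lsoas = {
    "LSOA_A": (1100, 1000),
    "LSOA_B": (1500, 1600),
}
differences = {
    code: percentage_difference(gp, mye) for code, (gp, mye) in example_lsoas.items()
}
```

A positive value marks an area where GP registrations exceed the ONS estimate, and a negative value marks the reverse.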

An optimum number of 7 clusters was found using k-means; these clusters were visualised using heatmaps of percentages of the variables. Each cluster displayed distinct characteristics of the population.
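
A heatmap of this kind is typically drawn from a table of mean variable percentages per cluster. The sketch below builds such a table in plain Python; the variable names, values and cluster labels are hypothetical and only show the shape of the calculation.

```python
from collections import defaultdict


def cluster_profiles(rows, labels):
    """Mean of each variable within each cluster: the table a heatmap would display."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {var: sum(r[var] for r in members) / len(members) for var in members[0]}
        for label, members in grouped.items()
    }


# Hypothetical percentages for three LSOAs and their assumed cluster labels.
example_rows = [
    {"age_65_plus": 10, "students": 50},
    {"age_65_plus": 30, "students": 70},
    {"age_65_plus": 80, "students": 20},
]
profiles = cluster_profiles(example_rows, [0, 0, 1])
```

Each row of `profiles` corresponds to one cluster and each column to one variable, which is the layout a heatmap would colour.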

The classification of the UK presented here has highlighted a recurring pattern of GP counts exceeding MYEs across the UK, which appears to be more pronounced in diverse clusters. This indicates that differences between population estimates are a wider problem occurring across the UK.

Leeds, however, differs from the rest of the UK in that it displays a higher frequency of LSOAs containing demographics that could be driving the disparity between MYEs and GP registration counts, suggesting that the ONS methods for producing population estimates in certain areas require review. Because the distribution of clusters across Leeds reflects the pattern of discrepancies between population estimates, there is some indication that particular groups may have a larger influence on the estimates.

Figure 1 – Geographic locations of clusters in Leeds

Figure 2 – Outlier percentage difference between the 2017 MYE and 2017 GP counts across the whole population of Leeds

Value of the research
Understanding where the differences in population estimates occur can contribute to establishing the single agreed version of population estimates that city planning services require.

Research theme

  • Demography
  • Internal Migration
  • International Migration


  • Leeds City Council
  • The Alan Turing Institute

This project was undertaken as part of the LIDA Data Scientist Internship Programme.