Socio-Economic Patterns of Household Waste Generation

Urbanization, driven by industrialization and migration to cities, presents significant challenges, particularly in waste management. Rapid increases in municipal solid waste, projected to double globally by 2025 (UN Habitat, 2019), highlight the urgency of addressing this issue. In the UK, approximately 10.7 million tonnes of food waste are produced annually, with households contributing over 60% of this figure (Xameerah et al., 2024). This waste incurs financial losses of around £250 per individual and contributes to about 3% of the UK's greenhouse gas emissions (Waste and Resources Action Program, 2023). Effective waste management is therefore crucial. This project uses Machine Learning and mapping techniques to estimate waste quantities at a local level, aiming to optimize waste management resources.

Project overview

Every year, about 10.7 million tonnes of food waste are produced in the United Kingdom, and households contribute more than 60% of this (Xameerah et al., 2024). This amounts to approximately £250 worth of wasted food per individual each year. Beyond the financial costs, the environmental and social costs are just as devastating. According to the Waste and Resources Action Program (WRAP), food waste accounts for approximately 3% of the UK’s total greenhouse gas emissions (Waste and Resources Action Program, 2023). There is also an urban planning dimension: growing urban populations require changes in waste management policies and plans, especially at the level of Local Waste Authorities. In the UK, the National Planning Policy for Waste (Ministry of Housing, 2014) provides a waste management framework for Local Authorities to create waste management and collection plans that align with broader sustainability goals. However, it is left to each local authority to assess and implement changes to waste management and collection as it sees fit. Ideally these assessments should be built on robust data analysis and prediction, but this is not always the case.

This project uses Machine Learning and mapping techniques to provide a reference framework that local authorities can use to estimate waste quantities in local areas and optimize the resources allocated to waste management and collection. The project has two stages. Stage one involves building the model: significant variables contributing to household waste generation are identified and a statistical model is built that generates small area estimates from a sample survey dataset. Stage two involves model testing, estimation and map building: the model is tested on data different from those used in model building, and a map of estimated waste generation across London output areas is developed.

The scope of the project is initially limited to London. Specifically, the model is built using data from West London and tested on data from other London output areas. While there are significant prospects for scaling it to other parts of the country, this would likely require sample surveys from areas outside London.

Data and Methods

We worked in collaboration with data partners at Integrated Skills Limited (ISL), a waste management company, to get access to residual waste composition data at the output area household level. The initial dataset contained waste collection and composition data from 5 local authority district areas and a total of 53 output areas under the jurisdiction of the West London Waste Authority (WLWA), including Brent, Ealing, Harrow, Hounslow, Hillingdon and Richmond. The data were collected at different points in time (June, September and December) across a three-year period from 2021 to 2023, with locations sampled at multiple time points over the period, a technique known as data pooling.

To make the data as representative as possible, households were selected for inclusion based on the London Output Area Classification (LOAC) (Authority, 2011), an open-source, publicly available classification provided by the ONS that classifies output areas in London by Supergroup, Group and Subgroup. Houses were sampled in each area so as to represent the entire area as closely as possible. For example, in Brent, five LOAC samples were identified as representative. These covered LOAC Supergroups C, E and G, which make up 84% of all Brent households.

For the purposes of this project, it was necessary to aggregate the data from the various areas and years, using a composite weighted average to boost the sample size for each selected output area. For example, one area had a sample size of 10 households in 2021 with an avoidable food waste per household rate of 33%, and a sample size of 30 households in 2023 with a rate of 35%. For this area, the final representation of avoidable food waste in the dataset was arrived at using the following formula:

Composite food waste: (10/(10+30) * 33) + (30/(10+30) * 35) = 34.5%
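
The same calculation can be written as a small function. This is a minimal sketch of a sample-size-weighted average; the function name `composite_rate` and the input layout are illustrative assumptions, not the project's actual code.

```python
def composite_rate(samples):
    """Sample-size-weighted average of per-period rates.

    `samples` is a list of (households, rate_percent) pairs, one per
    collection period (hypothetical layout, chosen for illustration).
    """
    total_households = sum(n for n, _ in samples)
    if total_households == 0:
        raise ValueError("at least one household is required")
    return sum(n * rate for n, rate in samples) / total_households


# Worked example from the text: 10 households at 33% and 30 households at 35%.
composite = composite_rate([(10, 33.0), (30, 35.0)])
print(f"{composite:.1f}%")  # 34.5%
```

Weighting by the number of households means the larger 2023 sample pulls the composite closer to 35% than a simple average of the two rates (34%) would.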

To make the data suitable for further analysis and modelling, it was also necessary to augment it with publicly available, open-source data from the 2021 census, provided by the ONS via Nomis, which publishes statistics related to population, society and the labour market (Statistics, 2023). Features were identified and added to the dataset based on literature reviews of important socio-economic factors influencing household waste generation (Wang, Lu and Liu, 2021; Lozano Lazo, Bojanic Helbingen and Gasparatos, 2023; Gharagozloo and Ghazizade, 2023; Khadka et al., 2021). The features identified as most important in the literature include the age of individuals in the output area, tenure status (i.e. the percentages of houses that were owned versus rented), ethnicity, the percentage of dependent children in the output area and levels of unemployment, as seen in Table 1 below.

The individual observations in the dataset, which initially represented streets within the output areas, were then aggregated to represent output areas using a postcode to output area lookup dataset (Statistics, 2023). The unit of representation was converted because the output area is the smallest unit of measurement in Census datasets, so the conversion simplified later steps in the pipeline.
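
A street-to-output-area aggregation of this kind could look like the sketch below, using pandas. All column names, postcodes and values are hypothetical, and the household-weighted mean is an assumption: the text does not state how street-level percentages were combined.

```python
import pandas as pd

# Hypothetical street-level observations (illustrative values only).
streets = pd.DataFrame({
    "postcode": ["AA1 1AA", "AA1 1AB", "AA2 2AA"],
    "households": [10, 30, 20],
    "food_waste_pct": [33.0, 35.0, 40.0],
})

# Hypothetical postcode to output area lookup.
lookup = pd.DataFrame({
    "postcode": ["AA1 1AA", "AA1 1AB", "AA2 2AA"],
    "output_area": ["OA_1", "OA_1", "OA_2"],
})


def aggregate_to_output_area(streets, lookup):
    """Roll street rows up to output areas with a household-weighted mean."""
    # validate= raises if a postcode maps to more than one output area.
    merged = streets.merge(lookup, on="postcode", how="left",
                           validate="many_to_one")
    merged["weighted"] = merged["food_waste_pct"] * merged["households"]
    grouped = merged.groupby("output_area", as_index=False).agg(
        households=("households", "sum"),
        weighted=("weighted", "sum"),
    )
    grouped["food_waste_pct"] = grouped["weighted"] / grouped["households"]
    return grouped.drop(columns="weighted")


oa_level = aggregate_to_output_area(streets, lookup)
print(oa_level)
```

Once the rows are keyed by output area, census tables published at output area level can be joined on the same key.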

 

Variables in the Original Dataset: LOAC, Area, Local Authority District, Postcode, Households, Total Food Waste

Variables in the Augmented Dataset: LOAC, Area, Local Authority District, Postcode, Households, Total Food Waste, Over 65 (Nomis table: TS007B), Owned (Nomis table: TS054), Ethnicity (BAME) (Nomis table: TS021), Detached (Nomis table: TS044), Terraced (Nomis table: TS044), Dependant Kids (Nomis table: TS003), Unemployment (Nomis table: TS065), Qualifications (Education Level) (Nomis table: TS067)

Table 1: Data Structure-Pre and Post Augmentation

Our initial plan was to approach the problem using statistical spatial micro-simulation. However, this approach was not feasible because the initial data were at street level rather than household level, which limited the effectiveness of micro-simulation techniques. As a result, we pivoted to a different method of statistical analysis. The methodology comprised three phases: data preprocessing, model building and model testing. Data preprocessing involved data augmentation to build a robust dataset, standardisation and normalisation of features (Figure 2 below), and the identification and removal of variables with high collinearity. The final correlation matrix is seen in Figure 1 below.

Figure 1: Plot showing correlations between variables used in the model. Long Description: This plot shows correlations after variables with multicollinearity have been handled.

Figure 2a: Plot showing data distribution before the application of data transformation and normalisation procedures. Long Description: This plot highlights the non-normality of the variables; normality is one of the assumptions of parametric statistical models and needs to be met.

Figure 2b: Plot showing data distribution after the application of data transformation and normalisation procedures. Long Description: This plot highlights the relative normality of the variables; normality is one of the assumptions of parametric statistical models and needs to be met.
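
The collinearity and standardisation steps described above can be sketched in NumPy as follows. The 0.8 correlation threshold, the greedy keep-first rule and the synthetic features are assumptions made for illustration; the text does not state which criterion was actually used.

```python
import numpy as np


def drop_collinear(X, names, threshold=0.8):
    """Greedily keep a feature only if its absolute correlation with every
    feature already kept is below `threshold` (assumed value)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]


def standardise(X):
    """Rescale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


# Synthetic features: "b" is almost a copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + 0.01 * rng.normal(size=200)
c = rng.normal(size=200)
X = np.column_stack([a, b, c])

X_kept, kept_names = drop_collinear(X, ["a", "b", "c"])
X_scaled = standardise(X_kept)
print(kept_names)  # ['a', 'c']
```

Which member of a correlated pair is dropped depends on column order here; in practice that choice is often guided by domain knowledge instead.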

For the model building phase, we adopted a comparative approach, comparing multiple models and picking the best performing one. We compared the Ordinary Least Squares (OLS) regression module from the statsmodels Python package, the Gradient Descent algorithm, the Moore-Penrose Pseudo-Inverse and Extreme Gradient Boosting (XGBoost), which is a boosting algorithm. The procedures and results of these models are outlined below:

Ordinary Least Squares Regression (OLS): We used the OLS regression module from statsmodels, an open-source statistical Python package. The convention in model development is to split the dataset into a train and test set and have a separate validation dataset. However, the small size of the dataset meant that splitting the data into two sets would leave even fewer observations for training and could result in an overfitted model. We therefore used the full dataset for training and built a validation dataset comprising other output areas in London, on which the model was tested.

For preprocessing, we tried three approaches to transforming the dataset: leaving it untransformed, applying a log transformation and applying the Yeo-Johnson transformation. We evaluated each on its ability to predict the median and mimic the original spread of the training data. Under all three approaches, the model’s predictions regressed towards the mean; that is, predictions clustered towards the centre of the distribution and much of the variability was left uncaptured.
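
For readers unfamiliar with the Yeo-Johnson transformation, the sketch below implements its standard definition for a fixed power parameter `lam`. It is illustrative only: the text does not say which library or parameter values were used, and in practice `lam` is usually estimated from the data (for example by `scipy.stats.yeojohnson` or scikit-learn's `PowerTransformer`).

```python
import numpy as np


def yeo_johnson(values, lam):
    """Apply the Yeo-Johnson power transformation with a fixed lambda."""
    y = np.asarray(values, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0

    # Non-negative inputs: Box-Cox-like transform of (y + 1).
    if abs(lam) > 1e-12:
        out[pos] = ((y[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(y[pos])

    # Negative inputs: mirrored transform with power (2 - lambda).
    if abs(lam - 2.0) > 1e-12:
        out[~pos] = -(((1.0 - y[~pos]) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-y[~pos])
    return out


# A right-skewed synthetic sample; lambda = 0 reduces to log(1 + y) here.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
transformed = yeo_johnson(skewed, lam=0.0)
print(round(float(np.median(skewed)), 3), round(float(np.median(transformed)), 3))
```

Unlike a plain log transformation, this definition also handles zero and negative values, which is one reason it is often tried alongside the log scale.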

Gradient Descent: The gradient descent algorithm, in this case batch gradient descent, computes the gradients of the cost function, here Mean Squared Error (MSE), with respect to the model’s parameters using all the training observations at once (Ruder, 2016). It is similar to the OLS approach in that the model parameters are the same, but it differs in how it arrives at optimal values for them. While OLS uses a one-time, non-iterative solution, gradient descent iteratively updates the parameters until the gradient of the cost function is approximately zero. Despite being iterative, this approach did not perform much better than OLS in capturing the spread of the data around the mean.
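
A minimal NumPy sketch of batch gradient descent for linear regression with an MSE cost is shown below. The learning rate, iteration cap, tolerance and synthetic data are illustrative assumptions, not the settings used in the project.

```python
import numpy as np


def batch_gradient_descent(X, y, lr=0.1, n_iter=5000, tol=1e-10):
    """Fit intercept and slopes by minimising MSE over the full dataset."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        residual = Xb @ w - y
        grad = (2.0 / len(y)) * (Xb.T @ residual)  # gradient of MSE w.r.t. w
        if np.linalg.norm(grad) < tol:             # stop once gradient is ~0
            break
        w -= lr * grad
    return w


# Synthetic, roughly standardised data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)

w_gd = batch_gradient_descent(X, y)

# Closed-form least-squares solution for comparison.
w_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)
print(np.round(w_gd, 3), np.round(w_ols, 3))
```

Because both methods minimise the same cost over the same parameters, they converge to essentially the same coefficients, which is consistent with gradient descent showing no real improvement over OLS.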

Moore-Penrose Pseudo-Inverse (Pseudo-Inverse): The Moore-Penrose Pseudo-Inverse is a linear algebra method that solves systems of equations by extending matrix inversion to more cases (Holbrook, 2019). For a matrix to be inverted, it must be square and have linearly independent columns; however, this is not always the case in real-life data science applications. The Pseudo-Inverse still makes a generalised form of inversion possible in such cases. It is used to solve linear systems by finding a vector of solutions that brings the prediction as close as possible to the target outcome (y). Thus, by analogy with the multiplication property of invertible matrices:

$w = X^{+} y$

where X is the matrix of input values from the independent variables, $X^{+}$ is the Pseudo-Inverse, y is the target outcome vector and w is the vector of coefficients for which Xw is as close as possible to y in Euclidean distance. With this method, although there was a slight improvement over the previous methods, the regression towards the mean was still present.
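
The pseudo-inverse solution can be sketched with `numpy.linalg.pinv`. The data below are synthetic, and the explicit `rcond=1e-10` cut-off in the second part is an assumption chosen so that the numerically zero singular value of the duplicated column is discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # 50 observations, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

# X is not square, so it has no ordinary inverse; the pseudo-inverse
# still yields the least-squares coefficients.
w = np.linalg.pinv(X) @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 3), np.round(w_lstsq, 3))

# With a duplicated column the columns are linearly dependent, yet the
# pseudo-inverse still returns a (minimum-norm) solution.
X_dup = np.column_stack([X, X[:, 0]])
w_dup = np.linalg.pinv(X_dup, rcond=1e-10) @ y
print(np.round(w_dup, 3))
```

In the duplicated-column case the fitted values are unchanged; the weight of the repeated feature is simply shared between its two copies.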

XGBoost: The XGBoost algorithm can be viewed as an iterative decision tree algorithm: each new tree is fitted to reduce the loss left by the previous trees in the chain, until the best prediction is arrived at (Chen and Guestrin, 2016). This iterative nature makes the algorithm prone to overfitting; however, the ‘max depth’ parameter helps curb this by controlling how deep each individual decision tree can grow, thereby balancing the bias-variance trade-off. As with most machine learning algorithms, performance is heavily dependent on the hyperparameter combination. To arrive at the best possible combination of hyperparameters, ‘GridSearchCV’ was used, as seen in Figure 3 below; the model specification is detailed in Figure 4 below.

Figure 3: Figure showing code used to arrive at best hyperparameter configurations for the XGB algorithm. Long Description: Choosing the right hyperparameters for a model is crucial and this step helps achieve that.

 

Figure 4: Figure specifying code used to run the XGB model. Long Description: Results from figure 3 are used here to run the model and get accurate results.
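
The code in Figures 3 and 4 is not reproduced in this text. To illustrate the boosting idea described above, the sketch below implements plain gradient boosting on squared error with depth-1 trees (stumps) in NumPy. It is a simplified stand-in and not XGBoost itself, which adds regularisation, second-order gradient information and deeper trees; all names, data and settings are illustrative assumptions.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the one-feature threshold split that best fits the residuals."""
    best = None
    for j in range(x.shape[1]):
        for t in np.unique(x[:, j])[:-1]:  # skip max so the right side is non-empty
            left = x[:, j] <= t
            lm, rm = residual[left].mean(), residual[~left].mean()
            sse = ((residual[left] - lm) ** 2).sum() + ((residual[~left] - rm) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    j, t, lm, rm = stump
    return np.where(x[:, j] <= t, lm, rm)


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the residuals left by earlier rounds."""
    base = float(y.mean())
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Synthetic non-linear data (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(80, 2))
y = np.sin(x[:, 0]) + 0.5 * x[:, 1] + 0.1 * rng.normal(size=80)


def training_mse(n_rounds):
    model = boost(x, y, n_rounds=n_rounds)
    return float(np.mean((predict(model, x) - y) ** 2))


print(training_mse(0), training_mse(5), training_mse(50))
```

Training error keeps falling as rounds are added, which is why tree depth, learning rate and number of rounds need to be tuned against held-out data, for example with a cross-validated grid search as the text describes.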

When compared against the observed data distribution, the predicted distribution closely follows the patterns of the original without regressing towards the mean, as seen in Figure 5 and Table 2 below. This provides confidence in the ability of the model to be applied to previously unseen data.

Figure 5: Boxplot comparing distributions between the actual/observed data and the models’ predictions. Long Description: This shows the models’ predictions are closely related to the actual values without being regressed to the mean.

Metric Observed Predicted
Median 33.3 % 33.8 %
Mean 34.3 % 34.8 %
Inter-Quartile Range 10.1 % 10.1 %

Table 2: Comparison of summary statistics (central tendency and spread) for Actual and Predicted distributions (XGBoost)
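
The three statistics in Table 2 can be computed for any pair of observed and predicted arrays as sketched below. The helper name `summarise` and the sample values are hypothetical; the figures in Table 2 come from the project's own data, not from this snippet.

```python
import numpy as np


def summarise(values):
    """Return median, mean and inter-quartile range of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "median": float(median),
        "mean": float(values.mean()),
        "iqr": float(q3 - q1),
    }


# Hypothetical observed and predicted food waste percentages.
observed = [28.0, 31.5, 33.0, 36.0, 41.0]
predicted = [29.0, 32.0, 33.5, 35.5, 40.0]

for label, sample in (("Observed", observed), ("Predicted", predicted)):
    print(label, summarise(sample))
```

Comparing the inter-quartile range alongside the median and mean is what reveals regression towards the mean: a model can match the centre of the distribution while predicting a much narrower spread.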

Key Findings

Having compared multiple approaches to solving the least squares regression problem in the provided context, the XGBoost algorithm performed the best and as such was adopted for utilization in the second part of the project involving estimating food waste amounts in other London areas.

Figure 6: Plot showing estimates of food waste distribution across London output areas. Long Description: The plot shows relatively normal distribution with some areas having a greater percentage of waste composition being food waste and some areas having less.

Figure 6 above shows the predicted estimates of food waste across all other London output areas. The distribution is roughly bell-shaped but has three peaks. It is worth investigating these peaks further to identify possible demographic or geographic similarities responsible for their occurrence.

First, separating the extreme peaks (i.e. areas where the food waste content of bins is below 29% and areas where it is above 42%) gives two Gaussian-like distributions, as in Figure 7 below. Our initial guess had been that these distributions would be localised to certain parts of the city. However, upon investigation, the values in these distributions come from multiple Output Areas across various parts of London, as seen in the maps in Figure 8 below.
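
Separating the two extreme groups amounts to a pair of threshold filters. The sketch below uses the 29% and 42% cut-offs from the text; the function name and the input values are hypothetical.

```python
import numpy as np

LOW_CUTOFF = 29.0   # areas below this share of food waste
HIGH_CUTOFF = 42.0  # areas above this share of food waste


def split_tails(food_waste_pct):
    """Return the (low tail, high tail) of a set of estimated percentages."""
    values = np.asarray(food_waste_pct, dtype=float)
    return values[values < LOW_CUTOFF], values[values > HIGH_CUTOFF]


# Hypothetical output area estimates.
low_tail, high_tail = split_tails([25.0, 29.0, 35.0, 42.0, 45.0])
print(low_tail, high_tail)  # [25.] [45.]
```

Values exactly on a cut-off fall into neither tail, matching the strict "below" and "above" wording used in the text.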

Figure 7a: Plot showing distribution of food waste in areas with more than 42% of total waste composition being food waste. Long Description: The plot shows very few output areas have more than half of their waste composition being food waste.

Figure 7b: Plot showing distribution of food waste in areas with less than 29% of total waste composition being food waste. Long Description: The plot shows very few output areas have less than a quarter of their waste composition being food waste.

Figure 8a: Map showing distribution of areas highlighted in figure 7a. Long Description: This map highlights the fact that these areas are not localised to one part of London over another.

 

Figure 8b: Map showing distribution of areas highlighted in figure 7b. Long Description: This map highlights the fact that these areas are not localised to one part of London over another.

Figure 9a: Boxplot showing comparison of variable distributions for areas in 8a.

Figure 9b: Boxplot showing comparison of variable distributions for areas in 8b.

Looking at both distributions in Figure 9 above, we can see differences in several variables; further statistical tests would be required to ascertain the significance of these differences. We also compared the areas present in both peaks and found that most Local Authority Districts (LADs) were represented in both extremes, except for Camden, which was absent from the below 29% group. An investigation into the distributions of independent variables in Camden revealed that the median values for home ownership (i.e. the proportion of homes where the residents own the property), the number of terraced and detached properties, and age (i.e. the proportion of people aged over 65) were lower in Camden than in other areas, as described in Figure 10 below. The model’s estimation of Camden being in the above 42% group of food waste production could be linked to any combination of these variables. For example, certain literature sources have linked age to food waste, with older people being more averse to food waste than younger people (WRAP, 2014).

Figure 10: Boxplot showing comparison of variable distributions for Camden.
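
The check for districts that appear in only one extreme, described above, is essentially a set difference. The sketch below uses placeholder district names and made-up percentages; it illustrates the comparison and does not reproduce the project's results.

```python
def lads_by_tail(records, low=29.0, high=42.0):
    """Collect the districts that have output areas in each tail.

    `records` is a list of (district, estimated_pct) pairs, one per
    output area (hypothetical layout).
    """
    low_lads = {lad for lad, pct in records if pct < low}
    high_lads = {lad for lad, pct in records if pct > high}
    return low_lads, high_lads


# Placeholder districts and values, for illustration only.
records = [
    ("LAD_A", 27.0), ("LAD_A", 44.0),
    ("LAD_B", 45.5), ("LAD_B", 35.0),
    ("LAD_C", 28.0), ("LAD_C", 43.0),
]

low_lads, high_lads = lads_by_tail(records)
only_high = sorted(high_lads - low_lads)
print(only_high)  # ['LAD_B']
```

A district that shows up only in the high group, as LAD_B does here, is a natural candidate for the kind of closer look at its independent variables described in the text.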

It is important to note that the model’s estimates are heavily limited by the amount of data used to train it, and further improvements are necessary in this regard. However, the model attaches a feature importance to each of the variables used in training, as seen in Figure 11 below, and Local Authorities and other researchers may wish to take these into consideration for further analysis and research. For example, the model attaches high importance to age structure, which agrees with the observations reported by WRAP (2014) about the relationship between age and food waste.

Figure 11: Plot showing feature importance for variables used in the model. Long Description: This plot highlights the variables the model considered more important than others when learning the data. From this we can see that higher importance is attached to age structure and Tenureship than other variables.

Finally, to aid Local Authorities in demand forecasting, a map of food waste estimates across London Local Authority Districts and Output Areas is provided below in Figure 12.

Figure 12: Map showing distribution of food waste estimates across the output areas in London.

Value of the research

This research provides crucial insights into the socio-economic factors that drive household food waste generation, offering a foundation for more effective waste management strategies. By analysing how different socio-economic variables—such as qualification levels, employment status, and housing tenure—affect the amount and type of waste generated, this study helps to identify key areas where interventions can be targeted.

The findings will be invaluable for local authorities and policymakers aiming to design waste management systems that are not only efficient but also equitable, considering the diverse needs and behaviours of different communities. Furthermore, by using advanced data analysis and machine learning techniques, the research sets a precedent for future studies in waste management, enabling more precise forecasting and better resource allocation. Ultimately, this work contributes to broader environmental sustainability goals by helping to reduce waste, lower greenhouse gas emissions, and promote more sustainable urban living practices.

Insights

  • Urbanization has led to a significant increase in municipal solid waste, with the UN projecting waste generation to double by 2025.
  • In the UK, households are responsible for over 60% of the 10.7 million tonnes of food waste produced annually, contributing substantially to greenhouse gas emissions.
  • Waste generation patterns vary based on socio-economic factors such as age, home ownership status, ethnicity, and levels of unemployment within different output areas.
  • Machine learning techniques, combined with robust data analysis, can provide local authorities with better tools for predicting waste quantities and optimizing waste management and collection.

Research theme

  • Health
  • Societies
  • Environment

Programme theme

  • Statistical Data Science
  • Data Science Infrastructures

People

Favour Aghaebe – Data Scientist, Leeds Institute for Data Analytics, University of Leeds.

Dr. William H. James – Lecturer in Geographical Information Science

Prof. Nik Lomax – Professor of Population Geography

Partners

Integrated Skills Limited (ISL)

Funders

Consumer Data Research Centre (CDRC)

References

Authority, G. L. (2011) 2011 London Output Area Classification - London Datastore: Gov.uk. Available at: https://data.london.gov.uk/dataset/london-area-classification (Accessed: 06 August 2024).

Chen, T. and Guestrin, C. (2016) 'XGBoost: A scalable tree boosting system'. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794.

Gharagozloo, S. and Ghazizade, M. J. (2023) 'The Influence of Socio-Economic and Psychological Factors on the Composition of Household Solid Waste in Farahzad Neighborhood, Tehran, Iran', Environmental Health Insights, 17, pp. 11786302231195794.

Holbrook, R. (2019) Least Squares with the Moore-Penrose Inverse. Available at: https://mathformachines.com/posts/least-squares-with-the-mp-inverse/ (Accessed: 25th August 2024).

Khadka, R., Safa, M., Evans, A., Birendra, K. C. and Poudel, R. (2021) 'Factors influencing municipal solid waste generation and composition in Kathmandu metropolitan city, Nepal'.

Lozano Lazo, D. P., Bojanic Helbingen, C. and Gasparatos, A. (2023) 'Household waste generation, composition and determining factors in rapidly urbanizing developing cities: case study of Santa Cruz de la Sierra, Bolivia', Journal of Material Cycles and Waste Management, 25(1), pp. 565-581.

Ministry of Housing, C. a. L. g. (2014) National Planning Policy for Waste. United Kingdom: GOV.UK. Available at: National planning policy for waste - GOV.UK (www.gov.uk) (Accessed: 06 August 2024).

Ruder, S. (2016) 'An overview of gradient descent optimization algorithms', arXiv preprint arXiv:1609.04747.

Statistics, O. f. N. (2023) Get labour market and population data for areas within the UK - Office for National Statistics. United Kingdom: Office for National Statistics. Available at: https://www.ons.gov.uk/help/localstatistics (Accessed: 06 August 2024).

UN Habitat (2019) Solid Waste Management in Cities. Available at: https://unhabitat.org/sites/default/files/2019/02/Indicator-11.6.1-Training-Module_Solid-waste-in-cities_23-03-2018.pdf.

Wang, K., Lu, J. and Liu, H. (2021) 'Influential factors affecting the generation of kitchen solid waste in Shanghai, China', Journal of the Air & Waste Management Association, 71(4), pp. 501-514.

Waste and Resources Action Program, W. (2023) Household Food and Drink Waste in the United Kingdom 2021-22. Available at: https://wrap.org.uk/resources/report/household-food-and-drink-waste-united-kingdom-2021-22#download-file (Accessed: 01/05/2024).

WRAP, W. a. R. A. P. (2014) 'Household food & drink waste: A people focus', Waste and Resources Action Programme, 1, pp. 2.

Xameerah, M., Louise, S., Iona, S. and Nuala, B. (2024) Food Waste in the UK, London. Available at: CBP-7552.pdf (parliament.uk) (Accessed: 01/05/2024).