What do we know about the accuracy of secondary food environment data, and what do we still need to discover?

Date: Monday 19 September 2016

Emma Wilkins is a Researcher in Obesity, Nutrition, Data Analytics and Health Geography. Based at Leeds Beckett University and the University of Leeds (Leeds Institute for Data Analytics) since February 2016, her research focuses on the use of ‘Big Data’ to understand how the built environment affects diet and obesity.

Policymakers and researchers have recently been exploring the ‘food environment’ (broadly defined as the opportunities available to access food), and the links this may have with diet and obesity. This is an exciting concept, because it suggests food environments can be designed to promote health. However, in order to establish whether such links exist and to test the effectiveness of any food environment modifications, we need to be able to measure the food environment.

While there are several ways that the food environment can be measured or characterised, the most common technique uses Geographic Information Systems to quantify the spatial accessibility of food outlets. This might be expressed as, for example, the density of food outlets within an area, or the proximity of outlets to homes or workplaces.
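To make the two measures concrete, here is a minimal sketch of how density and proximity might be calculated. It assumes outlet and home locations are already in a projected coordinate system measured in metres (such as British National Grid) and uses straight-line distance rather than distance along the road network. The coordinates, the 500 m buffer and the function names are all invented for illustration.

```python
import math

# Hypothetical outlet locations as (easting, northing) pairs in metres.
# The values are invented purely for illustration.
outlets = [(0.0, 0.0), (300.0, 400.0), (1200.0, 0.0), (50.0, 50.0)]
home = (0.0, 100.0)


def distance(a, b):
    """Straight-line (Euclidean) distance in metres between two points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def count_within(point, outlets, radius_m):
    """Number of outlets inside a circular buffer around the point."""
    return sum(1 for o in outlets if distance(point, o) <= radius_m)


def density_per_km2(point, outlets, radius_m):
    """Outlets per square kilometre within the circular buffer."""
    area_km2 = math.pi * (radius_m / 1000.0) ** 2
    return count_within(point, outlets, radius_m) / area_km2


def nearest_distance(point, outlets):
    """Distance in metres from the point to its closest outlet."""
    return min(distance(point, o) for o in outlets)


print(count_within(home, outlets, 500))               # outlets within 500 m
print(round(density_per_km2(home, outlets, 500), 2))  # outlets per km^2
print(round(nearest_distance(home, outlets), 1))      # metres to nearest outlet
```

In practice a GIS package would handle the geometry, and many studies use road network distances instead, but the underlying quantities are the same.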

In order to identify all food outlets within a study area, some researchers opt to conduct street audits (i.e. an auditor physically walks the streets of the study area and locates all outlets). This is considered the most accurate, ‘gold standard’ method. However, as street audits are expensive and time-consuming, a far more common approach is to use ‘secondary data’, i.e. data that was collected for purposes other than food environment research. In the UK, common secondary data sources include data from Ordnance Survey (the UK’s national mapping agency), data collected by Local Authorities for the purpose of conducting food hygiene inspections, and data from business directories such as Yellow Pages or Thomson Local.

Is this secondary data accurate?

A major concern with the use of secondary data is its accuracy. As secondary data is collected by third parties, the researcher cannot ensure the accuracy of the data collection and recording processes. Inaccurate data makes it harder to detect environmental effects, and given that any such effects are likely to be small, the accuracy of data is very important. Additionally, if inaccuracies are non-random (e.g. if data tends to be less complete in deprived areas), then research findings may be biased, leading us to conclude, for example, that an association exists when it does not. Indeed, a recent study in the US (Mendez et al., 2016) has shown that the use of different data sources can lead to different conclusions about the associations between food outlet density and area-level demographics, with the likelihood of detecting a significant association being dependent on the data source.

Unfortunately, we know relatively little about the accuracy of commonly used data sources in the UK. Food Hygiene data has been the most investigated (Cummins & Macintyre, 2009; Lake et al., 2010; Lake et al., 2012), and it seems that this data source generally has good accuracy. Between 5% and 19% of outlets identified in street audits are not included in Food Hygiene data (although this figure may be as high as 40% in some environments), and between 8.5% and 21% of the Food Hygiene data comprises erroneous food outlet listings (i.e. outlets that are not present in reality). One downside of Food Hygiene data is that it is collected independently by each Local Authority. Thus, for large-scale studies the researcher must make individual requests to each Local Authority within the study area, which is time-consuming and labour-intensive. Additionally, different Local Authorities may code data differently, leading to difficulties in collating the data (Burgoine, 2010). Data that is available on a national scale may therefore be preferable to Food Hygiene data, if it is found to be of comparable accuracy.

Few studies have investigated the accuracy of alternative nationally available data sources. One study (Lake et al., 2010) found business directory data (Yellow Pages and Yell.com) to have generally poor accuracy, missing as many as 49% of outlets in the field, and containing 20.9% erroneous entries. It should be noted, however, that these figures were derived for a very broad range of food retailers, including wholesalers and suppliers. It is unclear whether the business directory would have performed better for more commonly investigated retailers, such as supermarkets and restaurants.

Only one study (Burgoine & Harrison, 2013) has investigated the accuracy of Ordnance Survey data, despite its ubiquity in UK-based research. In this study, Ordnance Survey data was compared to Food Hygiene data. The data sources exhibited moderate agreement, with the Ordnance Survey data containing 59.9% of outlets listed in the Food Hygiene data, and the Food Hygiene data listing 74.9% of outlets included in the Ordnance Survey data. As neither data source was compared to the gold standard of street audits, it is unclear how these sources compare to the true food environment.

What do we still need to find out?

Several questions still need answers, and several areas still need agreement, before we can draw conclusions regarding the validity of using secondary data to measure the food environment.

What is the accuracy of commonly used data sources on a national scale?
Existing UK studies have been done on a relatively small scale, and thus findings are highly context-specific. In order for findings to be generalised to the UK as a whole, studies are needed that assess the accuracy of secondary data across a broad range of environments.

Do data sources have variable accuracy across different types of environments or different types of food outlet?
As mentioned above, this is an important question because non-random variations in accuracy can lead to biased results. Existing data suggests that accuracy may vary between urban and rural environments and between different types of outlet (Burgoine & Harrison, 2013), but further research is needed to corroborate these findings across more diverse environments.

How should we measure accuracy?
The majority of studies assess the accuracy of data sources by calculating ‘positive predictive values’ (the proportion of outlets listed in the secondary data that are actually present on the ground, so a low value indicates many erroneous entries) and ‘sensitivities’ (the proportion of outlets present on the ground that are listed in the secondary data, so a low value indicates many missed outlets). However, it’s not clear whether we should have separate measures for erroneous and missing entries; perhaps they compensate for one another? Clary & Kestens (2013) propose an alternative accuracy measure (termed ‘representativity’) which accounts for this compensatory effect. Maybe this should be used in future research?
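As a rough illustration of the two standard measures, the sketch below computes a positive predictive value and a sensitivity from two sets of outlet identifiers, one from a secondary source and one from a street audit. The identifiers and the function name are invented, and the sketch assumes matching has already been done so that the same outlet carries the same identifier in both sets. The final `count_ratio` is not the ‘representativity’ measure of Clary & Kestens; it is only a simple way to show how erroneous and missing entries can cancel out in a total count.

```python
def accuracy_measures(secondary_ids, audit_ids):
    """Compare a secondary data source against street audit data.

    Both arguments are collections of outlet identifiers that have
    already been matched (same outlet -> same identifier).
    """
    secondary = set(secondary_ids)
    audit = set(audit_ids)

    true_pos = len(secondary & audit)   # listed and present on the ground
    false_pos = len(secondary - audit)  # listed but not found (erroneous)
    false_neg = len(audit - secondary)  # present but not listed (missed)

    return {
        # Share of listed outlets that really exist.
        "ppv": true_pos / len(secondary) if secondary else float("nan"),
        # Share of real outlets that the secondary data captured.
        "sensitivity": true_pos / len(audit) if audit else float("nan"),
        # Illustrative only: listed count relative to audited count.
        "count_ratio": len(secondary) / len(audit) if audit else float("nan"),
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }


# Invented example: one outlet is missed ("E") and one is erroneous ("X").
audit = {"A", "B", "C", "D", "E"}
secondary = {"A", "B", "C", "D", "X"}

result = accuracy_measures(secondary, audit)
print(result["ppv"], result["sensitivity"], result["count_ratio"])
```

In this invented example both the positive predictive value and the sensitivity are 0.8, yet the total number of outlets listed equals the number audited, which is exactly the kind of compensation that matters for count-based measures such as density.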

What level of accuracy do we really need?
This question relates to how we assess agreement between secondary data and street audit data. Do we consider the data to agree only if the locations and names of outlets agree exactly? Or can we take a more relaxed approach, e.g. allowing the names to be different as long as the categories are the same? How relaxed can we be with regard to the spatial accuracy of the secondary data? The answers to these questions may depend on how we are using the data and the types of analyses we are performing. For example, if we want to assess the density of outlets within an area, then the exact locations of outlets within that area don’t matter too much, as long as the total count of outlets within the area is roughly accurate. If we are measuring the proximity of outlets to a home, however, then spatial accuracy may be more important. The majority of existing UK studies have not been clear about how they have defined agreement between data sources, and have likely adopted differing approaches.
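The sketch below shows how much the choice of matching rule can matter. It counts agreements between a secondary source and a street audit under a strict rule (same name, within 10 m) and under more relaxed rules (same category, within 50 m or 100 m). The records, thresholds and function names are all invented, and the greedy first-fit matching is a simplification rather than a method taken from any of the studies cited here.

```python
import math
from typing import NamedTuple


class Outlet(NamedTuple):
    name: str
    category: str
    x: float  # easting in metres (hypothetical projected coordinates)
    y: float  # northing in metres


def is_match(a, b, max_dist_m, require_name):
    """Decide whether two records describe the same outlet."""
    if math.hypot(a.x - b.x, a.y - b.y) > max_dist_m:
        return False
    if require_name:
        # Strict rule: names must agree (ignoring case and outer spaces).
        return a.name.strip().lower() == b.name.strip().lower()
    # Relaxed rule: names may differ if the category is the same.
    return a.category == b.category


def count_matches(secondary, audit, max_dist_m, require_name):
    """Greedy one-to-one matching of secondary records to audit records."""
    unmatched = list(audit)
    matched = 0
    for record in secondary:
        for candidate in unmatched:
            if is_match(record, candidate, max_dist_m, require_name):
                unmatched.remove(candidate)  # each audit record used once
                matched += 1
                break
    return matched


# Invented records: the second differs in name, the third in location.
audit = [
    Outlet("Joe's Cafe", "cafe", 0, 0),
    Outlet("Best Fish Bar", "takeaway", 100, 0),
    Outlet("Corner Shop", "convenience", 200, 0),
]
secondary = [
    Outlet("Joe's Cafe", "cafe", 5, 0),
    Outlet("Best Fish & Chips", "takeaway", 130, 0),
    Outlet("Corner Shop", "convenience", 200, 80),
]

strict = count_matches(secondary, audit, max_dist_m=10, require_name=True)
relaxed = count_matches(secondary, audit, max_dist_m=50, require_name=False)
loose = count_matches(secondary, audit, max_dist_m=100, require_name=False)
print(strict, relaxed, loose)
```

With these invented records the same secondary data agrees with the audit for one, two or all three outlets depending on the rule chosen, which is why studies that do not report their matching criteria are hard to compare.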

What is the effect of using different data sources?
Finally, more research is needed into the effects of using different data sources on research findings. As mentioned above, one recent study from the US found that the choice of data source had quite a substantial effect on research findings. However, another study (Hobbs et al., 2016) from the UK did not find such a marked difference when comparing Food Hygiene and Ordnance Survey data, suggesting perhaps that these sources are equivalent.

This article was originally published on Emma’s blog.