In the USA, the CDC (Centers for Disease Control and Prevention) compiles data on national vaccine uptake, but reporting can sometimes be delayed. Surveys that measure attitudes and behaviour towards COVID-19 vaccines can fill the gap when real-time data lag, and can inform government responses to the pandemic. However, some surveys diverge substantially in their findings.
The authors of a new paper published in Nature today find that in May 2021, Delphi–Facebook’s COVID-19 symptom tracker (250,000 responses per week) overestimated vaccine uptake by 17 percentage points, and a survey from the US Census Bureau (75,000 responses per wave) overestimated vaccine uptake by 14 percentage points compared to benchmark estimates from the CDC.
These overestimation errors are orders of magnitude larger than the statistical uncertainty the surveys report. A survey by Axios–Ipsos also overestimated uptake, but by a much smaller margin (4.2 percentage points in May 2021), despite being the smallest of the three (about 1,000 responses per week). These findings indicate that bigger is not always better when it comes to datasets, if we fail to account for data quality.
The authors note that COVID-19 vaccine uptake was not the primary focus of any of the surveys; Delphi–Facebook, for example, is designed to track changes in COVID-related behaviours over time. However, the bias in the two large surveys’ estimates of vaccine uptake indicates that their samples are not representative of the US adult population, and the authors suggest that this lack of statistical representativeness may also bias other survey outcomes.
The research presented in the paper suggests that design choices in survey data collection can lead to inaccuracies that sheer sample size cannot overcome; for example, the three surveys use different recruitment methods, which may introduce different biases into their estimates. The authors conclude that efforts to measure data quality and to improve the accuracy of vaccine-uptake estimates are needed to better inform public policy decisions.
Co-author Seth Flaxman from the University of Oxford, Department of Computer Science, comments, ‘Human behaviour is key to Covid-19 pandemic response. Who is wearing masks? Have they gotten the message that Covid is airborne? For those who haven't been vaccinated, what reasons do they give – hesitancy, access, or lack of trustworthy information? Surveys are our primary source of information to answer these questions. But when data is collected online, especially through mobile apps, we should always ask: how do we know whether this data is representative of the population?
‘Trump voters were famously less likely to respond to opinion polls before the 2016 election – do we see something similar with those who are vaccine hesitant? There is an urban/rural divide in terms of Internet access – and also vaccine access. Who is being left out of vaccine outreach campaigns guided by online surveys?
‘Our study is the first to use a widely available benchmark – how many people have received the Covid-19 vaccine? – to evaluate the accuracy of online surveys. We find that in the US, the two largest surveys, from Facebook and the Census Bureau, significantly overestimated vaccine uptake despite enormous sample sizes. By contrast, a smaller survey from Ipsos was very accurate. Bigger isn't always better.
‘Online surveys supply a plethora of data on everything from Covid-19 symptoms to in-person schooling to social distancing behaviour. We're trying to raise the alarm that data quality can be a huge issue, and even random samples – the gold standard for surveys – can still have a lot of bias because of who is willing or able to respond.’
Co-first author Valerie Bradley from the University of Oxford, Department of Statistics, comments, ‘It is well known that it’s important to adjust for unrepresentativeness in traditional surveys; however, there isn’t the same sense of urgency around adjusting for unrepresentativeness in large surveys, or Big Data more broadly. It’s easy to get lulled into thinking that a large sample size will overcome bias due to unrepresentativeness, when in fact we know the opposite is true – large sample size actually magnifies any bias in data collection. This is the Big Data Paradox.
‘Large datasets, like the Delphi–Facebook and Census Household Pulse surveys, can still be incredibly useful and can enable types of analysis that are impossible with smaller datasets, but they still require quality control during data collection and careful adjustment after the data are collected to ensure the accuracy of estimates.’
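The Big Data Paradox that Bradley describes has a precise form in the error decomposition of Meng (2018), on which the paper builds. The following is a minimal sketch of that decomposition in LaTeX, reproduced here for illustration rather than taken from the press materials:

    % Meng's (2018) exact decomposition of the error of a sample mean:
    %   estimation error = data quality x data quantity x problem difficulty
    \[
      \bar{Y}_n - \bar{Y}_N
        = \hat{\rho}_{R,Y}
          \times \sqrt{\frac{N-n}{n}}
          \times \sigma_Y
    \]
    % \bar{Y}_n: mean over the n respondents; \bar{Y}_N: true mean of the
    % population of size N; \sigma_Y: standard deviation of the outcome;
    % \hat{\rho}_{R,Y}: the "data defect correlation" between the outcome Y
    % and the indicator R of turning up in the sample. Under simple random
    % sampling \hat{\rho}_{R,Y} is of order 1/\sqrt{N}, but under
    % self-selection it stays roughly constant, so the error is inflated by
    % a factor of about \hat{\rho}_{R,Y}\sqrt{N} relative to a random
    % sample; that is how a large population magnifies a small bias.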
Xiao-Li Meng, senior co-author at Harvard University, says, ‘The magic power of surveys – being able to learn reliably about a population of hundreds of millions from a few hundred individuals – comes from the key assumption of being representative, much like a spoonful or two is all we need to determine the saltiness of a soup, regardless of the size of its container, if we have stirred the soup well. But when this "stirred-well" assumption fails, which is the case for most social media data because users are not randomly selected to join social media groups, we can show mathematically that surveys will lose essentially all of their magic.
‘Worse, an unrepresentative big survey can seriously mislead because of the false confidence created by its size, as we demonstrate in this article – answers from 250,000 Facebook users can inform us about the US vaccination rate no more reliably than 10 randomly selected individuals from the same population.’
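To see how such an equivalence can arise, consider a toy simulation. It is illustrative only and is not the paper's analysis: the population size, true vaccination rate and response probabilities below are invented for the sketch.

    import numpy as np

    rng = np.random.default_rng(42)

    # Invented numbers, for illustration only.
    N = 10_000_000        # notional adult population
    p_true = 0.55         # true vaccination rate
    vaccinated = rng.random(N) < p_true

    # Self-selected "big survey": vaccinated adults are modestly more
    # likely to respond; a defect invisible to the survey itself.
    response_prob = np.where(vaccinated, 0.032, 0.018)
    responded = rng.random(N) < response_prob

    n_big = int(responded.sum())                  # roughly 250,000 responses
    big_est = vaccinated[responded].mean()
    big_ci = 1.96 * np.sqrt(big_est * (1.0 - big_est) / n_big)

    # A tiny simple random sample from the same population.
    small = vaccinated[rng.choice(N, size=10, replace=False)]

    print(f"true rate:               {p_true:.3f}")
    print(f"big survey (n={n_big:,}): {big_est:.3f} +/- {big_ci:.3f}")
    print(f"random sample (n=10):    {small.mean():.3f}")

The big survey reports a very tight interval around the wrong answer (here roughly 13 points too high), while the ten random draws are noisy but unbiased; over repeated runs the two end up comparably far from the truth, which is the Big Data Paradox in miniature.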