Big Data, Big Bias? Evidence on the effects of selection bias in large observational studies

Posted by Robert Matthews on January 11, 2024

In this 11-minute audio, statistician Robert Matthews talks about some important new evidence about the reliability of information from large, self-selecting data sets, such as Biobanks and social media data trawling.


I’m Robert Matthews, visiting professor in statistics at Aston University in Birmingham, and also the Statistical Advisor to the James Lind Library.

One of my interests is sources of bias in databases. This has become a very hot topic recently because of the results starting to emerge from what are called biobank studies. These are huge databases of information drawn from blood samples and other sources, based on literally hundreds of thousands of people.

There’s huge interest in this because as anybody who knows anything about statistics knows, the bigger the sample the better, at least that’s the sort of broad brush message.

Advocates of this approach, which is called “Big Data”, point to the fact that the larger the sample the more precise the result you get, which sounds fine and dandy.

But there’s a bit of a problem with this, and that is what is the nature of the sample that you’re using? How have you actually drawn it from the population that you’re interested in?

So back in the day, and I’m talking like 80, 90 years ago, people had a view that, well it’d be good to have a representative sample, and they came up with various ways of achieving that. Then by the late 1940s people were beginning to realize that randomization, a random sample, is probably the gold standard.

The reason for that is if there are biases buried in there, they tend to cancel each other out if you’ve taken a random sample. That’s the basic idea.

So that all worked fine and dandy, and we’ve had some major medical breakthroughs based on the back of using randomized samples.

But they’re typically a bit on the small side. The idea behind a biobank is that you can have, as I said, 100,000 or more people drawn from a population, and therefore you’ll get a nice tight result, with low variance.

But the problem is that, because it’s not randomized (you’ve not been able to get people at random to join the biobank scheme you set up), there’s a possibility that you do have biases, and the most obvious one would be selection bias. So this has been the cloud hanging over this Big Data revolution.

So a specific example of this would be the UK Biobank project that was set up in the mid 2000s, and the idea was to recruit half a million people, and take biological samples and other data from them, and use that to scour through the database to find associations between genes and certain characteristics and health effects, one’s lifestyle practices and the resulting health effect. All sorts of really important things like that.

But not surprisingly statisticians were a bit concerned about potential biases, because it’s not a random sample. In fact it’s been accepted now by the people who set up UK Biobank that it’s not a random sample. They only had a 5% response rate, and it was pretty heavily self-selecting. It’s basically what they call the “healthy volunteer” effect, that the people who took part were those who were fit and able and to some extent simply mentally competent to undergo a full half day of tests and sample taking and measurements and things like that. I know because I was one of the participants. But the view was that it would all work out somehow in the wash, because of the sheer size of the sample.

But within the last five years evidence has emerged that in fact these people who have taken part are abnormal in a sense, and that, for example, they have much lower total mortality and total cancer rate than the general population at the age group of, like, you know, 70 plus for example. So clearly there’s a question over how representative they really are. But again, the view was when this was found that, oh it’ll all come out in the wash, everything will be fine and we’ll have precise answers to these questions.

But in July this year [2023] the journal Nature Human Behaviour published a study by Tabea Schoeler at the University of Lausanne in Switzerland and colleagues, who looked at the specific issue of genetic correlations between socio-behavioural traits. What they looked for was whether there were signs of bias leading to exaggerated, or in some cases potentially underplayed, associations between your genes and things like your behaviour, your lifestyle, educational outcomes and things like that, which are of great interest.

And what they found was pretty substantial evidence of bias, which they had to apply corrections to and, of course, applying corrections to anything is slightly questionable. But they have basically flagged this up as proof positive that we do have to be careful about the impact of using non-random samples when we’re drawing insights from potential genetic correlations with behavioural, lifestyle and educational outcomes.

So it’s no longer just a statistical bit of hand-wringing by neurotic statisticians. If you move away from the gold standard of randomization, expect trouble!

Now this is not just in things like Biobank, and it might be fixable. We don’t know. Statisticians are very good at coming up with ways of fixing things, but it’s far from ideal. But this idea of Big Data generally, and the focus on precision (meaning very low variance, very small error bars) as a measure of how inferentially useful things are, has caused a lot of concern because it ignores the possibility that you can get a result that’s very precise but also pretty wrong.

In other words it’s a bit like if you’re an archer taking part in an archery competition: what would you rather be? An archer who can shoot arrows at the same point so that all the shots are tightly clustered, but they’re miles away from the bullseye that you’re trying to hit? Or would you prefer a greater scatter that at least is sort of vaguely centred on the bullseye? In other words what would you prefer to be: precisely wrong or roughly accurate?
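The archer analogy can be made concrete with a small simulation. The sketch below is illustrative only: the population size, the true rate of 60%, and the joining probabilities (30% for people with the trait, 20% for those without) are invented assumptions, not figures from any of the studies discussed here. It compares a huge self-selected sample with a simple random sample of 1,000 drawn from the same population.

```python
import random


def summarize(sample):
    """Return (size, estimated rate, naive standard error) for a 0/1 sample."""
    n = len(sample)
    rate = sum(sample) / n
    std_err = (rate * (1 - rate) / n) ** 0.5
    return n, rate, std_err


def compare_samples(pop_size=1_000_000, true_rate=0.60,
                    join_if_yes=0.30, join_if_no=0.20,
                    random_n=1_000, seed=1):
    """Compare a big self-selected sample with a small random one.

    All default values are hypothetical and chosen purely for illustration.
    """
    rng = random.Random(seed)

    # A population of 0/1 outcomes, e.g. 1 = "has the trait".
    population = [1 if rng.random() < true_rate else 0 for _ in range(pop_size)]
    actual_rate = sum(population) / pop_size

    # Self-selection: people with the trait are more likely to take part.
    self_selected = [y for y in population
                     if rng.random() < (join_if_yes if y else join_if_no)]

    # Simple random sample: everyone has the same chance of being drawn.
    random_sample = rng.sample(population, random_n)

    return actual_rate, summarize(self_selected), summarize(random_sample)


if __name__ == "__main__":
    actual, big, small = compare_samples()
    print(f"true rate in population: {actual:.3f}")
    for label, (n, rate, se) in (("self-selected", big), ("random", small)):
        print(f"{label:>13}: n={n:>7,}  estimate={rate:.3f}  std err={se:.4f}")
```

Under these assumptions the self-selected sample runs to roughly a quarter of a million people and reports a tiny standard error, yet its estimate sits near 69% rather than 60%. The random sample of 1,000 has a much wider error bar, but one that is centred on the truth.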

So this has been investigated on a theoretical basis by Xiao-Li Meng at Harvard University. He has developed theories for the impact of even a tiny bit of bias in the participation in a sample and shown to what extent that undermines its reliability, and the way he crystallizes it is by saying: if you have a certain amount of bias in participation rates, what does that make this huge sample equivalent to in terms of a simple random sample? And the impact is shocking.

If you have just a little bit of bias it can turn a non-random convenience sample, or however you got it, of hundreds of thousands into the equivalent of a simple random sample of just a few dozen, in terms of its reliability. It is that big an impact. Again, this is no longer a bit of theoretical hand-wringing by statisticians.
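A rough version of Meng's comparison can be worked through numerically. Stated here as an approximation rather than his full result, the effective sample size is about (f / (1 - f)) / rho^2, where f is the fraction of the population that ended up in the sample and rho is the correlation between taking part and the quantity being measured. The function name and the numbers below are hypothetical, chosen only to show the scale of the effect.

```python
def effective_sample_size(n, population_size, rho):
    """Approximate size of the simple random sample that would be as
    reliable as a self-selected sample of n people.

    n               -- size of the self-selected sample
    population_size -- size of the population it was drawn from
    rho             -- correlation between taking part and the outcome
                       (an assumed value; in practice it is rarely known)
    """
    if not 0 < n < population_size:
        raise ValueError("n must be greater than 0 and less than population_size")
    if rho == 0:
        raise ValueError("rho must be non-zero; rho == 0 means no selection bias")
    f = n / population_size            # fraction of the population sampled
    return (f / (1 - f)) / rho ** 2    # approximation: (f / (1 - f)) / rho^2


if __name__ == "__main__":
    # Hypothetical numbers: half a million volunteers from a population of
    # 50 million, with a participation-outcome correlation of just 0.02.
    n_eff = effective_sample_size(n=500_000, population_size=50_000_000, rho=0.02)
    print(f"effective sample size: about {n_eff:.0f}")
```

With these made-up inputs the half-million volunteers are worth roughly 25 randomly chosen people. Because rho is squared in the denominator, halving the correlation only quadruples the effective size, which is why even very weak self-selection is so costly.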

A very interesting paper came out in 2021 in the journal Nature, whose authors, led by Valerie Bradley at Oxford University, looked at attempts during the pandemic to measure the rates of US vaccine uptake, to see if the message was getting through.

One way to do this was to use Facebook: what people were saying on Facebook about whether they’d had the vaccine or not. And you’re thinking well, yeah, it’s a huge sample. In fact you know it’s hundreds of thousands, not a problem there, but a bit of a problem with how well that represents the US population as a whole.

At the same time, people undertook the classic voting polling approach to measuring vaccine rates, by going out and taking a random sample of a thousand or so people. And the two approaches came up with pretty disparate rates of vaccine uptake. So the question is, well, which was right? Was it the far smaller classic randomized sample or was it the huge, very convenient Big Data Facebook trawl?

Well, no prizes for guessing: it was the traditional approach. Now how do we know that? Well, that’s what makes this paper so interesting. We know what the reality is, because the Centers for Disease Control in the United States kept records of who’s had vaccines. So we know what the real rate of take-up was, and it was much closer to what the random sample of a thousand or so participants found, compared to the vast Facebook study.

So again proof positive that, if you move away from the classic random approach to participation in samples and surveys, expect trouble! Again, Meng did the calculation, and he showed that the Facebook sample of 250,000 people was equivalent, in terms of its reliability and accuracy (despite having a far smaller variance, in other words looking impressively precise), to a simple random sample of 10 people. It is that bad!

So two very strong papers, I would submit, that show that statisticians aren’t just being neurotic about this increasing tendency to put stress on precision through using large databases. People need to be aware that precision is good, but ultimately we want things to be accurate as well, and if you move away from randomized samples you might be getting one but not the other.

References and further reading

Bradley VC, Kuriwaki S, Isakov M, Sejdinovic D, Meng X-L, Flaxman S. Unrepresentative big surveys significantly overestimated US vaccine uptake. Nature 2021;600:695-700.

Nunan D, Bankhead C, Aronson JK for the Catalogue of Bias Collaboration. Selection bias. Catalogue of Bias 2017. Accessed 11 December 2023.

Powell A. 2 early vaccination surveys worse than worthless thanks to ‘big data paradox,’ analysts say. Harvard Gazette 2021, December 8. Accessed 11 December 2023.

Schoeler T, Speed D, Porcu E, Pirastu N, Pingault J-B & Kutalik Z. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nature Human Behaviour 2023;7:1216-1227.


