Hello all,
Motivation: Big, hypothesis-free data analysis studies using e.g. the UK Biobank, or other equivalent resources (often large epidemiological studies), make the news because of their unlikely, sometimes implausible, and mostly unreproducible findings. See here for just one recent example. This has become such a famous bête noire that the Daily Mail Oncological Ontology Project (now retired, I think) made a wonderful satire of the practice. My suspicion is that a lot of these kinds of findings are the result of residual confounding by socioeconomic group (SEG).
I’m wondering if anyone can point me to useful materials on the actual statistical mechanics of this kind of daftness. “Findings” like these have a significant impact on my interactions with patients (I’m a dementia specialist and researcher, now also suffering a severe case of late-onset mathematics leading to mid-life undergraduate study; blame FH!). To my mind this kind of thing undermines the legitimacy of science in the public discourse/imagination.
There are obvious potential baddies:
- Failure to correct for multiple testing
- Researcher degrees of freedom unacknowledged
- The publish-or-perish culture, combined with university comms departments and the media, creating a hot mess of poor science and poor science communication.
But, I’m a bit more interested in whether, for example:
- Can a non-linear relationship between e.g. SEG and variable X cause residual confounding when both SEG and X are entered into a simple multivariable model under linear assumptions?
- Would a hierarchical/mixed effects model reduce the risk of such confounding?
- Because wealth is Pareto-distributed, does categorisation into SEG classes inevitably lead to residual confounding? That is, can it be expected to, and could data access be made contingent on a commitment to avoiding this kind of error?
- What are other statistical sources of this kind of error?
- Can we effectively protect against it by using the methods of causal inference, e.g. clear DAG development, or simply pre-registration?
- More broadly, do people think the drive to bad science/comms can in fact be overcome by “mere” methodological guardrails?
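To make the residual-confounding questions above concrete, here is a minimal simulation sketch (entirely illustrative; the data-generating process, parameter values, and variable names are my own assumptions, not from any real study). Log-wealth drives both an exposure X and an outcome Y, while X has no causal effect on Y at all. Adjusting for SEG as five wealth quintiles leaves within-class variation in the heavy-tailed confounder unadjusted, so X picks up a spurious “effect”; adjusting for the confounder on the correct (log) scale removes it:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical data-generating process: wealth is Pareto-distributed, and
# log-wealth L drives both exposure X and outcome Y. X has NO causal effect
# on Y, so any non-zero coefficient on X is pure confounding bias.
wealth = rng.pareto(3.0, n) + 1.0        # Pareto(alpha=3, x_min=1)
L = np.log(wealth)                       # log-wealth (exponentially distributed)
x = L + rng.normal(0, 0.5, n)            # exposure driven by log-wealth
y = 2.0 * L + rng.normal(0, 1.0, n)      # outcome driven by log-wealth only

def ols_x_coef(y, covariates):
    """OLS coefficient on the first covariate (x), with an intercept."""
    X = np.column_stack([np.ones(len(y))] + covariates)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Adjustment 1: SEG as 5 wealth-quintile classes (dummy variables).
# Coarse categorisation of a heavy-tailed confounder leaves substantial
# within-class variation, especially in the top quintile.
edges = np.quantile(wealth, [0.2, 0.4, 0.6, 0.8])
seg = np.searchsorted(edges, wealth)               # SEG classes 0..4
seg_dummies = [(seg == k).astype(float) for k in range(1, 5)]
b_seg = ols_x_coef(y, [x] + seg_dummies)

# Adjustment 2: the confounder on the correct (log) scale.
b_exact = ols_x_coef(y, [x, L])

print(f"coef on X, SEG-quintile adjustment:     {b_seg:.3f}  (residual confounding)")
print(f"coef on X, exact log-wealth adjustment: {b_exact:.3f}")
```

Under this setup the quintile-adjusted coefficient on X comes out clearly positive despite X having no effect, while the correctly specified adjustment drives it to roughly zero; the same logic applies to a linear term for a confounder whose true relationship is non-linear.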
I’m especially interested in whether this has already been studied, how, and by whom. I can find very little on it outside the grey/popular literature. With sincerest thanks.