Many Analysts, One Data Set – many conclusions



Medscape Psychiatry has a good write-up of a very interesting paper by Silberzahn and colleagues, in which 29 teams, involving 61 analysts, used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players. Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units.

They report that the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability. These findings suggest that significant variation in the results of analyses of complex data may be difficult to avoid, even by experts with honest intentions.

Their methodology is particularly interesting and sophisticated, and leaves one in considerable doubt. As the Medscape reviewer says: “For years, when I read a scientific paper, I’ve thought that the data yield the published result. What Nosek and his colleagues have found is that results can be highly contingent on the way the researchers analyze the data. And, get this: There is little agreement on the best way to analyze data.”

Recommended weekend reading.


I thought we already had a thread that discussed this project…

I agree that the project was interesting, and highlights how many different analytic decisions are made along the way that can influence the final results.

I think the degree of heterogeneity in the final estimates has been rather overstated. The majority of the effect size estimates are very close. There are a few outliers on each end and a bunch of effect sizes that are all in the same ballpark as one another.

One thing that frustrates me is how this is being interpreted by many readers:

Yes, the results (or perhaps we should say the conclusions) are contingent on the way researchers analyze the data, but some people seem to think this means we’re doing math differently. This has much more to do with other decisions - what variables should be “adjusted” for (requires causal thinking often in absence…) and what is the proper way to model the outcome - than the actual computations themselves.
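To make the point about adjustment concrete, here is a minimal sketch with entirely hypothetical counts (they are not taken from the paper or its data set). It assumes a single binary stratifying covariate and compares the crude odds ratio with a Mantel-Haenszel odds ratio that adjusts for that covariate. The arithmetic is identical in both cases; only the decision about what to adjust for differs.

```python
# Hypothetical 2x2 counts per stratum (NOT from the red-card data set):
# (exposed_events, exposed_nonevents, unexposed_events, unexposed_nonevents)
strata = {
    "stratum_A": (40, 160, 10, 40),
    "stratum_B": (5, 95, 20, 380),
}


def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table."""
    return (a * d) / (b * c)


def crude_odds_ratio(tables):
    """Ignore the stratifying covariate: pool all strata into one table."""
    a, b, c, d = (sum(col) for col in zip(*tables.values()))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(tables):
    """Adjust for the stratifying covariate with the Mantel-Haenszel estimator."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables.values())
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables.values())
    return num / den


crude_or = crude_odds_ratio(strata)
adjusted_or = mantel_haenszel_odds_ratio(strata)
print(f"crude OR    = {crude_or:.2f}")
print(f"adjusted OR = {adjusted_or:.2f}")
```

With these made-up counts the odds ratio is 1 within each stratum, yet pooling gives a crude odds ratio of about 2.47. Whether the stratified or the pooled figure answers the research question is a causal judgment, not a computational one.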


I used the paper as the basis of an assignment in my mixed undergrad/grad course on reproducible and collaborative statistical data science this term.

What impressed me most about the paper was that the research question was not well specified and indeed very hard to turn into a statistical question; almost all the analyses relied on insane pre-packaged models with no clear connection to the phenomenon; all the teams tried several analyses (indeed, the experiment was set up so that teams gave each other feedback between an initial analysis and a final analysis) but did nothing to take into account multiplicity and selective inference; there’s nothing stochastic in the problem, but almost all reported confidence intervals or credible intervals, and the main authors created CIs for those that didn’t; the odds ratio is defined in bizarre and incommensurable ways in some of the analyses, so comparing the OR estimates doesn’t make sense; etc.
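As a rough illustration of the multiplicity concern, the sketch below computes the chance of at least one nominally significant result when several analyses are tried and the null is true. It assumes the analyses are independent, which is a simplification made only for illustration: analyses of the same data set are correlated, so the actual inflation would differ. The function names and the choice of five analyses are hypothetical.

```python
import random


def familywise_error_rate(k, alpha=0.05):
    """P(at least one p < alpha) across k independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** k


def simulated_rate(k, alpha=0.05, trials=20000, seed=0):
    """Monte Carlo check of the same quantity.

    Under a true null, a valid p-value is uniform on [0, 1], so each
    analysis is represented by one uniform draw (independence assumed).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        smallest_p = min(rng.random() for _ in range(k))
        if smallest_p < alpha:
            hits += 1
    return hits / trials


analytic = familywise_error_rate(5)
simulated = simulated_rate(5)
print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

Under these assumptions, reporting the best of five analyses at a nominal 5% level gives a false-positive rate of roughly 23%.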

It appeared from the affiliations that few if any of the teams contained bona fide statisticians. Most seemed to be methodologists working in psychology, sociology, etc…

I think what the paper demonstrates is that (some) practitioners in social science pick statistical methods as a sort of pattern-matching exercise, without paying much attention to the assumptions or the interpretation of the results: Cargo-cult Statistics.


I totally agree with @pbstark. Indeed, this is a great starting point for any course on applied statistics. To state the obvious: performing statistical analysis is way more than just selecting a package and throwing in the data, as many do. Whoever does that is neither a statistician nor a data scientist; they are just a technician.

In my opinion, there is one more lesson to be learned here, this time for applied statisticians: the paper proposes a model for research workflow, and I think that it is a good one and should be adopted. In particular, it emphasizes the importance of collaboration in teamwork, and such collaboration should include peer review of statistical analysis plans or statistical sections in protocols, the analysis code and the results, as well as other things.


This is a good point.

The question was, apparently, whether referees showed a red card to dark-skinned players more frequently than to others. This (trivial) hypothesis can be tested with a simple incidence rate ratio: the number of red cards per player-hour for dark-skinned players divided by the same rate for the others.
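A minimal sketch of that calculation follows. The counts and player-hours are invented for illustration, and the function name is hypothetical. The interval is a Wald interval on the log scale, which assumes the red-card counts behave like Poisson counts; that is a modelling assumption added here, not something stated in the thread.

```python
import math


def incidence_rate_ratio(events_1, time_1, events_0, time_0, z=1.96):
    """Rate ratio of group 1 versus group 0, with a Wald interval.

    The interval uses se(log IRR) = sqrt(1/events_1 + 1/events_0),
    which relies on a Poisson assumption for the event counts.
    """
    irr = (events_1 / time_1) / (events_0 / time_0)
    se_log = math.sqrt(1 / events_1 + 1 / events_0)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper


# Hypothetical data: 30 red cards in 10,000 player-hours for one group,
# 60 red cards in 30,000 player-hours for the other.
irr, lower, upper = incidence_rate_ratio(30, 10_000, 60, 30_000)
print(f"IRR = {irr:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

With these invented numbers the rate ratio is 1.5, and the interval runs from just under 1 to about 2.3.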

The problem was that the various teams of statisticians then started building models to test other hypotheses involving why the observed effect existed. Of course, they all built different models because 1) there was no underlying theory, so everyone had to make up their own, and 2) some of them probably knew as little about football as I do.

If there is a warning here, it’s don’t let the statistician build the theory! Theories are best built by the subject matter experts and tested by statisticians.


Bingo. I think statisticians should be at the table during these discussions - the better for us to understand the effects we are trying to model - but not necessarily making the decisions about what should be “adjusted for” (because that changes the inference of the findings).


I’m going to play devil’s advocate and offer a counter narrative to this original quote.

To many clinicians (like me), the original quote resonates, because our initial assumptions about “data analysis” were incorrect. Many of us thought data analysis was a form of deductive analysis, where the output would always follow from the input. The conceptual realization that “data analysis” is modelling, a process that involves decisions beyond number crunching, is a massive surprise.

How does one react to this revelation? As non-experts, how do we now appraise studies whose results are influenced by analytic decisions we don’t understand, and by decisions that may vary from one paper to the next?

From this perspective, the original quote should be treated as a wonderful thing. It’s an opportunity to engage clinicians who finally realize that they have a knowledge gap. And maybe to reassure us when studies do have consistent approaches to analysis (it’s not that bad!), or to educate us when studies may be subject to more variability than we realize (it is that bad!).