Many Analysts, One Data Set – many conclusions

Medscape Psychiatry has a good write-up of a very interesting paper by Silberzahn and colleagues in which 29 teams, involving 61 analysts, used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players. Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units.

They report that the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability. These findings suggest that significant variation in the results of analyses of complex data may be difficult to avoid, even by experts with honest intentions.

Their methodology is particularly interesting and sophisticated, and leaves one in considerable doubt. As the Medscape reviewer says: “For years, when I read a scientific paper, I’ve thought that the data yield the published result. What Nosek and his colleagues have found is that results can be highly contingent on the way the researchers analyze the data. And, get this: There is little agreement on the best way to analyze data.”

Recommended weekend reading.


I thought we already had a thread that discussed this project…

I agree that the project was interesting, and highlights how many different analytic decisions are made along the way that can influence the final results.

I think the degree of heterogeneity in the final estimates has been rather overstated. The majority of the effect size estimates are very close. There are a few outliers on each end and a bunch of effect sizes that are all in the same ballpark as one another.

One thing that frustrates me is how this is being interpreted by many readers:

Yes, the results (or perhaps we should say the conclusions) are contingent on the way researchers analyze the data, but some people seem to think this means we’re doing math differently. This has much more to do with other decisions - what variables should be “adjusted” for (requires causal thinking often in absence…) and what is the proper way to model the outcome - than the actual computations themselves.


I used the paper as the basis of an assignment in my mixed undergrad/grad course on reproducible and collaborative statistical data science this term.

What impressed me most about the paper was that the research question was not well specified and indeed very hard to turn into a statistical question; almost all the analyses relied on insane pre-packaged models with no clear connection to the phenomenon; all the teams tried several analyses (indeed, the experiment was set up so that teams gave each other feedback between an initial analysis and a final analysis) but did nothing to take into account multiplicity and selective inference; there’s nothing stochastic in the problem, but almost all reported confidence intervals or credible intervals, and the main authors created CIs for those that didn’t; the odds ratio is defined in bizarre and incommensurable ways in some of the analyses, so comparing the OR estimates doesn’t make sense; etc.

It appeared from the affiliations that few if any of the teams contained bona fide statisticians. Most seemed to be methodologists working in psychology, sociology, etc…

I think what the paper demonstrates is that (some) practitioners in social science pick statistical methods as a sort of pattern-matching exercise, without paying much attention to the assumptions or the interpretation of the results: Cargo-cult Statistics.


I totally agree with @pbstark. Indeed, this is a great starting point for any course on applied statistics. To state the obvious: performing statistical analysis is way more than just selecting a package and throwing in the data, as many do. Whoever does that is neither a statistician nor a data scientist; they are just a technician.

In my opinion, there is one more lesson to be learned here, this time for applied statisticians: the paper proposes a model for research workflow, and I think that it is a good one and should be adopted. In particular, it emphasizes the importance of collaboration in teamwork, and such collaboration should include peer review of statistical analysis plans or statistical sections in protocols, the analysis code and the results, as well as other things.


This is a good point.

The question was, apparently, whether referees showed a red card to dark-skinned players more frequently than to others. This (trivial) hypothesis can be tested with a simple incidence rate ratio: the number of red cards per player-hour in one group divided by the same rate in the other.
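For anyone who wants to see what that calculation looks like, here is a minimal sketch. The counts are invented purely for illustration (they are not from the paper), the function name is hypothetical, and the interval is the usual large-sample Wald interval on the log scale, which is an assumption on my part rather than anything the paper did.

```python
import math

def incidence_rate_ratio(events_a, exposure_a, events_b, exposure_b, z=1.96):
    """Return (irr, lower, upper) for group A relative to group B.

    events_*   : number of red cards in each group
    exposure_* : player-hours observed in each group
    The interval is a Wald interval on the log scale, using
    SE(log IRR) = sqrt(1/events_a + 1/events_b).
    """
    rate_a = events_a / exposure_a
    rate_b = events_b / exposure_b
    irr = rate_a / rate_b
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper

# Hypothetical counts, made up for this example only.
irr, lower, upper = incidence_rate_ratio(
    events_a=60, exposure_a=40_000,   # group A: 60 red cards in 40,000 player-hours
    events_b=100, exposure_b=80_000,  # group B: 100 red cards in 80,000 player-hours
)
print(f"IRR = {irr:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

Note that this answers only the descriptive question of whether the rates differ; it says nothing about why.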

The problem was that the various teams of statisticians then started building models to test other hypotheses involving why the observed effect existed. Of course, they all built different models because 1) there was no underlying theory, so everyone had to make up their own, and 2) some of them probably knew as little about football as I do.

If there is a warning here, it’s don’t let the statistician build the theory! Theories are best built by the subject matter experts and tested by statisticians.


Bingo. I think statisticians should be at the table during these discussions - the better for us to understand the effects we are trying to model - but not necessarily making the decisions about what should be “adjusted for” (because that changes the inference of the findings).


I’m going to play devil’s advocate and offer a counter narrative to this original quote.

To many clinicians (like me), the original quote resonates, because our initial assumptions about “data analysis” were incorrect. Many of us thought data analysis was a form of deductive analysis, where the output would always follow from the input. The conceptual realization that “data analysis” is modelling, a process that involves decisions beyond number crunching, is a massive surprise.

How does one react to this revelation? As non-experts, how do we now appraise studies whose results are influenced by analytic decisions we don’t understand, and by decisions that may vary from one paper to the next?

From this perspective, the original quote should be treated as a wonderful thing. It’s an opportunity to engage clinicians who finally realize that they have a knowledge gap. And maybe reassure us when studies do have consistent approaches to analysis (it’s not that bad!), or educate when studies may be subject to more variability than we realize (it is that bad!).


This is a wonderful question. Unfortunately as the paper being discussed exemplifies, there is analysis/statistician variation in addition to sampling and measurement variation. This leads to a few random observations:

  • The best we usually hope for in empirical work (outside of math and physics, where equations can be argued from first principles) is to choose a method that is competitive, i.e., that is not easily beaten by a method that someone else might choose.
  • The method should be chosen on the basis of statistical principles, simulation performance, and suitability for the task/data.
  • The method should be general enough to encompass complexities that other statisticians may have chosen a different method to handle.

The frequentist/Bayesian battle has a lot to do with the last point. Frequentist modeling invites a kind of dichotomization, e.g., do we include or exclude interaction terms? Do we allow the conditional variance of Y | X to vary over groups of observations? Do we entertain non-normality of Y | X? These issues create model uncertainty, require huge sample sizes to answer definitively, and result in non-preservation of confidence coverage and type I error. The Bayesian approach is the only way to take analysis uncertainties into account, by having parameters for everything you think may be a problem but don’t know. For example, in a 2-sample t-test you can have a parameter for the ratio of variances of the two groups, and a prior on that ratio that favors 1.0 but allows for ratios between 1/4 and 4.
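To make the variance-ratio prior concrete, here is one way it could be written down. This is only a sketch under my own assumptions: I put a normal prior on the log of the variance ratio, centered at 0 (ratio 1.0), and I read "allows for ratios between 1/4 and 4" as a central 95% prior interval. Other readings and other prior families are equally defensible.

```python
import math
from statistics import NormalDist

# Assumption: the variance ratio lies in [1/4, 4] with 95% prior probability.
upper_ratio = 4.0
z = NormalDist().inv_cdf(0.975)           # about 1.96

# Normal prior on log(ratio), centered at log(1) = 0 so the prior favors ratio 1.0.
prior_sd = math.log(upper_ratio) / z      # SD of log(variance ratio)
log_ratio_prior = NormalDist(mu=0.0, sigma=prior_sd)

def prior_prob_ratio_between(lo, hi):
    """Prior probability that the variance ratio lies in [lo, hi]."""
    return log_ratio_prior.cdf(math.log(hi)) - log_ratio_prior.cdf(math.log(lo))

mass_quarter_to_four = prior_prob_ratio_between(0.25, 4.0)  # 0.95 by construction
mass_half_to_two = prior_prob_ratio_between(0.5, 2.0)

print(f"SD of log variance ratio: {prior_sd:.3f}")
print(f"P(1/4 <= ratio <= 4) = {mass_quarter_to_four:.3f}")
print(f"P(1/2 <= ratio <= 2) = {mass_half_to_two:.3f}")
```

In a full analysis this prior would sit alongside priors for the two means and the common scale, and the posterior for the mean difference would then carry the uncertainty about unequal variances automatically.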

Explicitly allowing for model uncertainty preserves the operating characteristics of the method and results in just the right amount of uncertainty/conservatism in the final analysis.

Besides choosing between frequentist, Bayesian, and likelihood schools of modeling/inference, statisticians tend to be overly divided by how nonparametric they are. The majority of statisticians use linear models for continuous Y. I don’t, since I believe that semiparametric ordinal response models (e.g., the proportional odds or proportional hazards model) are more robust and powerful, and most importantly, free the analyst from having to make a decision about how to transform Y (and how to penalize for any data-based choice of transformation). The proportional odds model has the Wilcoxon and Kruskal-Wallis tests as special cases, and robustly handles bizarre distributions of Y | X including clumping at zero and bimodality.

Then there is the issue of prior beliefs in simplicity, e.g., additivity and linearity vs. routinely allowing effects to be smooth and nonlinear as I do. A side note: I can often predict that an analysis will be shown to have problems by seeing the choice of software the authors used. Papers using SPSS or SAS tend remarkably to show linear effects more often than papers using R or Stata. That’s because SPSS and SAS make it harder to handle nonlinearities, and statistician laziness sets in.


Thanks Frank for alerting me to the interesting discussion in this and the earlier thread about the Many Analysts paper.

The theme that I like most emerging from this discussion is the observation that operationalization of the question/hypothesis into a statistical model is non-trivial, underappreciated, and important. I particularly agree with the suggestion that closer collaboration between the subject matter experts and statisticians will help illuminate that this step is critical and imbued with its own assumptions, style, and uncertainty.


This should be qualified to read: “Don’t let the statisticians test theories they had no input on building from data they had no input on collecting.” :stuck_out_tongue_closed_eyes:


Statisticians have traditionally shied away from recommending decisions to be made from the data based on the conclusions of a study.

But it certainly seems to me that they could conduct sensitivity analyses - with input from the investigators responsible for a study - on how these conclusions may affect the subsequent decision-making.

If the decision-making would be consistent across a range of conclusions, then one would take some comfort that the underlying methodology used to reach those conclusions would not adversely impact the decisions being made.

Otherwise, one would have to go back to the drawing board and try to resolve the conflicting messages coming out of the competing analyses, if that is even possible.
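A rough sketch of what such a check might look like, using only the three odds ratios quoted at the top of this thread (minimum, median, maximum). The decision rule and its thresholds are entirely hypothetical; in practice they would come from the investigators, not the statistician.

```python
def decisions_agree(estimates, decide):
    """Apply a decision rule to every estimate and report whether all agree."""
    decisions = {decide(e) for e in estimates}
    return len(decisions) == 1, decisions

# Odds ratios quoted in the opening post: min, median, max across teams.
odds_ratios = [0.89, 1.31, 2.93]

def make_rule(threshold):
    """Hypothetical rule: act only if the odds ratio exceeds the threshold."""
    return lambda odds_ratio: "act" if odds_ratio > threshold else "do not act"

# With a low bar for action, every analysis leads to the same decision.
agree_low, decisions_low = decisions_agree(odds_ratios, make_rule(0.5))

# With the bar at "no effect" (OR = 1), the analyses conflict.
agree_null, decisions_null = decisions_agree(odds_ratios, make_rule(1.0))

print(agree_low, decisions_low)
print(agree_null, sorted(decisions_null))
```

The first case is the comfortable one; the second is the "back to the drawing board" case.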

I’ve always had trouble with sensitivity analysis. When the different approaches disagree it gives those who favor a certain answer an excuse to use the analysis that most closely provides that. Contrast that with a principled selection of “the” analysis, which is the way I like to operate in most cases.