A statistical misadventure leading to retraction

In an interesting case recounted with humility and good humor in this open-access Editorial [1], the specification of clustered variances and fixed effects at the same level resulted in wrong inferences in a highly touted article. (On the technical point, [1] cites this presentation by Austin Nichols & Mark Schaffer; see slide 8.)
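For concreteness, here is a toy simulation (my own sketch, not the retracted paper's data or code) of the failure mode as I understand it from the Nichols & Schaffer slides: with one dummy per provider, OLS residuals sum to zero within each provider, so clustering the variance estimator on that same provider identifier makes the "meat" of the sandwich vanish and the reported standard errors collapse to numerical zero.

```python
import numpy as np

rng = np.random.default_rng(0)
G, n = 20, 30                        # 20 hypothetical providers, 30 patients each
g = np.repeat(np.arange(G), n)       # provider id for each patient
y = rng.normal(0, 1, G)[g] + rng.normal(0, 1, G * n)  # outcome with provider effects

X = np.zeros((G * n, G))
X[np.arange(G * n), g] = 1.0         # one fixed-effect dummy per provider, no intercept

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ (X.T @ y)           # OLS: per-provider means
u = y - X @ beta                     # residuals sum to zero within each provider

# conventional iid standard errors: sensible magnitude (~ sigma / sqrt(n))
s2 = (u @ u) / (G * n - G)
se_iid = np.sqrt(np.diag(s2 * XtX_inv))

# cluster-robust sandwich, clustered on the SAME provider id as the dummies
meat = np.zeros((G, G))
for k in range(G):
    score = X[g == k].T @ u[g == k]  # within-provider residual sum: exactly zero
    meat += np.outer(score, score)
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_iid.round(3))       # reasonable uncertainty for each provider effect
print(se_cluster.max())      # numerically zero: every contrast looks 'significant'
```

A finite-sample correction only rescales the meat, so it cannot rescue a variance that is identically zero; a mixed-up specification like this has to be caught by inspection, not by the estimator itself.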

I would guess many of you will find valuable lessons to draw for yourselves and your students from this case. I’ll just offer a few of my own quick observations, plus one question.

  • This is a major reason why I am a ‘methodological Bayesian’. Using Bayesian methods renders the most substantive aspects of the modeling transparent, while submerging the least consequential machinery into the ‘magic’ of MCMC.
  • An even bigger lesson: when you have a result you plan to tout highly, you should try several different modeling approaches. Ironically, the original article employed ‘cluster-robust’ variance estimators, yet never sought genuine robustness to variation in model specification.
  • Fallibilism is true.

Question: It seems to me inconceivable that a mishap like this could befall a Bayesian analyst doing hierarchical modeling; am I wrong? Does the statistical modeling/interpretation error leading to this retraction have a Bayesian analogue?

  1. Shafer SL. Broken Hearts. Anesthesia & Analgesia. 2016;122(5):1231-1233. doi:10.1213/ANE.0000000000001253

What were the anesthesiologists’ fixed effects, and why fit them? I see only patient characteristics listed.

Wouldn’t the following apply to Bayes also: “hierarchical models require the assumption that there is no correlation between the patient characteristics and the quality of the anesthesiologists (or surgeons or hospitals), a set of assumptions not required with fixed-effects models. Thus, the assumptions of hierarchical modeling are frequently violated when used to estimate provider quality because hierarchical modeling assumes that the random effect (provider effect) is not correlated with patient risk.”
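To make the quoted concern concrete, here is a toy simulation (my own construction; the setup and numbers are illustrative, not from the paper). When the provider effect is correlated with patient risk, an estimator that pools information across providers (pooled OLS here, standing in for the direction a random-effects fit moves as it borrows more between-provider information) is biased for the patient-risk coefficient, while the within-provider (fixed-effects) estimator is not:

```python
import numpy as np

rng = np.random.default_rng(1)
G, n = 200, 20
g = np.repeat(np.arange(G), n)          # provider id per patient
a = rng.normal(0, 1, G)                 # provider effect
x = a[g] + rng.normal(0, 1, G * n)      # patient risk, correlated with provider effect
y = 1.0 * x + a[g] + rng.normal(0, 1, G * n)   # true risk coefficient = 1

# pooled OLS ignores the provider/risk correlation and absorbs it into the slope
xc, yc = x - x.mean(), y - y.mean()
beta_pooled = (xc @ yc) / (xc @ xc)     # biased: tends toward 1.5 in this setup

# within (fixed-effects) estimator: demean inside each provider
xw = x - (np.bincount(g, weights=x) / n)[g]
yw = y - (np.bincount(g, weights=y) / n)[g]
beta_within = (xw @ yw) / (xw @ xw)     # consistent: close to the true value 1

print(beta_pooled, beta_within)
```

The same correlation would of course also bias a Bayesian hierarchical fit that omits it, which is why it reads to me as a substantive modeling assumption rather than a technical one.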


As I understand it, such fixed effects would aim to measure an ‘anesthesiologist quality’ construct.

As I’d use the term, hierarchical modeling subsumes both ‘fixed’ and ‘random’ effects, a point that Gelman’s rejection of those latter terms nicely underscores. (On this point of terminology, I especially like the discussion in Gelman & Hill which I reference in this CrossValidated answer.)
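To spell out that subsumption in symbols (my sketch, following the Gelman & Hill formulation):

```latex
y_{ij} \sim \mathrm{N}\!\left(\alpha_{j} + x_{ij}\beta,\ \sigma_y^{2}\right),
\qquad
\alpha_{j} \sim \mathrm{N}\!\left(\mu_\alpha,\ \sigma_\alpha^{2}\right)
```

Letting \sigma_\alpha \to \infty estimates each \alpha_j separately (the classical ‘fixed effects’); \sigma_\alpha \to 0 gives complete pooling; a finite, estimated \sigma_\alpha gives the partial pooling usually labeled ‘random effects’. On this view the two are endpoints of one model, not different methods.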

I believe the argument you offer within quotes exhibits a frequentist’s arbitrarily fractured view of statistical methods (with fixed-effects models over here, hierarchical models over there, …) as against the coherent and unified view Bayesianism delivers (cf. “everything is a parameter”).

While I agree that a Bayesian hierarchical analyst who lacked sufficient domain knowledge might overlook the correlation you mention, this would represent a failure in the substantive realm of model realism. No mere technology can protect the modeler from a lack of metaphysical realism. By contrast, the modeling failure highlighted in my opening post seems to me purely technical in nature. It is against such strictly technical errors that I am suggesting Bayesian methods may be protective.

The statistical points are too subtle for me. But surely no one disputes that “some anesthetists are better than others”! Although I guess we have discovered that the methods originally used in this paper are a poor way to identify the good and the bad.


Jim, I would dispute that your formulation of this claim is sharp enough to be scientifically useful. (Karl Popper had a lovely way of stressing the importance of sharply defined theories, which I introduce by way of a quote in this comment on Frank’s blog.)

To go beyond searching for a vague, unitary ‘oomph’, one would need to set forth various dimensions of anesthesia practice, and identify which contexts will most severely test each dimension. Might some anesthesiologists excel in cases where high blood loss is involved? Might others excel in cases with compromised lung function? While some common dispositional or personality factors (which statistical method might help to ‘discover’) may underlie all excellent anesthesia performance, perhaps specific knowledge, training and preparation (having certain drugs at the ready, say) also matter in specific circumstances.


You’re right, David. However, in my defence I was quoting Steven Shafer’s editorial. https://journals.lww.com/anesthesia-analgesia/Fulltext/2016/05000/Broken_Hearts.1.aspx
To be fair, the original paper with the erroneous statistics was making a more precise, Popperian claim: “Some anesthetists are better than others, and we can identify those who are so much better, or so much worse, than the average that we can be confident the play of chance is not misleading us.” Sadly, the correct analysis refuted that claim.
