Principles and guidelines for applied statistics



This brings up another interesting case study.

Frank and I had a fair amount of discussion with some physicians on Twitter regarding this trial:

“The primary outcome, death, disabling stroke, serious bleeding, or cardiac arrest at 5 years for ablation vs. drug therapy, was 8% vs. 9.2% (hazard ratio [HR] 0.86, 95% confidence interval [CI] 0.65-1.15, p = 0.3).”
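As a quick sanity check, the reported p-value can be recovered from the CI alone: on the log-HR scale, the interval half-width divided by the 1.96 normal quantile gives the standard error, and a Wald test follows. A minimal sketch (the normal approximation is my assumption, not anything stated in the trial report):

```python
import math

# Reported summary statistics from the trial's primary outcome
hr, lo, hi = 0.86, 0.65, 1.15

# Approximate standard error of log(HR) from the 95% CI width
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)

# Wald z-statistic and two-sided p-value via the normal CDF
z = math.log(hr) / se
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"SE(log HR) = {se:.3f}, z = {z:.2f}, p = {p:.2f}")  # p comes out near 0.3
```

The reported numbers are internally consistent, which is reassuring but says nothing about how they should be interpreted.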

I took issue with the way these results were frequently reported at the original meeting where they were presented, in the press, and circulated on social media among physicians. Usually something along the following lines:

“The results of this important trial indicate that ablation is not superior to drug therapy for CV outcomes at 5 years among patients with new-onset or untreated AF that required therapy.”

Something about this conclusion does not sit well with me. The CABANA trial results are not very strong evidence of superiority for ablation, but neither are they strong evidence of non-superiority; yet these results are often described as a “negative” trial, or summarized with terminology like “Ablation is not superior…”

*Personal disclosure: I am generally a “less-is-more” person when it comes to medicine and think that the barrier for adopting an invasive or aggressive therapy should often be higher than simply beating a point null hypothesis. Even with the results of that trial, I would probably choose not to have ablation; that benefit is not enough to overcome my personal discomfort with undergoing an invasive procedure. It just does not sit well with me to see the trial reported with conclusions such as “Ablation is not superior…” which is technically right from one standpoint (after all, the data did not prove ablation to be superior), but also wrong from a different view (neither did the data prove that ablation was “not superior” - yet that is often how results of such studies are framed).

I do wonder, admittedly, what the trial statistician’s take was/is, and how the discussions went with their PIs and clinical colleagues in interpreting such a result.


Wow, Andrew, I find that reporting cause for more than discomfort. Since this kind of reporting comes up constantly in my work, here’s my take and what I tell investigators:
It’s simply false reporting to say of a 95% interval of 0.65 to 1.15:
“The results of this important trial indicate that ablation is not superior to drug therapy for CV outcomes at 5 years among patients with new-onset or untreated AF that required therapy.”
And I mean that’s false, as in 2+2 = 5, albeit this type of false reporting seems the norm in the med literature. The trial “indicates” no such thing.
It flies in the face of all “replication crisis” hysteria that agonizes over false positives while barely or not even mentioning false negative reports like this one - where by “false negative” I mean in the ordinary English sense of misinterpreting an ambiguous outcome as a negative outcome instead of a shortcoming of the trial information.

An honest if optimistic description would have said instead something like
“The results of this trial are most compatible (at the 95% level) with an ablation effect relative to drug therapy of anywhere from a 35% rate reduction to 15% rate increase for adverse CV outcomes at 5 years among patients with new-onset or untreated AF that required therapy. Therefore, on the basis of these results alone we would refrain from making any recommendation about treatment choice…” [then proceed to discuss how it merges with other evidence].
Of course politically I expect it would be impossible to report that truth - that the trial does not supply enough information to derive an unambiguous judgment about even the direction of the effect.
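The “35% rate reduction to 15% rate increase” phrasing above comes straight from the CI endpoints; a minimal sketch of that arithmetic:

```python
# The CI endpoints translate directly into percent changes in the hazard rate:
# an HR below 1 is a rate reduction, an HR above 1 a rate increase.
lo, hi = 0.65, 1.15  # 95% CI for the hazard ratio

reduction = (1 - lo) * 100  # best-case end of the interval
increase = (hi - 1) * 100   # worst-case end of the interval

print(f"compatible with anywhere from a {reduction:.0f}% rate reduction "
      f"to a {increase:.0f}% rate increase")
```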

It does enable one to construct a credible posterior distribution for the effect based on a credible prior distribution. But if there are other trials that supply information on the effect, what would be needed for sound recommendations is a research synthesis, including examination of costs and complications from each treatment choice. I find it hard to believe that anyone would have funded and approved this trial without some good background evidence about those effects. So for this trial report my first priority would be to show the estimates and raw numbers so that it could be easily entered into a synthesis.
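As an illustration of what such a posterior could look like, here is a minimal normal-normal conjugate sketch on the log-HR scale. The skeptical prior (centered on no effect, HR = 1, with 95% prior mass roughly between HR 0.5 and 2) is entirely hypothetical and chosen for illustration; any real analysis would need a defensible prior and, as noted, a synthesis with other evidence:

```python
import math

# Likelihood summary on the log-HR scale, recovered from the reported 95% CI
est = math.log(0.86)
se = (math.log(1.15) - math.log(0.65)) / (2 * 1.96)

# Hypothetical skeptical prior: centered on no effect (HR = 1),
# 95% prior mass roughly between HR 0.5 and 2
prior_mean, prior_sd = 0.0, math.log(2) / 1.96

# Conjugate normal-normal update: precision-weighted average
w_prior, w_lik = 1 / prior_sd**2, 1 / se**2
post_mean = (w_prior * prior_mean + w_lik * est) / (w_prior + w_lik)
post_sd = (w_prior + w_lik) ** -0.5

# Posterior probability that ablation reduces the hazard (HR < 1)
p_benefit = 0.5 * (1 + math.erf((0 - post_mean) / (post_sd * math.sqrt(2))))
print(f"posterior HR = {math.exp(post_mean):.2f}, Pr(HR < 1) = {p_benefit:.2f}")
```

Under this (made-up) prior the posterior probability of some benefit is sizable but far from decisive, which matches the ambiguity being described: the data lean toward benefit without establishing it.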

I too am a “less is better” type (and there seems to be a current of that in the medical literature), but that’s our values speaking. I don’t think statisticians are hired to impose those values, although they often do. Values are also imposed unintentionally by sticking with unjustified alpha and power design parameters (alpha = 0.05 and beta = 0.20, as if false positives were automatically 4 times the cost of false negatives under the alternative; where did that come from?).
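The “4 times” figure is just the ratio of the tolerated error rates under those defaults; a one-line sketch, assuming error tolerances are set inversely proportional to error costs:

```python
alpha = 0.05  # tolerated Type I (false positive) error rate
beta = 0.20   # tolerated Type II (false negative) error rate at the design alternative

# If tolerated error rates are inversely proportional to error costs,
# these defaults implicitly treat a false positive as beta/alpha times
# as costly as a false negative.
implied_cost_ratio = beta / alpha
print(f"implied cost ratio: {implied_cost_ratio:.0f} to 1")
```

The point is not that 4:1 is wrong, but that it is almost never argued for; it simply rides along with the default parameters.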

Ideally the values in play here should be those of the patient in consultation with physicians, assuming (often unrealistically) the patient is told and able to understand what is known of the risks and benefits of each treatment. Despite having an ambiguous outcome, this trial could contribute to the needed risk-benefit data via its treatment-specific risk estimates for adverse events (not just CV events) - a risk function of baseline participant characteristics.


I fully agree. I was quite annoyed, and actually took a good bit of flak from some “less is more” fanatics (see below for more on this) for daring to suggest that the lack of a p < 0.05 result in CABANA did not justify statements such as “ablation not superior…” or “ablation no better than…” etc. The data were not strong enough to prove with high certainty that ablation does reduce the composite primary endpoint, but they also did not support the belief that it doesn’t reduce the composite primary endpoint. (Is that enough negatives for you?)

Would I personally choose to have ablation based on these data alone? Probably not. I’d want to see somewhat stronger evidence before agreeing to a procedure. If I were a regulator, and this were a new medical device up for approval, would I approve it based on these data alone? Again, probably not (even putting aside my personal “less is more” preference, the data are simply not strong enough to state with a high degree of certainty that the procedure improved the primary endpoint).

This gets to another of Frank’s points that designing fixed-N trials which have a high probability of concluding as “Maybe…?” is arguably unethical (after all, we randomized 2000+ patients to an invasive procedure versus medical therapy and are not much closer than before to knowing with much certainty whether ablation has a benefit over medical therapy in that population).

I agree - we should not impose those values; IMO, it is our job to provide the best design to answer the study question; analyze the data appropriately; and provide a dispassionate summary of the results, for the clinical world to decide how this treatment should be valued and applied.

Regardless of our personal preferences, it’s simply wrong to state that this trial means “ablation shows no benefit…” or even “patients that received ablation did no better than…” - actually, they did do better! Whether they did better by a large enough margin to “prove” that ablation is the winning strategy over the long run is the more difficult question to answer. @f2harrell has a good quote somewhere about another small-sample-size-and-therefore-inconclusive trial (this one of a drug) along the lines of “perhaps the funders were too embarrassed to admit that we were only able to rule out a massive benefit of the drug.”

I could even accept logic from the clinical community along the lines of “If it takes a trial with >2,000 patients randomized to suggest a relatively modest benefit of ablation, perhaps it should not be considered a first-line strategy until a patient has been on maximally tolerated medical therapy for X months” - while not explicitly stating that ablation is “not superior” they acknowledge that the therapy is not such a big winner that it must immediately be offered to all.

Sigh. I suspect you’re preaching to the choir on these boards. I wonder this with every grant application that begins using those default parameters rather than considering the specifics of the clinical question and relative cost of false positive versus false negative.

Fully agreed with this, and (IMO) again an underappreciated aspect of RCTs in general. People need to look beyond the primary endpoint’s p-value to appreciate the full benefits of a well-conducted RCT in understanding therapeutic benefits and harms.


Remarkable how we can all agree about the problems and yet not budge standard interpretational malpractice in the medical literature; e.g., JAMA routinely rejects attempts to point out this problem even via letters, and what little gets through is editorially watered down. I am left with the sense that the medical community is on the whole on the level of Congress in terms of hiding and denying error, and in reporting “facts” to fit foregone conclusions. If anyone has a feasible positive proposal to effect change I’d love to hear about it.


I’m likely stating the obvious, but this resistance probably stems from poor statistical literacy among journal editors, and from their fear of promoting views they don’t fully grasp.


The emphasis on “statistical significance” is a major component, of course. People naturally think that the inverse of a “statistically significant” positive finding is a NEGATIVE finding, so we get tortured phrases such as “not superior” and the like to describe any result that is not significant.


On the general topic, let me call everyone’s attention to this recent post by Danielle Navarro (sent to me by Valentin Amrhein):

  • we both especially liked the closing section, “What to Do?”:
“Honestly, I don’t know. I like Bayesian inference a great deal, and I still find Bayes factors useful in those circumstances where I (1) trust the prior, (2) understand the models and (3) have faith in the (either numerical or analytic) approximations used to estimate it. I don’t have a better alternative, and I’m certainly not going to countenance a return to using p-values. More than anything else, the one thing I don’t want to see happen is to have the current revival of Bayesian methods in psychology ossify into something like what happened with p-values.
What I think happened there is not necessarily that p-values are inherently useless and that’s why our statistics went bad. Rather, it’s that introductory methods classes taught students that there was A RIGHT WAY TO DO THINGS and those students became professors and taught other students and eventually we ended up with an absurd dog’s breakfast of an inference system that (I suspect) even Fisher or Neyman would have found ridiculous. If I’ve learned nothing else from my research on cultural evolution and iterated learning it’s that a collection of perfectly-rational learners can in fact ratchet themselves into believing foolish things, and that it’s the agents with most extreme biases that tend to dominate how the system evolves.
    Whatever we do with Bayesian methods, whatever role Bayes factors play, whether we use default or informed priors, the one thing I feel strongly about is this… we should try to avoid anything that resembles a prescriptive approach to inference that instructs scientists THIS IS HOW WE DO IT and instills in them the same fear of the Bayesian gods that I was once taught to have for the frequentist deities.
    It doesn’t help anyone, and it makes science worse.”


Sigh. Bolding for emphasis.


Just to be clear because the formatting seemed to fail: All that I posted below the Navarro link is from the link and are Navarro’s words (I could not have written better). I myself would bold this part:
“a collection of perfectly-rational learners can in fact ratchet themselves into believing foolish things, and that it’s the agents with most extreme biases that tend to dominate how the system evolves.”
Her observation seems to me to describe well the way NHST came to become the dominant yet mindless ritual that has corrupted so much of the scientific literature with a false “false positive”/“false negative” dichotomy (as has been said, the best way to avoid false positives and false negatives is to stop dichotomizing complex and near-continuous results into “positives” and “negatives”). It warns us that the same thing will happen with alternatives like Bayes factors if we are not vigilant in keeping dichotomania, nullism, and reification out of their use.


Reading the original post led me to Malcolm Rorty’s original paper, referenced therein. It contains an account of the attempts that he and his team made to recruit statisticians. They had, as a result, to define what exactly constituted a statistician. At the end of a lot of testing, they concluded thus:

In the final development of our examination, it became composed entirely of non-technical questions [of everyday reasoning]. We discovered, in fact, that it was of small importance whether a candidate could, or could not, work out a coefficient of correlation, provided he knew instinctively when to distrust one. And in the end we found ourselves placing technically trained statisticians, mathematicians, and actuaries under the supervision of skilled logicians, of men who might have to learn the details of statistical technique, but who started with an instinctive knowledge of the scientific method as a whole.

This, then, was our pragmatic answer to the question, “What is a statistician?” The statistician must be instinctively and primarily a logician and a scientist in the broader sense, and only secondarily a user of the specialized statistical techniques. The statistician who knew only statistics was a danger and a blight. It was less important to know how to use statistics than it was to know how and when not to use them.

The paper is a striking example of the mismatch between statistical training and practical statistics that is still a problem today.