“The ideas that p-values are redeemable are not convincing to me. They cause far more damage than benefit.” No, the misuse of P-values as “significance tests” is what causes the damage, especially when they are applied to only one hypothesis (this misuse was an “innovation” of Karl Pearson and Fisher; before them “significance” referred to a directional Bayesian tail probability, not a data-tail probability).
Neymanian confidence limits show the two points at which p = 0.05, which is fine if one can picture how p varies smoothly between and beyond those points. But few users seem able to do that. Plus, 95% confidence intervals have the disadvantage of relying on the misleading 0.05 dichotomy, whereas P-values can at least be presented with no mention of it. No surprise, then, that in the med literature I constantly see confidence intervals used only as a means to mis-declare the presence or absence of an association. This abuse is aggravated by the overconfidence produced by the horrendous adjective “confidence” being used to describe what is merely a summary of a P-value function and thus displays only compatibility; for that abuse of English, Neyman is as culpable for the current mess as Fisher is for his abuse of the terms “significance” and “null”.
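To make that concrete, here is a minimal sketch (my illustration, not anything from your post, with a made-up estimate of 1.2 and standard error of 0.5, assuming Python with numpy and scipy): the 95% limits are just the two parameter values at which the two-sided p hits 0.05, while p varies smoothly between and beyond them.

```python
# Minimal sketch of a P-value ("compatibility") function under a normal model.
# The estimate and standard error below are hypothetical, illustration only.
import numpy as np
from scipy import stats

est, se = 1.2, 0.5                    # made-up point estimate and its SE

def pfun(theta):
    """Two-sided p-value testing the parameter value theta."""
    return 2 * stats.norm.sf(abs(est - theta) / se)

# The 95% confidence limits are simply the two points where p = 0.05 ...
lo, hi = est - 1.96 * se, est + 1.96 * se
print(f"95% limits: ({lo:.2f}, {hi:.2f})")

# ... while p varies smoothly between and beyond those points:
for theta in (lo, est, hi, est + 3 * se):
    print(f"theta = {theta:5.2f}   p = {pfun(theta):.3f}")
```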
The point is that confidence intervals are just another take on the same basic statistical model, and are as subject to abuse as P-values. Thus, if one can’t “redeem” P-values, then one should face up to reality and not “redeem” confidence intervals either. That’s exactly what Trafimow & Marks did (to their credit) when they banned both P-values and confidence intervals from their journal. Unfortunately for such bold moves, the alternatives are just as subject to abuse, and some, like Bayesian tests based on null-spiked priors, only make the problems worse.
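One concrete illustration of that last point (again my sketch, with assumed numbers: a point mass at the null plus a normal slab with spread tau = 1, unit-variance data, and a “just significant” z = 1.96 at every sample size): under such a null-spiked prior the very same p = 0.05 yields a Bayes factor that favors the null ever more strongly as n grows, the Jeffreys–Lindley behavior.

```python
# Sketch of Jeffreys-Lindley behavior under a null-spiked prior:
# point mass at theta = 0 plus a N(0, tau^2) slab. All numbers illustrative.
import numpy as np
from scipy import stats

z, tau = 1.96, 1.0                     # fixed "p = 0.05" z-score; slab spread
for n in (10, 100, 1000, 10000):
    se = 1 / np.sqrt(n)                # SE of the mean, unit-variance data
    xbar = z * se                      # data giving p = 0.05 at every n
    m0 = stats.norm.pdf(xbar, 0, se)                       # marginal under spike
    m1 = stats.norm.pdf(xbar, 0, np.sqrt(se**2 + tau**2))  # marginal under slab
    print(f"n = {n:6d}   BF01 = {m0 / m1:7.2f}")  # grows with n despite fixed p
```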
The core problems are, however, created by erroneous and oversimplifying tutorials, books (especially of the “made easy” sort, which seems to be code for “made wrong”), teachers, users, reviewers, and editors. Blaming the methods is just evading the real problems, which are the profound limits of human competency and the perverse incentives for analysis and publication. Without tackling these root psychosocial problems, superficial solutions like replacing one statistical summary with another can have only very limited impact (I say that even though I still promote the superficial solution of replacing P-values with their Shannon information transform, the S-value s = -log2(p), which you, Frank, continue to ignore in your responses).
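For readers who haven’t met it, the transform is one line: p = 0.05, for example, carries only about 4.3 bits of information against the test hypothesis, barely more surprising than getting 4 heads in a row from a fair coin (which carries exactly 4 bits).

```python
# The S-value: Shannon information against the test hypothesis, in bits.
import math
for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:5.3f}   s = {-math.log2(p):5.2f} bits")
```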