Language for communicating frequentist results about treatment effects

journal
confidence-interval
p-value
writing
rct

#21

I can’t disagree, but in my opinion when foundational problems with a method lead to the majority of those using it to misuse it, I don’t care so much whether it’s the idea or the people. This is related to Don Berry’s story on why he became Bayesian after failing to teach bright statistics graduate students what a confidence interval really means. He concluded that the foundation is defective.

An excellent point. Richard McElreath in his book Statistical Rethinking made the point by always using 0.88 or some such credible intervals, instead of 0.95. We use 0.95 as a knee-jerk default and don’t think often enough of its arbitrariness. At least for normal-theory methods you can derive an interval for any confidence level if you know the 0.95 interval. I caution researchers not to use “does the interval contain zero” thinking, but the temptation is real.
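
The normal-theory conversion mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a symmetric interval of the form estimate ± z × SE on the scale where normality holds (for ratios that would be the log scale); the function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def rescale_interval(lower, upper, old_level=0.95, new_level=0.88):
    """Convert a symmetric normal-theory interval to a different level.

    Assumes the interval was built as estimate +/- z * SE.
    """
    z_old = NormalDist().inv_cdf(0.5 + old_level / 2)
    z_new = NormalDist().inv_cdf(0.5 + new_level / 2)
    estimate = (lower + upper) / 2        # midpoint of a symmetric interval
    se = (upper - lower) / (2 * z_old)    # back out the standard error
    half_width = z_new * se
    return estimate - half_width, estimate + half_width


# Hypothetical 0.95 interval for a difference in means
print(rescale_interval(-1.0, 5.0, new_level=0.88))
```

The same function works for any pair of levels, which makes the arbitrariness of 0.95 easy to see: the data and model are unchanged, only the multiplier moves.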

As you know I’d go with Bayesian posterior inference, but I think your statement goes too far. Confidence intervals have problems but are pragmatic solutions that at least move in the right direction and contain far more information than p-values.

Very true, but IMHO does not fully cover the foundational pitfalls.

I don’t dislike this in any sense—I’m just not used to it enough yet. I use Shannon information when teaching the pitfalls of dichotomization of measurements (e.g., “hypertension” has 1 bit of information, and systolic blood pressure as usually measured has 4).
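
One way to arrive at figures like these is the Shannon entropy of a discretized measurement. The sketch below assumes equally likely categories, which is a simplification for illustration only: a yes/no label then carries 1 bit, and a reading resolved into 16 equally likely levels carries 4. The function name is hypothetical.

```python
from math import log2


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


print(entropy_bits([0.5, 0.5]))      # dichotomized label, equally likely classes
print(entropy_bits([1 / 16] * 16))   # 16 equally likely measurement levels
print(entropy_bits([0.3, 0.7]))      # an unbalanced label carries less than 1 bit
```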

Thanks for the thought-provoking interchange.


#22

Frank, as an example of what I mean, take a look at the article by Brown et al. JAMA 2017;317:1544, whose abstract reports the following results:
“adjusted HR [Cox-model hazard ratio] 1.59 [95% CI, 1.17-2.17]). After inverse probability of treatment weighting based on the high-dimensional propensity score, the association was not significant (HR, 1.61 [95% CI, 0.997-2.59])”
from which it offers in the conclusion:
“in utero serotonergic antidepressant exposure compared with no exposure
was not associated with autism spectrum disorder in the child.”
I see this kind of nonsense all the time in leading medical journals. I conclude that confidence intervals are no better than significance tests for stopping serious statistics abuse. And I have seen Bayesian tests and intervals in these sorts of venues get abused in exactly the same sort of way. This is a psychosocial issue that statisticians have shown themselves ill-equipped to deal with in trying to shift the problem to some sort of foundational question. Why? Because it’s not a statistical issue, it’s a psychosocial one stemming from people and the system they’ve created. That’s why changing methods will only move this inference dustpile around under the statistical rug - people will still find ways to distort information using biased narrative descriptions (whether to hide unwanted results or trumpet dubious discoveries). Mitigation will take better reviewing and monitoring of narratives (not just methods) than is currently the norm (and that will be fought hard by those most committed to past distortions, like JAMA).


#23

There is only one thing that I disagree with in this post: "p-values may be dispensed with when frequentist analysis is used". Since the easiest way to calculate what matters, the (minimum) false positive risk, is from the p-value, if the p-value is not stated, one has to extract it from the confidence interval. That’s easy but irritating.
https://arxiv.org/abs/1802.04888
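
For anyone who wants to skip the irritation, the back-calculation can be sketched as follows. It assumes the interval is a Wald-type interval that is symmetric on the log scale, which is only approximately true for published hazard ratios; the function name is hypothetical, and the inputs are the two results quoted from Brown et al. earlier in the thread.

```python
from math import log
from statistics import NormalDist


def p_from_ratio_ci(estimate, lower, upper, level=0.95):
    """Approximate two-sided P-value for ratio = 1 from a reported interval."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    se = (log(upper) - log(lower)) / (2 * z_crit)   # SE of the log ratio
    z = log(estimate) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


print(p_from_ratio_ci(1.59, 1.17, 2.17))    # Cox model result
print(p_from_ratio_ci(1.61, 0.997, 2.59))   # HDPS-weighted result
```

Under these assumptions the two P-values come out at roughly 0.003 and 0.05, even though the point estimates are nearly identical.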


#24

That is indeed a wonderful teaching example (and a terrible example of statistical thinking). This speaks to your emphasis on untying confidence intervals from whether they contain the null. I could use one of your earlier arguments to conclude that it’s not necessarily the foundations but the poor use in practice that we’re seeing in this example (but I believe that both are happening here).

I do feel that changing the narrative makes things better. Sometimes you have to change a system to get people to wake up. If going Bayes, though, we need to go all the way in this sense: as you’ve said, a 0.95 credible interval still requires the arbitrary 0.95 choice. Bayesians prefer to draw the entire posterior distribution. My own preference is to show the posterior density superimposed with subjective probabilities of clinical interest: probability of any efficacy, probability of clinical similarity, probability of “big” efficacy. Except for the probability of any efficacy, these require clinical judgments (separate from priors, data model, and other posterior probabilities) to define values to request posterior probabilities for.
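
To make the three probabilities concrete, here is a minimal sketch. It assumes the posterior for the treatment effect is well approximated by a normal distribution, and every number in it (posterior mean and SD, similarity margin, "big" threshold) is hypothetical; in a real analysis the probabilities would be computed from posterior draws and the thresholds would come from clinical judgment.

```python
from statistics import NormalDist

# Hypothetical posterior for a treatment effect (positive = benefit),
# approximated as normal; in practice use posterior draws instead.
posterior = NormalDist(mu=3.0, sigma=2.0)

similarity_margin = 1.0   # assumed: |effect| below this counts as clinically similar
big_effect = 5.0          # assumed: effect above this counts as "big"

p_any = 1 - posterior.cdf(0.0)
p_similar = posterior.cdf(similarity_margin) - posterior.cdf(-similarity_margin)
p_big = 1 - posterior.cdf(big_effect)

print(f"P(any efficacy)        = {p_any:.3f}")
print(f"P(clinical similarity) = {p_similar:.3f}")
print(f"P(big efficacy)        = {p_big:.3f}")
```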


#25

Just one addition: Sander’s teaching example is discussed in our new paper (Amrhein, Trafimow & Greenland) https://peerj.com/preprints/26857


#26

Frank: Bear in mind that a frequentist can parallel every presentation and narrative maneuver of a Bayesian, and vice versa. Embedding both in a hierarchical framework there is a straight mapping between them (Good, Am Stat 1987;41: 92). In your description you could instead show the corresponding penalized-likelihood function or penalized P-value (“confidence”) function with the points of clinical interest called out. Either would achieve the same goal of helping change the narrative toward a continuous one, away from the dichotomania that has dominated so much of the research literature since Fisher’s heyday. This shows that the issue is not foundational or mathematical, it is presentational and thus cognitive.
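
As a rough sketch of what such a P-value function looks like, the code below evaluates a two-sided Wald P-value over a grid of hypothesized hazard ratios. It is the plain, unpenalized version, it assumes normality on the log scale with the standard error backed out of the reported interval, and the function name and grid are hypothetical. The inputs are the HDPS-weighted result quoted from Brown et al.

```python
from math import log
from statistics import NormalDist


def p_value_function(estimate, lower, upper, hypothesized, level=0.95):
    """Two-sided Wald P-value for each hypothesized ratio (unpenalized)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    se = (log(upper) - log(lower)) / (2 * z_crit)   # SE of the log ratio
    return [
        2 * (1 - NormalDist().cdf(abs(log(estimate) - log(h)) / se))
        for h in hypothesized
    ]


# HDPS-weighted result quoted from Brown et al.: HR 1.61 (0.997 to 2.59)
grid = [0.8, 1.0, 1.25, 1.61, 2.0, 2.5, 3.0]
for h, p in zip(grid, p_value_function(1.61, 0.997, 2.59, grid)):
    print(f"HR = {h:<5} p = {p:.3f}")
```

Read as a curve rather than a cutoff, this shows that under the same model a hazard ratio of 2 is more compatible with the data than the null value of 1.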

The resistance I have encountered in medical-journal venues to such change is staggering, with editors attempting to back-rationalize every bad description and practice they have published (often appealing to anonymous “statistical consultants” for authoritative validation). Perhaps that’s unsurprising since to see the problem would be to admit that their policies have led to headlining of false statements (and thus harming of patients) for generations. The only idea I have for effecting change is to begin prominent publicization of each specific error (such as the Brown example) in the hopes that external outrage and protest accumulates enough to start having an impact.


#27

Sander, it’s probably best to keep that disagreement to another topic. Just one example to show the depth of the disagreement: frequentists would have a very hard time doing the equivalent of computing the posterior probability that at least 3 of 5 outcomes are improved by a treatment, where ‘improved’ means a positive difference in means between treatments.
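
On the Bayesian side, this probability is a simple count over joint posterior draws. The sketch below is only an illustration: the draws are simulated from a made-up correlated normal posterior (a shared component plus independent noise), whereas real draws would come from the fitted model, and the function name is hypothetical.

```python
import random


def prob_at_least_k(draws, k):
    """Share of posterior draws in which at least k differences are positive."""
    hits = sum(1 for d in draws if sum(x > 0 for x in d) >= k)
    return hits / len(draws)


# Hypothetical joint posterior for 5 mean differences: a shared component
# induces correlation between outcomes. Real draws would come from MCMC.
random.seed(1)
means = [0.4, 0.2, 0.1, -0.1, 0.3]
draws = []
for _ in range(10_000):
    shared = random.gauss(0, 0.3)
    draws.append([m + shared + random.gauss(0, 0.3) for m in means])

print(prob_at_least_k(draws, k=3))
```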

I don’t doubt this in the least. But I still want to resist it. One highly regarded statistician (I forget who) stated that he will not allow his name to be on a paper that uses the phrase “statistical significance”. I take various strong stands as a frequent reviewer for medical journals, and find myself frequently disagreeing with clinician reviewers on statistical issues. For Bayes it’s a cart or the horse issue sometimes. Until it’s used more often, journal editors will resist.

Besides publicizing errors the usual way, a little Twitter shame can help, plus journal club blogs. I think the low-hanging fruit is the use of cutoffs for anything, and the ‘absence of evidence is not evidence of absence’ error.


#28

Frank: This disagreement is central to the topic heading here as well as to its sister topic of communicating Bayesian results. In all this I am surprised by your failure to see the frequentist-Bayes mapping, which is key to proper use and description of both types of statistics. Please read Good (1987 at least; it’s one page).

If we treat the hierarchical model used for your Bayesian odds (prior+conditional data model) as a two-stage (random-parameter) sampling model, we can compute a P-value for the hypothesis that the treatment improves at least 3 of 5 outcomes. This computation can use the same estimating function as in classical Bayesian computation, where the log-prior is treated as a penalty function added to the loglikelihood. The penalized-likelihood-ratio P-values are then Laplace-type approximations to posterior probabilities (and excellent ones in regular GLMs). But even better the mapping provides checks on standard overconfident Bayesian probabilities such as “the” posterior probability that at least 3 of 5 outcomes are improved by treatment.

First, the mapping provides a cognitive check in the form of a sampling narrative: If treating the prior as a parameter sampler (or its log as a frequentist penalty) looks fishy, we are warned that maybe our prior doesn’t have a good grounding in genuine data. Such narrative checks are essential; without them we should define a Bayesian statistician as a fool who will base analyses on clinical opinions from those whose statistical understanding he wouldn’t trust for a nanosecond and whose beliefs have been warped by misreporting of the type seen in Brown et al. (which was headlined in Medscape as showing “no link”).

Going “full-on Bayes” is dangerous, not only when it fails to check its assumptions within a contextual sampling narrative, but when it misses Box’s point that there are frequentist diagnostics for every Bayesian statement. In your example a posterior odds of at least 3:2 on improvement would call for diagnostics on the (prior+data) model used to make that claim, including a P-value for the compatibility of the prior with the likelihood (not a “posterior predictive P-value”, which is junk in frequentist terms).

Frequentist results are chronically misrepresented and hence miscommunicated. Yes their Bayesian counterparts can be phrased more straightforwardly, but that’s not always an advantage because their overconfident descriptions are harder to see as misleading. “The posterior probability” is a prime example because there is no single posterior probability, there is only a posterior probability computed from a possibly very wrong hierarchical (prior+data) model. Similarly there is no single P-value for a model or hypothesis.

How is this relevant to proper interpretation and communication? Well, it would improve frequentist interpretations if they recognized that the size of a computed P-value may reflect shortcomings of the model used to compute it other than falsity of the targeted hypothesis. The hierarchical mapping tells us that Bayesian interpretations would also be improved if they recognized that the size of a posterior probability may only reflect shortcomings of the underlying hierarchical (prior+sampling) model used to compute it rather than a wise bet about the effect.

In my view, providing only a posterior distribution (or P-value function) without such strong conditioning statements is an example of uncertainty laundering (Gelman’s term), just as is calling P-values “significance levels” or intervals summarizing them “confidence intervals.” And I think it will lead to much Bayesian distortion of the medical and health literature. Even more distortion will follow once researchers learn how to specify priors to better ensure the null ends up with high or low posterior probability (or in or out of the posterior interval). Hoping to report the biggest effect estimate you can? Use a reference prior or Gelman’s silly inflated-t(1)/Cauchy prior on that parameter. Hoping instead to report a “null finding”? Use a null spike or a double exponential/Lasso prior around the null. Via the prior, Bayes opens up massive flexibility for subtly gaming the analysis - in addition to the flexibility in the data model that was already there (and which Brown et al. exploited along with dichotomania in switching from a Cox model to HDPS for framing their foregone null conclusions).


#29

Sander,
I’m glad you pushed forward on Frank’s response. I was about to chime in with a less elegant response, though underscoring the gravity of the queries you present as they have real life consequences on patients.


#30

As a clinician educator, I agree with this. I don’t think we can avoid these terms, so best to teach correct use.

I will say that I think it’s irresponsible to use the term “significant” in isolation. Articles should always clarify if they are referring to “statistical significance” or “clinical significance.”

I do appreciate the term “clinical significance” may have multiple interpretations. I favor the use of “clinical significance” as “the practical use of a result”, which is often a combination of effect size and minimally important differences. As such, I think many commentaries on “clinical significance” would benefit from prespecified thresholds/bounds and equivalence testing.
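
As a rough illustration of how prespecified bounds and equivalence testing fit together, here is a sketch of the two one-sided tests (TOST) procedure under a normal approximation. The function name, bounds, estimates, and standard error are all hypothetical.

```python
from statistics import NormalDist


def tost_p_value(estimate, se, lower_bound, upper_bound):
    """TOST P-value (normal approximation) for lower_bound < effect < upper_bound."""
    z = NormalDist()
    p_lower = 1 - z.cdf((estimate - lower_bound) / se)   # H0: effect <= lower_bound
    p_upper = z.cdf((estimate - upper_bound) / se)       # H0: effect >= upper_bound
    return max(p_lower, p_upper)


# Hypothetical: prespecified minimally important difference of +/- 3 units
print(tost_p_value(estimate=0.5, se=1.0, lower_bound=-3.0, upper_bound=3.0))
print(tost_p_value(estimate=2.5, se=1.0, lower_bound=-3.0, upper_bound=3.0))
```

The P-value should still be reported precisely; the bounds only define which effects would count as practically negligible.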


#31

Raj & Zad: For the record, I disagree completely with retaining any usage of “statistical significance” - it’s misleading in so many ways relative to ordinary English and intuition, so much so that it should be treated like the “n” word, a historical shame. “Significance” describes nothing about P-values accurately or correctly, and needs to be reclaimed for its ordinary English practical meaning. Even more so for “confidence”, which is so misleading compared to ordinary English, as it is no basis for confidence except under circumstances far more idealized than realized. A “confidence interval” does nothing more than show two points on a cross-section (one-parameter slice) of a P-value function.

These bad word choices are in my view the most destructive legacy of the otherwise statistical giants Fisher and Neyman - in fairness to them, they could not have anticipated the disaster that would ensue from mere words, as they were statisticians, not language philosophers. Please reconsider your remarks in light of these considerations.


#32

Sander, what language would you prefer to use in research settings where the primary purpose of the analysis is to make a decision based on a comparison of observed p to a justified α?

For example, in your recent TAS paper on p-value behavior, you state the following,

"It may be argued that exceptions to (4) arise, for example when the central purpose of an analysis
is to reach a decision based solely on comparing p to an α, especially when that decision rule has
been given explicit justification in terms of false-acceptance (Type II) error costs under relevant
alternatives, as well as false-rejection (Type I) error costs (Lakens et al. [2018]), as in quality control
applications."

I think I’m absolutely fine renaming “confidence intervals” as “compatibility intervals” or as Gelman often writes, “uncertainty intervals.”


#33

I am currently considering writing comments and short notes to journals in my field (ecology and animal behavior) – it is so easy to spot the low hanging fruit, most often as proofs of the null hypothesis. If everyone would do this in their field, in a kind of concerted action, perhaps we would have a real chance to change something?


#34

Sander, what language would you prefer to use in research settings where the primary purpose of the analysis is to make a decision based on a comparison of observed p to a justified α?

In much of cell biology (basic biomedical but far from clinical), researchers state alpha up front but then do not seem to actually use it as a hard decision rule. In comparing contemporary papers with those from before 2000 (and especially before 1990) the things that seem to guide decision making are no different today than 30 years ago. The difference is that in today’s papers stuff is more quantified in a way to do a t-test or ANOVA (like the results of various blots) and there are 100s of reported n.s., *, **, and *** (and even ****) per paper, compared to no asterisks and no t-tests (or maybe 1 or 2) 30 years ago. Even today, the “decisions” that the researchers are making do not seem to be guided much by the statistics. Basically, if it looks like an effect (based on the plot of the data), and the direction agrees with their working model of how the system works, they probe a little deeper with different experiments.

So even though these researchers state an alpha, in a sense, the practice is a bit more like Fisher than N-P (since they are not using p-values as hard decision rules) but instead of reporting p-values, they’ve quart- or quint-chotomized the p-value into n.s., *, **, ***, and ****.

Back to the original question - most of this is irrelevant in cell biology because a typical paper reports > 100 p-values or sets of asterisks and for the most part, they are simply reported and not explicitly interpreted. So how could the reporting in cell biology papers be improved?

  1. drop the initial statement of alpha
  2. report p-value and not asterisk
  3. drop all uses of the word “significant”

#35

Bayesian-frequentist mapping is only available in specific situations, and even when the mapping is simple I don’t like to think this way because frequentists do not like the Bayesian interpretation.

Frequentist methods cannot handle this in general, because hierarchical models do not cover the full spectrum of models. Compare for example with copula multivariate models for multiple outcomes of different types, where the types cannot be connected by a common random effects distribution without at least having scaling issues.

I think our fundamental disagreement, which will not be fixable no matter how long the discussion, is that you feel you are in a position to question any choice of prior distribution, and you are indirectly stating that a prior must be “right”. I only believe that a prior must be agreed upon, not “right”, and I believe that even a prior that is a bit off to one reviewer will most often provide a posterior distribution that is defensible and certainly is more actionable.

Your point about Bayesian analysis depending on prior + model is of course correct. But Bayes in total makes fewer assumptions than frequentist. That’s because frequentist requires a whole other step to translate evidence about data to optimal decisions about parameters. This requires many assumptions about sample space that are more difficult to consider than changing a Bayesian prior. But let’s stick to optimizing frequentist language here for now.


#36

I’d walk this back even further. Whether the prior is “agreed upon” is far less important than that it constitutes objective knowledge which is therefore criticizable. That is, an explicit prior enriches and enlarges the critical, scientific discussion we can have together. I suspect that one reason this convo is not moving toward resolution is that it has been framed through a question about alternative ways to make statements that have precisely the opposite character—i.e., statements about statistical ‘results’ that are meant to close off critical discourse!


#37

This is brilliant and extremely useful. Just one suggestion. In the negative case, you could make the correct statement more general by expanding this sentence:
“More data will be needed.”
->
“More data, or more accurate data or more relevant data will be needed.”

  • accurate data to emphasise measurement precision
  • relevant data to emphasise the link between data and the hypothesis being tested

#38

David I agree with that for the majority of cases. But not all. I think that evidence that is intended to convince a skeptic can validly use a skeptical prior that incorporates no prior information. This is essentially Spiegelhalter’s approach. I do want to heartily endorse the criticizable part.

For sure - and I think that claiming ‘statistical significance’ or lack thereof in a medical paper is a great way to shut down thinking!


#39

Zad: In that situation how about reporting the decision rule with α in the methods section, then the P-value and decision in the results, as recommended (for example) by Lehmann in the mid-20th century bible of NP theory, Testing Statistical Hypotheses? Many authors in that (my grandparents!) generation used the term “significance” to describe p (Fisher, Cox) and others (like Lehmann) used it to describe α, so presciently enough Neyman when I knew him avoided using the word for either. I avoid it, Frank avoids it, and anyone can: p is the P-value, α is the alpha-level, and the decision is reported as “rejected by our criterion” or “not rejected by our criterion” (please, not “accepted”!). Even in that decision situation, Lehmann, Cox and many others advised reporting p precisely, so that a reader could apply their own α or see how sensitive the decision was to α.

As for “uncertainty interval”, sorry but no, because:

  1. that is already used by some Bayesians to describe posterior intervals, and rightly so because
  2. it’s yet another misuse of a word: “uncertainty” is a subjective observer state, whereas the P-value and whether it is above or below α (as shown by the interval for multiple points) is just a computational fact, regardless of anyone’s uncertainty; in fact
  3. no interval estimate I see in my field captures anywhere near the uncertainty warranted (e.g., warranted by calibration against a credible model) - they leave out tons of unknown parameters. That means those “left out” parameters have been set to default (usually null) values invisibly to the user and consumer. (In my work there are typically around 9 unknown measurement-error parameters per association; only on rare occasions do even a few get modeled.)

At best a compatibility (“confidence”) interval is just a summary of uncertainty due to random error alone, which is to say the conditional uncertainty left after assuming the model used to compute the interval is correct (or wrong only in harmless ways). But that’s an assumption I would never believe or rely on in my work, nor in any example from Gelman’s area.

My conclusion is that using “uncertainty” in place of “confidence” is just substituting a bad choice for a horrific one.


#40

So I’m assuming from your response that you’d be fine with describing them as compatibility intervals?