Language for communicating frequentist results about treatment effects

f2harrell · November 14, 2018, 3:22am

I don’t want to say “significant” anywhere. It implies a p-value cutoff and is unclear (e.g., doesn’t distinguish statistical from clinical significance). And “we do not have enough data” is relative. The confidence interval tells you how much data you have so I don’t feel we need to add language to that. “Much uncertainty” could also be called relative and I’d rather let the interval speak.

I would use the confidence interval for that, without having to do a different statistical test.

mwebb · November 14, 2018, 5:44am

It’s incorrect to interpret the confidence interval as the probability that B-A is between the lower and upper limit of the interval. That’s stats 101.

But isn’t the confidence interval equivalent to the Bayesian credible interval with a flat prior? At least asymptotically? From this perspective, isn’t the above interpretation of the confidence interval sort of correct?

simongates · November 14, 2018, 8:09am

May be stats 101 but almost nobody understands this.

2nd point; yes, sort of. The numbers may be the same but the interpretation isn’t, so I wouldn’t say they are equivalent.

simongates · November 14, 2018, 8:53am

My modest contribution: https://osf.io/725sz/
Incorrect statements abound!

f2harrell · November 14, 2018, 12:55pm

That’s my take. Since frequentists don’t believe that probabilities are associated with model parameters, whether you can find special cases of equivalence doesn’t matter very much. You have to go “all in” with Bayes in order to get the interpretation clinicians seek, e.g. the probability is 0.95 that the unknown treatment effect is between -5 and 13mmHg. Also note that the special case breaks down entirely if the sample size is not fixed in advance or if there is more than one look at the data. Bayesians have no way nor need to adjust for multiplicity of this kind.

Simon I’m so glad to be reminded about this super applicable paper.

Bob_Calin-Jageman · November 14, 2018, 5:52pm

For a 95% CI, how about the straightforward interpretation of a prediction interval:

“There is an 83% chance a replication with the same N will obtain a result within this interval”

To my mind, this is the most useful interpretation. Note the capture rate is not 95% because both estimates are subject to sampling error. I think that means that as the sample size for the replication grows toward infinity, the expected capture rate grows toward the confidence level.

Reference is Cumming & Maillardet, 2006, http://psycnet.apa.org/buy/2006-11159-001.
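
The 83% figure and its limiting behavior can be checked with a short calculation. This is a minimal sketch, assuming normal sampling with a known, common standard deviation (a simplification of the setting in the cited paper); the function name `capture_probability` and the parameter `n_ratio` are illustrative, not from the paper.

```python
from statistics import NormalDist

def capture_probability(conf_level=0.95, n_ratio=1.0):
    """Chance that a replication mean lands inside the original CI.

    Assumes normal sampling with a known, common standard deviation.
    n_ratio is the replication sample size divided by the original one.
    """
    nd = NormalDist()
    z = nd.inv_cdf(0.5 + conf_level / 2)  # about 1.96 for a 0.95 interval
    # The difference between the two means has variance
    # SE^2 * (1 + 1/n_ratio), where SE is the original standard error.
    scale = (1 + 1 / n_ratio) ** 0.5
    return 2 * nd.cdf(z / scale) - 1

for ratio in (1, 4, 100, 1e6):
    print(f"n_ratio={ratio:g}: capture = {capture_probability(0.95, ratio):.3f}")
```

With `n_ratio=1` the result is about 0.83, and it approaches 0.95 as the replication sample grows, which matches the reasoning above.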

I’m curious if you object to these or not. I’m also curious if you think the same statements can be made about Bayesian credible intervals. I can’t quite wrap my head around it, but I think Bayesian credible intervals are not prediction intervals.

f2harrell · November 14, 2018, 9:23pm

I don’t feel that will be quite satisfying for the majority of collaborations. And I think that evidentiary statements need to be based on the parameter space, not the data space.

For Bayes this is quite clean: draw samples from the posterior predictive distribution. But again, the data space is not really what’s called for IMHO.

Bob_Calin-Jageman · November 14, 2018, 9:55pm

I’m sorry, but I’m not sure if I understand the response.

For myself, as a scientist, I am quite happy to use CIs as prediction intervals–they help calibrate my expectations of what to expect if my lab (or another) replicates the procedure. When the CI is very long, I must sadly admit that the replication could provide almost any outcome and that my current work provides almost no guidance in what to expect. When the CI is very short I can make a relatively precise prediction that is helpful for planning the replication and interpreting the outcome. I guess every scientist is different, but I want nothing more than to properly calibrate my expectations about future research outcomes.

Could you expand a bit on the Bayesian credible interval as a prediction interval? I know Bayesians don’t define probability in terms of frequency. So does a credible interval have any meaningful interpretation in terms of expected frequencies of replication outcomes? Do the qualities of a credible interval as a prediction interval depend strongly on the prior?

f2harrell · November 14, 2018, 11:34pm

I think that’s a very good use of the prediction interval. But I don’t see it as useful for making an inference about the unknown effect.

Right, Bayesian methods don’t use the frequency interpretation of probability very often. I wasn’t speaking of using a credible interval for a prediction interval. A credible interval is a posterior distribution interval, which is a statement about an unknown parameter. On the other hand, you can compute the Bayesian posterior predictive distribution to simulate new data or new group means based on the current signal about the unknown parameters, and the prior.
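
To make the distinction concrete, here is a minimal sketch of drawing new group means from a posterior predictive distribution. It assumes a conjugate normal model with known sigma and a normal prior, which is a simplification chosen only for illustration; all numbers and function names are hypothetical.

```python
import random
from statistics import mean, stdev

def normal_posterior(prior_mean, prior_sd, ybar, sigma, n):
    """Posterior for a normal mean with known sigma and a normal prior."""
    prior_prec = 1 / prior_sd**2
    data_prec = n / sigma**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ybar)
    return post_mean, post_var**0.5

def predictive_means(post_mean, post_sd, sigma, n_new, draws=20000, seed=1):
    """Draw new group means from the posterior predictive distribution."""
    rng = random.Random(seed)
    out = []
    for _ in range(draws):
        mu = rng.gauss(post_mean, post_sd)               # parameter uncertainty
        out.append(rng.gauss(mu, sigma / n_new**0.5))    # sampling variability
    return out

# Hypothetical numbers: observed mean effect 4 mmHg, sigma 10, n 50, vague prior.
pm, psd = normal_posterior(prior_mean=0, prior_sd=100, ybar=4, sigma=10, n=50)
sims = predictive_means(pm, psd, sigma=10, n_new=50)
print(f"posterior for the parameter: mean={pm:.2f}, sd={psd:.2f}")
print(f"predictive for a new mean:   mean={mean(sims):.2f}, sd={stdev(sims):.2f}")
```

The predictive spread is wider than the posterior spread because it adds the sampling variability of the new study to the uncertainty about the parameter. The credible interval describes the first line of output, not the second.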

Madhu · November 15, 2018, 4:43am

This is a very insightful and helpful thread. I wanted to jump in with a quick question which I often encounter. In the drug trial for BP reduction, the effect for B-A was, say, 3 mmHg with a 0.95 confidence interval of (-1, 6) and a p-value of 0.06. How should we interpret this? Does the interpretation that “we did not find evidence against the null hypothesis of A=B” still hold in this case? What if the point estimate was 6 mmHg, 0.95 CI (-1, 13) and p-value 0.06? Thank you!

Sander · November 15, 2018, 5:35am

For the record, I disagree with a lot of what is given here.
To start, I don’t think this statement is perfectly accurate:
“We were unable to find evidence against the hypothesis that A=B (p=0.4) with the current sample size.”
The statement claims zero evidence against the tested hypothesis, whereas I would say instead the p=0.4 constitutes almost no evidence against that hypothesis (by the Shannon surprisal measure I use, p=0.4 supplies s = log2(1/0.4) = 1.3 bits of information against the hypothesis - barely more information than in one coin toss). The sample size addition is unnecessary if harmless, as p combines (some would say confounds) sample size and estimate size and a larger estimate would have likely produced a smaller p at the same sample size (only “likely” because exceptions can occur when the standard error inflates faster than the estimate at a given sample size).

Conversely, I think it is misleading to say “The study found evidence against the hypothesis that A=B (p=0.02)” because there is no magic point at which P-values start to “find evidence” against hypotheses. Instead one could say “The study found log2(1/.02) = 5.6 bits of information against the hypothesis that A=B.” The ultimate goal is to get away from dichotomies like “negative” and “positive” trials - those are all experimental tests of the hypothesis, and their P-values measure only the amount of information each supplies against that hypothesis.
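
The S-value arithmetic quoted in this post is easy to reproduce. A minimal sketch, where the helper name `s_value` is illustrative:

```python
from math import log2

def s_value(p):
    """Shannon surprisal in bits: s = -log2(p) = log2(1/p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -log2(p)

for p in (0.4, 0.05, 0.02):
    print(f"p = {p}: s = {s_value(p):.1f} bits")
# p = 0.4: s = 1.3 bits
# p = 0.05: s = 4.3 bits
# p = 0.02: s = 5.6 bits
```

One way to read the scale: s bits is about as surprising as seeing s heads in a row from a coin assumed to be fair.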

I might agree with the sentiment behind “As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B”, but it also seems wrong as stated because

it seems to confuse the pre-specification problem with the fact that P-values do not (in isolation, at least) measure support, period, only degree of conflict (incompatibility) between models and data, and
a frequentist can assess similarity indirectly by specifying a similarity (equivalence) interval and seeing whether there is more than a given amount of evidence against the true difference being outside that interval. Thus I would restate it as something like “As the statistical analysis plan did not specify an interval of equivalence, we did not assess similarity of A and B.”
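
One common way to carry out the second point is the two one-sided tests (TOST) procedure. The sketch below assumes a normal approximation and a symmetric equivalence margin; the function name, the margin, and the example numbers are hypothetical and would have to come from the analysis plan.

```python
from statistics import NormalDist

def equivalence_p(estimate, se, margin):
    """Two one-sided tests (TOST) under a normal approximation.

    Returns the larger of the p-values for the hypotheses
    'true difference <= -margin' and 'true difference >= +margin'.
    A small value is evidence against the true difference lying
    outside the interval (-margin, +margin).
    """
    nd = NormalDist()
    p_lower = 1 - nd.cdf((estimate + margin) / se)  # tests diff <= -margin
    p_upper = nd.cdf((estimate - margin) / se)      # tests diff >= +margin
    return max(p_lower, p_upper)

# Hypothetical: estimate 1 mmHg, standard error 1.5, equivalence margin 5 mmHg.
print(f"TOST p = {equivalence_p(1.0, 1.5, 5.0):.4f}")
```

As with any P-value, the result is better read on a continuous scale (or as an S-value) than compared against a cutoff.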

I also think this statement commits a potentially important if standard mistake of omission:
“Assuming the study’s experimental design and sampling scheme, the probability is 0.4 that another study would yield a test statistic for comparing two means that is more impressive than what we observed in our study, if treatment B had exactly the same true mean as treatment A.” It needs amendment to something like
“Assuming the study’s sampling scheme, experimental design, and analysis protocol, the probability is 0.4 that another study would yield a test statistic for comparing two means that is as or more impressive than what we observed in our study, if treatment B had exactly the same true mean as treatment A and all statistical modeling assumptions used to get p are correct or harmless.” That’s because almost all statistics I see in my areas are derived from regression models (whether for the outcome, treatment, or both). Sorry if the added conditions seem nit-picky, but it is not hard to find examples where their failure has a non-negligible effect.

Next, I think this statement is commonly believed and simply wrong from an information perspective: " we cannot interpret the interval except for its properties in the long run." No: As noted just below that statement, the 95% interval can be interpreted for this single data analysis as the interval of all parameter values for which p>0.05. Thus the interval shows the parameter values for which the data supply less than log2(1/0.05) = 4.3 bits of information against them, given the model used to compute the interval. This interpretation does rely on the repeated-sampling property that, given the model, the random P across studies is uniform at the correct value; this property ensures that the S-value captures the refutational information in the test statistic (note that posterior predictive P-values are not uniform and thus do not sustain this interpretation).

Finally, if not clear from the above, I disagree that P-values are wisely dispensed with in favor of confidence intervals. Confidence intervals invite the distortion of 0.05-level dichotomization. The problem with P-values is that almost no one computes them for more than the null hypothesis. They should instead be given not only for the targeted “null” but also at least for one more hypothesis, including the protocol-specified alternative used to compute power or sample size.
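
As a sketch of what reporting P-values for more than the null could look like, the following computes p and the corresponding S-value at several hypothesized effects. It assumes a normal approximation; the estimate, interval, and alternative of 10 mmHg are hypothetical numbers chosen for illustration.

```python
from math import log2
from statistics import NormalDist

def p_value(estimate, se, hypothesized):
    """Two-sided p-value for a hypothesized effect (normal approximation)."""
    z = abs(estimate - hypothesized) / se
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical trial: estimate 4 mmHg, 0.95 interval (-5, 13),
# protocol-specified alternative of 10 mmHg.
estimate, lower, upper = 4.0, -5.0, 13.0
se = (upper - lower) / (2 * NormalDist().inv_cdf(0.975))

for label, theta in [("null", 0.0), ("alternative", 10.0),
                     ("lower limit", lower), ("upper limit", upper)]:
    p = p_value(estimate, se, theta)
    print(f"{label:12s} theta={theta:5.1f}  p={p:.3f}  s={-log2(p):.1f} bits")
```

The two interval limits come out at p = 0.05 by construction, which is the single-study reading of the 95% interval described above: it collects the parameter values with p > 0.05.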

zad · November 15, 2018, 7:07am

I believe @Sander raises some important points, especially this,

The common way to mention p is to discuss the assumption of the null hypothesis being true, but few definitions mention that every model assumption used to calculate p must be correct, including assumptions about randomization (assignment, sampling), chance alone, database errors, etc.

and also this point,

I don’t believe abandoning p values is really helpful. Sure they are easy to misinterpret, but that doesn’t mean we abandon them. Perhaps instead, we can encourage the following guidelines:

thinking of them as continuous measures of compatibility between the data and the model used to compute them. Larger p = higher compatibility with the model, smaller p= less compatibility with the model
converting them into S values to find how much information is embedded in the test statistic computed from the model, which supplies information against the test hypothesis
calculate the p value for the alternative hypothesis, and the S value for that too

zad · November 15, 2018, 7:16am

I also can’t say I agree with the ideas here,

If we are to be honest, the concepts of statistical power, statistical significance, and hypothesis testing are likely not going to disappear in favor of estimation, especially in many experimental fields, so I don’t believe it’s very helpful that we abandon the usage of these phrases.

Better that we emphasize teaching what statistical significance means, and how it differs from something clinically significant etc. The meanings and interpretations of statistical significance and non significance should be taught rather than abandoned out of fear of misinterpretation. Attempting to avoid such usage may actually backfire in the long run.

f2harrell · November 15, 2018, 12:58pm

The ideas that p-values are redeemable are not convincing to me. They cause far more damage than benefit. So we’ll have to strongly disagree about this. I agree with many of Sander’s other points, but this one, though technically correct, is not as helpful to the researcher as it appears because it is not phrased in terms of the underlying inference/research question:

This comment is not quite on the mark in the original context because I was meaning to address the most common situation in which researchers and journals are misusing efficacy tests and do not have a similarity or non-inferiority test in mind:

The statement Sander makes with which I agree the most is the idea that there is no cutoff for when evidence is present vs. absent. I need to change the language to reflect this.

Over the next days I plan to edit the original post to reflect some, but not all, of the criticisms. This will be an incremental process as we seek to find acceptable wording that is short enough to actually be included in an article’s abstract.

Sander · November 15, 2018, 6:33pm

“The ideas that p-values are redeemable are not convincing to me. They cause far more damage than benefit.” No, the misuse of P-values as “significance tests” is what causes the damage, especially when they are applied to only one hypothesis (this misuse was an “innovation” of Karl Pearson and Fisher; before them “significance” referred to a directional Bayesian tail probability, not a data-tail probability).

Neymanian confidence limits show the two points at which p=0.05, which is fine if one can picture how p smoothly varies beyond and between those points. But few users seem able to do that. Plus 95% confidence intervals have the disadvantage of relying on the misleading 0.05 dichotomy whereas at least P-values can be presented with no mention of it. No surprise then that in the med literature I constantly see confidence intervals used only as means to mis-declare presence or absence of an association. This abuse is aggravated by the overconfidence produced by the horrendous adjective “confidence” being used to describe what is merely a summary of a P-value function, and thus only displays compatibility; for that abuse of English, Neyman is as culpable for the current mess as Fisher is for abuse of the terms “significance” and “null”.

The point is that confidence intervals are just another take on the same basic statistical model, and as subject to abuse as P-values. Thus if one can’t “redeem” P-values then one should face up to reality and not “redeem” confidence intervals either. That’s exactly what Trafimow & Marks did (to their credit) when they banned both p-values and confidence intervals from their journal. Unfortunately for such bold moves, the alternatives are just as subject to abuse, and some like Bayesian tests based on null-spiked priors only make the problems worse.

The core problems are however created by erroneous and oversimplifying tutorials, books (especially of the “made easy” sort, which seems a code for “made wrong”), teachers, users, reviewers, and editors. Blaming the methods is just evading the real problems, which are the profound limits of human competency and the perverse incentives for analysis and publication. Without tackling these root psychosocial problems, superficial solutions like replacing one statistical summary with another can have only very limited impact (I say that even though I still promote the superficial solution of replacing P-values with their Shannon information transform, the S-value -log2(p), which you, Frank, continue to ignore in your responses).

f2harrell · November 15, 2018, 9:19pm

I can’t disagree, but in my opinion when foundational problems with a method lead to the majority of those using it to misuse it, I don’t care so much whether it’s the idea or the people. This is related to Don Berry’s story on why he became Bayesian after failing to teach bright statistics graduate students what a confidence interval really means. He concluded that the foundation is defective.

An excellent point. Richard McElreath in his book Statistical Rethinking made the point by always using 0.88 or some such credible intervals, instead of 0.95. We use 0.95 as a knee jerk default and don’t think often enough of its arbitrariness. At least for normal-theory methods you can derive an interval for any confidence level if you know the 0.95 interval. I caution researchers to not use “does the interval contain zero” thinking, but the temptation is real.
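
The conversion between confidence levels mentioned here can be sketched as follows, assuming a symmetric normal-theory interval (for ratios, apply it to the log-scale limits). The function name `rescale_interval` is illustrative.

```python
from statistics import NormalDist

def rescale_interval(lower, upper, old_level=0.95, new_level=0.88):
    """Convert a symmetric normal-theory interval to another level."""
    nd = NormalDist()
    center = (lower + upper) / 2
    # Recover the standard error from the half-width of the old interval.
    se = (upper - lower) / (2 * nd.inv_cdf(0.5 + old_level / 2))
    half_width = nd.inv_cdf(0.5 + new_level / 2) * se
    return center - half_width, center + half_width

lo88, hi88 = rescale_interval(-1, 13, old_level=0.95, new_level=0.88)
print(f"0.88 interval: ({lo88:.1f}, {hi88:.1f})")
```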

As you know I’d go with Bayesian posterior inference, but I think your statement goes too far. Confidence intervals have problems but are pragmatic solutions that at least move in the right direction and contain far more information than p-values.

Very true, but IMHO does not fully cover the foundational pitfalls.

I don’t dislike this in any sense—I’m just not used to it enough yet. I use Shannon information when teaching the pitfalls of dichotomization of measurements (e.g., “hypertension” has 1 bit of information, and systolic blood pressure as usually measured has 4).

Thanks for the thought provoking interchange.

Sander · November 15, 2018, 10:00pm

Frank, as an example of what I mean, take a look at the article by Brown et al. JAMA 2017;317:1544, whose abstract reports the following results:
“adjusted HR [Cox-model hazard ratio] 1.59 [95% CI, 1.17-2.17]). After inverse probability of treatment weighting based on the high-dimensional propensity score, the association was not significant (HR, 1.61 [95% CI, 0.997-2.59])”
from which it offers in the conclusion:
“in utero serotonergic antidepressant exposure compared with no exposure
was not associated with autism spectrum disorder in the child.”
I see this kind of nonsense all the time in leading medical journals. I conclude that confidence intervals are no better than significance tests for stopping serious statistics abuse. And I have seen Bayesian tests and intervals in these sorts of venues get abused in exactly the same sort of way. This is a psychosocial issue that statisticians have shown themselves ill-equipped to deal with in trying to shift the problem to some sort of foundational question. Why? Because it’s not a statistical issue, it’s a psychosocial one stemming from people and the system they’ve created. That’s why changing methods will only move this inference dustpile around under the statistical rug - people will still find ways to distort information using biased narrative descriptions (whether to hide unwanted results or trumpet dubious discoveries). Mitigation will take better reviewing and monitoring of narratives (not just methods) than is currently the norm (and that will be fought hard by those most committed to past distortions, like JAMA).

DavidColquhoun · November 16, 2018, 12:20pm

There is only one thing that I disagree with in this post: " p-values may be dispensed with when frequentist analysis is used". Since the easiest way to calculate what matters, the (minimum) false positive risk, is from the p value, if the p value is not stated, one has to extract it from the confidence interval. That’s easy but irritating.
https://arxiv.org/abs/1802.04888
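
Extracting the p value from a reported interval can be sketched as below. This assumes a normal-theory (Wald) interval computed on the log scale, so it is only approximate given the rounding in published limits. The hazard ratios are the two quoted from the Brown et al. abstract earlier in the thread; the function name is illustrative.

```python
from math import log
from statistics import NormalDist

def p_from_ratio_ci(estimate, lower, upper, level=0.95, null=1.0):
    """Approximate two-sided p-value from a ratio estimate and its CI.

    Works on the log scale and assumes a normal-theory (Wald) interval.
    """
    nd = NormalDist()
    se = (log(upper) - log(lower)) / (2 * nd.inv_cdf(0.5 + level / 2))
    z = (log(estimate) - log(null)) / se
    return 2 * (1 - nd.cdf(abs(z)))

print(f"adjusted HR 1.59 (1.17, 2.17):  p = {p_from_ratio_ci(1.59, 1.17, 2.17):.3f}")
print(f"weighted HR 1.61 (0.997, 2.59): p = {p_from_ratio_ci(1.61, 0.997, 2.59):.3f}")
```

Under these assumptions the two results come out at roughly 0.003 and 0.05. The point estimates are nearly identical and the second interval is simply wider.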

f2harrell · November 16, 2018, 1:01pm

That is indeed a wonderful teaching example (and a terrible example of statistical thinking). This speaks to your emphasis on untying confidence intervals from whether they contain the null. I could use one of your earlier arguments to conclude that it’s not necessarily the foundations but the poor use in practice that we’re seeing in this example (but I believe that both are happening here).

I do feel that changing the narrative makes things better. Sometimes you have to change a system to get people to wake up. If going Bayes though we need to go all the way in this sense: as you’ve said, a 0.95 credible interval still requires the arbitrary 0.95 choice. Bayesians prefer to draw the entire posterior distribution. My own preference is to show the posterior density superimposed with subjective probabilities of clinical interest: probability of any efficacy, probability of clinical similarity, probability of “big” efficacy. Except for the probability of any efficacy, these require clinical judgments (separate from priors, data model, and other posterior probabilities) to define values to request posterior probabilities for.
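
A minimal sketch of those three posterior probabilities, assuming the posterior for the treatment effect is approximately normal and that positive values mean benefit. The posterior mean and SD, and the thresholds `similarity` and `big`, are hypothetical; in practice the thresholds are the clinical judgments described above.

```python
from statistics import NormalDist

def posterior_summaries(post_mean, post_sd, similarity=2.0, big=5.0):
    """Posterior probabilities of clinical interest for a normal posterior."""
    post = NormalDist(post_mean, post_sd)
    return {
        "any efficacy (effect > 0)": 1 - post.cdf(0),
        f"similarity (|effect| < {similarity})":
            post.cdf(similarity) - post.cdf(-similarity),
        f"big efficacy (effect > {big})": 1 - post.cdf(big),
    }

# Hypothetical posterior: mean 4 mmHg, SD 2.5 mmHg.
for name, prob in posterior_summaries(4.0, 2.5).items():
    print(f"P({name}) = {prob:.2f}")
```

With posterior draws from MCMC instead of a normal approximation, each probability would be the proportion of draws satisfying the condition.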

Valentin_Amrhein · November 16, 2018, 1:10pm

Just one addition: Sander’s teaching example is discussed in our new paper (Amrhein, Trafimow & Greenland) https://peerj.com/preprints/26857