Language for communicating frequentist results about treatment effects

I’ve been following this discussion & it reminded me that 3 years ago I wrote a draft of a paper extending the concept of “fragility” to the derivation of diagnostic algorithms based on statistical metric thresholds. I wasn’t convinced of the value (or indeed correct interpretation) of what I’d done, so I’ve not tried to polish or publish it. Nevertheless, the concept of just how susceptible an algorithm is to the vagaries of chance recruitment of a small number of patients is something that is worth thinking about.
I’ve put the draft I wrote as a pre-print on ResearchGate:

That’s horrible!
I guess there is a discussion to be had about whether the survival benefit is worth it (it’s small, drug probably expensive, side effects) but just dismissing this as “no difference” seems to me not to be serving patients well.

Unfortunately, statisticians have allowed and even promoted the use of the p-value threshold as the exclusive determinant of intervention effectiveness. People that rely on these determinations are rightly concerned about results near the threshold, where the determination might have swung in the opposite direction if only one or two study participants had experienced a different outcome. That we now have the “fragility index” is a consequence of the unfortunate dominance of this approach to inference. I agree with @f2harrell that it’s a band-aid.
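
To make the "one or two participants" point concrete, here is a minimal sketch of how a fragility index is commonly computed for a two-arm trial with a binary outcome. It assumes the usual recipe (switch non-events to events in the arm with fewer events until a two-sided Fisher exact p-value reaches the threshold). The function names and the trial counts are hypothetical, not taken from any study discussed here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are arms; a and c are event counts, b and d are non-event counts.
    """
    n1, n2, k = a + b, c + d, a + c
    denom = comb(n1 + n2, k)

    def pmf(x):
        # Hypergeometric probability of x events in arm 1 given fixed margins
        return comb(n1, x) * comb(n2, k - x) / denom

    p_obs = pmf(a)
    support = range(max(0, k - n2), min(n1, k) + 1)
    # Sum over all tables that are at least as improbable as the observed one
    total = sum(pmf(x) for x in support if pmf(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Fewest non-event -> event switches in the arm with fewer events
    needed to bring the Fisher p-value to alpha or above (0 if already there)."""
    # Put the arm with fewer events first; that is the arm we modify
    if events_1 > events_2:
        events_1, n_1, events_2, n_2 = events_2, n_2, events_1, n_1
    flips = 0
    while events_1 + flips <= n_1:
        e = events_1 + flips
        if fisher_exact_two_sided(e, n_1 - e, events_2, n_2 - events_2) >= alpha:
            return flips
        flips += 1
    return None  # threshold never reached


# Hypothetical trial: 10/100 events in one arm vs 25/100 in the other
print("fragility index:", fragility_index(10, 100, 25, 100))
```

A small index says only that the verdict sits close to the threshold; it adds nothing that the p-value and interval do not already show, which is consistent with calling it a band-aid.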

I wasn’t sure where to put a reference to this interesting paper by Michael Lew on the relationship between (one-tailed) P values and likelihood. It discusses some of the complexities of when P values have to be adjusted (i.e. when used in the NP decision framework, not in Fisher’s). This seemed to be the best thread.

I learned about it in this thread on Deborah Mayo’s post criticizing Royall’s philosophy of statistical evidence:

Anyone who attempts to counter Richard Royall must come well armed …

I found Royall’s evidential perspective via likelihood very helpful. Michael Lew does as well.

Mayo’s critique is that error control is missing from the analysis, and we need some design info for that. But it seems that error control is a function of the experimental design (before data point of view), where we fix the effect to a specific value. It does not seem relevant after data have been collected, as long as we assume the experiment was minimally informative (1-\beta > \alpha).

A post data view is interested in how the observed data supports a range of effects, above and below the value used in the design phase. This would seem to be related to your post on Bayesian power.

Perhaps it would be better to look at inference as precise estimation, rather than Mayo’s concept of “severe tests.”

I just thought Lew’s paper was a useful link between what people currently use now, and how they related to a better measure (likelihoods) and hoped to get some expert input on that.

Exactly, and remember that she (and many others) uses the word “error” in a problematic way IMHO. She is speaking of the probability of making an assertion of an effect when the effect is exactly zero. In my view a true error probability is P(effect <= 0 | data) when you act from the data as if effect > 0.


Hi All

As per the advice of Prof Harrell – I will repost this question here:

As a physician writer, I inquire as to the best way to describe a numerically positive but nonsignificant result.

In a column about European guideline writers’ dismissal of the ORBITA trial–a sham-controlled RCT powered to detect a 30-second improvement in exercise time with PCI–I described the results this way:

"The investigators went to great lengths to blind patients and treating clinicians. In the end, PCI improved exercise time, but the difference did not reach statistical significance."

But a commenter opposed this wording:

“Regarding the ORBITA trial you write “In the end, PCI improved exercise time, but the difference did not reach statistical significance.” Properly, if the difference did not reach statistical significance then the null hypothesis was not disproved and it is therefore illegitimate to deduce the positive result and contradictory (and misleading) to write it this way. That’s the point and purpose of the statistical analysis. Wish everybody would respect this niggling but critical distinction.”

I’ve come to learn that writing ORBITA “was negative” is incorrect. I also think writing “there was no difference” is not correct either.

What say the experts?


You would say that the data was not sufficiently conclusive to interpret the direction of the effect (e.g. a mean difference).

The study was “negative” in the sense that not enough (clear) data were gathered to interpret even the direction of the effect with sufficient confidence. This is not at all the same as stating that “there is no effect”. In fact, there might be a very relevant effect, but the lack of (sufficiently clear) data simply does not allow a confident conclusion.

It may be helpful to look more closely at the confidence interval. If it extends into regions that are obviously (practically) relevant, the data can still be considered compatible with such relevant effects (which implies no more than that further studies might be worthwhile to get a better idea). If it does not extend into relevant regions, the data are considered incompatible with relevant effects, so there is some indication that, whatever sign the effect may have, the effect is unlikely to be relevant.

An entirely different approach is to attempt an estimation of the effect, that is, to frame a probability statement about the effect, given our understanding of the process and the data gathered in the study. This is called a “Bayesian analysis”, where the model-based likelihood of the data is used to modify a prior probability distribution over the effect of interest. The result is the posterior distribution reflecting your opinion and your uncertainty about the effect (rather than a simple yes/no answer or a decision about the sign of the effect).

A previous trial of unblinded PCI improved exercise time by 96 seconds. Anti-anginal drugs improve exercise time by about 45 seconds. ORBITA was powered to detect a conservative 30-second increase. The authors write: “We calculated that, from the point of randomisation, a sample size of 100 patients per group had more than 80% power to detect a between-group difference in the increment of exercise duration of 30 seconds, at the 5% significance level, using the two-sample t test of the difference between groups.”
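
As a rough check on the quoted design statement, the sketch below back-solves the standard deviation at which 100 patients per group give exactly 80% power for a 30-second difference. The SD is not stated in this thread, so the value it produces is an implied assumption, and the calculation uses a normal approximation to the two-sample t test rather than the authors' exact method. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample comparison of means
    (normal approximation; equal SDs and equal group sizes assumed)."""
    se = sd * sqrt(2 / n_per_group)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    # Probability of rejecting in the direction of the true effect
    return 1 - Z.cdf(z_crit - delta / se)


def sd_for_power(delta, n_per_group, power=0.80, alpha=0.05):
    """Largest SD at which the design still reaches the target power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_pow = Z.inv_cdf(power)
    return delta * sqrt(n_per_group / 2) / (z_crit + z_pow)


implied_sd = sd_for_power(delta=30, n_per_group=100)
print(f"implied SD for 80% power: {implied_sd:.1f} s")
# Power of the same design against a smaller true effect (15 s), same SD
print(f"power against 15 s: {power_two_sample(15, implied_sd, 100):.2f}")
```

Because the authors report "more than 80%" power, the SD they assumed would be at or below the back-solved value. The second line shows how quickly power drops if the true effect is smaller than the design effect.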

They then found a 16.6 sec difference in favor of PCI. The 95% CI was –8·9 to 42·0, with P = 0.20.
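
The reported numbers can be cross-checked against each other. The sketch below recovers a standard error from the 95% interval and then computes two-sided P values against two reference points: no effect (0 s) and the design effect (30 s). It assumes a symmetric normal-approximation interval, which is an assumption of this sketch and not necessarily how the trial computed its interval.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def se_from_ci(lower, upper, level=0.95):
    """Standard error implied by a symmetric normal-approximation interval."""
    z = Z.inv_cdf(1 - (1 - level) / 2)
    return (upper - lower) / (2 * z)


def two_sided_p(estimate, se, null=0.0):
    """Two-sided P value for the hypothesis that the true effect equals `null`."""
    z = (estimate - null) / se
    return 2 * (1 - Z.cdf(abs(z)))


SE = se_from_ci(-8.9, 42.0)
p_vs_zero = two_sided_p(16.6, SE)               # tests "no effect"
p_vs_design = two_sided_p(16.6, SE, null=30.0)  # tests "effect = 30 s"

print(f"SE = {SE:.2f} s")
print(f"P against 0 s:  {p_vs_zero:.2f}")
print(f"P against 30 s: {p_vs_design:.2f}")
```

The P value against zero comes out close to the reported 0.20. The P value against 30 seconds is no smaller, so under this approximation the data are at least as compatible with the design effect as with no effect.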

So do you still think the data is insufficient to make conclusions? I mean… if we say a trial with this power and these confidence intervals is inconclusive, then, well, I am confused.

It’s even clearer after this explanation. The study was planned to accept one of two statistical alternatives: H0: no change, H1: 30 secs increase, with a specified confidence in either acceptance (the sample size was set to achieve these confidences). Now the actual test obtained a test statistic that was obviously in the acceptance region of H0. Thus, to keep the desired confidences, H0 had to be accepted and the decision makers should “act as if H0 was true”.


I noticed that my space key is defective. I have now added some missing spaces. Sorry for that.

Furthermore, I used poor nomenclature for the two competing hypotheses (H0 and H1), which might cause some confusion. I should rather have used HA and HB, because the logic behind a test based on a-priori fixed confidences (i.e. with fixed alpha and beta/power) is to “accept” one of the two alternatives rather than reject H0. Such tests are therefore also called “acceptance tests” or “A/B tests”.

Rejecting H0 is a different logic that applies when the significance of the observed data is tested, using this H0 as a reference point against which to judge the significance. In this logic, H0 is never accepted. If one fails to reject it, the data are inconclusive w.r.t. H0 and one is left with nothing (except the insight that the amount of data was not sufficient, for whatever reason). If one rejects H0, one can interpret whether the effect is “above” or “below” H0 (if H0 is that the effect is zero, rejecting it means that one dares to interpret the sign of the estimated effect). Thus, a significance test (trying to reject H0) warrants very little insight. It actually warrants only the least possible bit of insight, which by itself may not be at all sufficient to make an informed decision about what to do in practice (e.g. to actually favour a treatment). Acceptance tests were an attempt to overcome this lack of usefulness by selecting and fixing sensible values for alpha and beta a priori. This way, data can be used to accept one of the alternatives so that the expected net loss/win of the decision strategy is minimized/maximized (which is considered a feature of a rational strategy). So if we assume that HA and HB as well as alpha and beta were carefully chosen and sensible, then the decision must be to accept HA, because otherwise the decision strategy would not be rational.


This is an excellent discussion, and it makes me feel even more strongly than before that hypothesis testing has hurt science. Hypothesis testing is useful when judging the existence of some phenomenon, but the vast majority of medical research has an underlying (and often ignored) goal of estimation of the magnitude of effects, whether they be zero or far from zero. Bayesian posterior inference is consistent with this, and also provides simple-to-interpret probabilities that are relevant to the real clinical question: When the prior is smooth, one can compute the probability that a treatment effect is in the right direction, e.g. P(\theta > 0 | \textrm{data}). This does not involve a point null hypothesis. But going back to null hypothesis testing in the frequentist domain, the problem being discussed arose because of (1) dichotomous thinking in the original ORBITA paper and (2) dichotomous thinking in the European guidelines. Dichotomies of underlying continuous phenomena (here: magnitude of evidence) always create anomalies and confusion IMHO.

As a separate issue, the 0.8 power (some would argue that power should never be < 0.9 in the design stage) is relevant pre-study but once the data are in we should ignore it. On a slightly related issue, compatibility intervals do not need to respect the null hypothesis or the power calculation.


FYI- ORBITA is discussed further here:

For what it’s worth, @Sander and I have written a pair of papers (preprinted on arXiv) that captures many of the recommendations that we’ve discussed above.

  1. Semantic and Cognitive Tools to Aid Statistical Inference: Replace Confidence and Significance by Compatibility and Surprise

  2. To Aid Statistical Inference, Emphasize Unconditional Descriptions of Statistics

For a basic summary of topics covered:

Paper 1 is a discussion of:

  • P-value issues and their reconciliation with -log transformations, such as S-values (and for what it’s worth, Scientific American recently covered this topic in an article about P-values and statistical significance which involved interviews with Sander, along with several others)

  • Testing several alternative hypotheses of interest rather than just the null (and we also discuss the issue of multiple comparisons, or at least point to a more in-depth discussion of it)

  • Graphical functions/tables to present alternative results

Paper 2 is a discussion of

  • Why unconditional interpretations of statistics need to be emphasized (especially in applied fields where the assumptions usually taken to be true are completely nonsensical, e.g. no random mechanisms of any sort)

  • Why terminology change is needed for reform

  • How discussion needs to move on to decisions and their costs (loss functions)

We think many who have been part of this discussion, or who have followed it, will find these resources useful.
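
For readers who have not met the S-value mentioned under Paper 1, it is the negative base-2 logarithm of the P value, read as bits of information against the tested hypothesis. A minimal sketch follows; the function name is hypothetical and the P values are only illustrative inputs.

```python
from math import log2


def s_value(p):
    """Surprisal (in bits) of a P value: S = -log2(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -log2(p)


for p in (0.5, 0.20, 0.05, 0.005):
    print(f"p = {p:<5} -> S = {s_value(p):.2f} bits")
```

One way to read the result: S bits is about as surprising as seeing S heads in a row from a coin assumed to be fair, so p = 0.05 corresponds to a little over four heads in a row.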


Both papers are very valuable but I especially liked the second one, as I am not sure I’ve found anywhere else such a focused and rigorous discussion of what you very appropriately call “conditional vs unconditional” interpretations of statistics.


Part of the difficulty in appraising/interpreting studies for clinical context is the absence of a pre-defined minimal clinically important difference (MCID).

If any effect size (improvement in exercise time) greater than zero is sufficient to justify PCI, then even if PCI actually does nothing, no study will ever be able to completely “rule out” possible benefit.

At some point, we have to say “improvement less than 45/30/15 seconds” means PCI probably does not provide sufficient clinical benefit.

With Orbita, since the cardiology community never pre-defined the MCID for exercise time in PCI, they can just keep moving the goal post. Prior to Orbita, claims suggested 90-second benefits. The Orbita confidence interval (-8.9 to 42) suggests the benefit (if it exists) is not more than 42 seconds. So guidelines/others can change the goal post and say “30 second benefit can’t be excluded” and thus Orbita is not practice changing.

At the end of the day, if folks are asking for a larger RCT without predefining the MCID, then expect the goal posts to continue to change if the results don’t go their way.

As for communicating study results as a journalist, I would always start by asking “what is the MCID”? Is it one value? Do clinicians/patients/etc have different MCID? How do we interpret the confidence interval & results given our clinical expectations? And if the proponents of an intervention can’t decide on an MCID, or keep changing/lowering the MCID, be very suspicious.


Nice thoughts. Bayesian analysis would help, e.g. compute P(effect > MCID) and P(effect > MCID/2).
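
A minimal sketch of that calculation, using the ORBITA summary numbers quoted earlier in the thread. Everything beyond those numbers is an assumption of the sketch: a normal approximation to the likelihood, a standard error back-calculated from the 95% interval, a flat prior by default, and an MCID of 30 seconds borrowed from the design effect (the thread notes that no MCID was actually pre-defined). It is not the trial's own analysis.

```python
from math import inf, sqrt
from statistics import NormalDist

# Standard error implied by the reported 95% CI (-8.9 to 42.0)
SE = (42.0 - (-8.9)) / (2 * NormalDist().inv_cdf(0.975))
ESTIMATE = 16.6  # reported difference in exercise time, seconds
MCID = 30.0      # hypothetical MCID, taken from the design effect


def normal_posterior(estimate, se, prior_mean=0.0, prior_sd=inf):
    """Conjugate normal update; prior_sd=inf corresponds to a flat prior."""
    if prior_sd == inf:
        return NormalDist(estimate, se)
    w_prior, w_data = 1 / prior_sd**2, 1 / se**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return NormalDist(post_mean, sqrt(post_var))


def prob_greater(posterior, threshold):
    """Posterior probability that the effect exceeds `threshold`."""
    return 1 - posterior.cdf(threshold)


flat = normal_posterior(ESTIMATE, SE)
for label, cut in (("0", 0.0), ("MCID/2", MCID / 2), ("MCID", MCID)):
    print(f"P(effect > {label}) = {prob_greater(flat, cut):.2f}")
```

Under the flat prior these come out near 0.90, 0.55 and 0.15. Passing a finite `prior_sd` (a skeptical prior centred on zero) pulls all three probabilities down, which is where the real argument about the prior would take place.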

EDIT: I’ll edit away the parts that are overlapping with what has previously been discussed here later.

First some background for the creation of this topic:

The advent of new guidelines on how to ditch p-values and terms such as “significant”, “non-significant” and “trends for this and that” has created some challenges. This got me thinking because we have a bunch of data from a pilot study that we want to publish. As is common in pilot studies, estimates are unreliable and confidence intervals are usually quite broad (which is what we expect).

Anyway, we do have some findings that would have qualified as “trends” in ancient times, i.e. the CI barely crosses the null.

When we are writing up the results section, and reporting what we consider quite important (yet preliminary) results, I am struggling a bit with the wording. I could make this easy and write “a trend for an effect was observed for x on y (point estimate, CI)”, but I want to do this the new way. As I see it I have two alternatives:

  1. Write that there was a positive effect of X on Y, give the point estimate and the CI (which crosses the null) and write nothing more. Leave it to the reader to interpret.

  2. Give the point estimate and the CI, and describe in some way that the result is inconclusive because the CI crosses the null. However, I find that writing this is somewhat clunky:

The estimated mean changes in plasma cardiometabolic variables indicated improvements in concentrations of Y with an increase of 6.7 umol/L (95% CI: -3.3 – 16.6) compared to controls. The 95% CIs showed that the data were compatible with both reductions and increases in the marker.

This is kind of a hassle when we have many inconclusive results. We could be explicit about this in the methods section but I am not sure editors/reviewers that are not statistically oriented will accept this, nor that they will accept what I write in 1.

So my question to you is, do you have any good examples on reporting results according to the new guidelines from ASA, where terms like significance and trends have been abandoned? I know they were just recently put out, but assembling some of them here would be useful, I think.

PS: This being a pilot study is beside the point, this could just as well be a problem for large-scale studies.

I don’t have an example publication, but a possible solution:
Why don’t you just state that “p < 0.1 is considered significant”? Then all the results with p close to 0.05 will be considered significant and you can interpret the signs of the respective estimates. You can even add a further sentence explaining this (not so usual) choice, like “We decided to use a more liberal significance level than the conventional 0.05 level because the study is a pilot study where effects may be interpreted with lower confidence standards.” (or something similar).

I very strongly recommend not to talk about “trends” for estimates that did not reach the desired statistical significance. This is what the whole game is about: benchmark your data against a Null, and if this is “significant”, you may interpret the sign of (Estimate - Null); otherwise you won’t. If the Null is just zero, it means you may then interpret the sign of the estimate. This is just the trend. The significance test does not give you anything more than that! If the test fails to reach your desired level of significance, you deem your data as too inconclusive to even discuss the sign (not to mention any more specific effect size). If you say that a test is not significant, you say that your data is insufficient to make any (sufficiently confident) statement about the sign (=trend). If you then make statements about a trend, you wipe away all your efforts to give your conclusions some kind of confidence.

Instead, be more specific regarding your required confidence. If this does not need to be so high, select a more liberal level of significance.

Note: results are not inherently conclusive or inconclusive. Conclusions are made by you (not by the data). We like to draw conclusions with high confidence. We choose(!) a level of significance for tests to conclude trends(!) with a minimum desired confidence.

Thank you for your feedback Jochen. I 100% agree that we should avoid words like trends etc at all costs.

Regarding setting the p at 0.1; isn’t this just moving the goalposts? Although I can see how this is more favorable (to us) I can see the editor and referees objecting, and it seems awfully convenient to set it at 0.1 after we have done the analyses. The point, at least in my view, is to make it clear to the editor and reader that we consider our effects meaningful for later studies even if p > 0.05 and the CI crosses the null. We have to be able to make this point regardless of the alpha, which admittedly is harder than I thought and why I was looking for examples of good practice.

For now, at this very preliminary stage, we have added these sentences to the methods section after consulting amongst ourselves and with a statistician:

Because this is a pilot study, emphasis should be placed on confidence intervals, which generally indicate a wide range of possible intervention effects. We have reported p-values, but stress that due to the small sample size they should be interpreted with care and in combination with the confidence intervals. Several of the findings are considered meaningful for the design of a larger study even if p > 0.05.