Language for communicating frequentist results about treatment effects

It’s even clearer after this explanation. The study was planned to accept one of two statistical alternatives, H0: no change vs. H1: 30-second increase, with a specified confidence in either acceptance (the sample size was set to achieve these confidences). Now the actual test obtained a test statistic that was obviously in the acceptance region of H0. Thus, to keep the desired confidences, H0 had to be accepted and the decision makers should “act as if H0 was true”.
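For concreteness, here is a minimal sketch of the kind of fixed-alpha/fixed-power sample-size calculation such a design implies. Only the 30-second alternative comes from the discussion above; the standard deviation and the two-sided alpha below are hypothetical assumptions for illustration, not figures from the study.

```python
# Sketch of an acceptance-test sample-size calculation with fixed alpha and power.
# Only the 30-second alternative comes from the discussion above; sigma and alpha
# are hypothetical assumptions.
from scipy.stats import norm

alpha = 0.05   # assumed two-sided type I error
power = 0.80   # assumed power (1 - beta)
delta = 30.0   # H1: 30-second increase in exercise time
sigma = 75.0   # hypothetical SD of the change in exercise time (seconds)

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Per-group n for a two-sample comparison of means
n_per_group = 2 * (sigma * (z_alpha + z_beta) / delta) ** 2
print(round(n_per_group))  # about 98 per group under these assumptions
```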

EDIT:

I noticed that my space key is defective; I have now added the missing spaces. Sorry about that.

Further, I used bad nomenclature for the two competing hypotheses (H0 and H1), which might cause some confusion. I should rather have used HA and HB, because the logic behind a test based on a priori fixed confidences (i.e. with fixed alpha and beta/power) is to “accept” one of the two alternatives rather than to reject H0. Such tests are therefore also called “acceptance tests” or “A/B tests”.

Rejecting H0 follows a different logic, which applies when the significance of the observed data is tested, using this H0 as a reference point against which to judge the significance. In this logic, H0 is never accepted. If one fails to reject it, the data are inconclusive w.r.t. H0 and one is left with nothing (except the insight that the amount of data was not sufficient, for whatever reasons). If one rejects H0, one can interpret whether the effect is “above” or “below” H0 (if H0 is that the effect is zero, rejecting it means that one dares to interpret the sign of the estimated effect). Thus, a significance test (trying to reject H0) warrants really very little insight. It actually warrants only the least possible bit of insight, which by itself may not at all be sufficient to make an informed decision about what to do practically (e.g. to actually favour a treatment).

Acceptance tests try to overcome this lack of usefulness by selecting and fixing sensible values for alpha and beta a priori. This way, data can be used to accept one of the alternatives so that the expected net loss/win of the decision strategy is minimized/maximized (which is considered a feature of a rational strategy). So if we assume that HA and HB as well as alpha and beta were carefully chosen and sensible, then the decision must be to accept HA, because otherwise the decision strategy would not be rational.

2 Likes

This is an excellent discussion, and it makes me feel even more strongly than before that hypothesis testing has hurt science. Hypothesis testing is useful when judging the existence of some phenomenon, but the vast majority of medical research has an underlying (and often ignored) goal of estimating the magnitude of effects, whether they be zero or far from zero. Bayesian posterior inference is consistent with this, and also provides simple-to-interpret probabilities that are relevant to the real clinical question: when the prior is smooth, one can compute the probability that the treatment effect is in the right direction, e.g. P(\theta > 0 | \textrm{data}). This does not involve a point null hypothesis. But going back to null hypothesis testing in the frequentist domain, the problem being discussed arose because of (1) dichotomous thinking in the original ORBITA paper and (2) dichotomous thinking in the European guideline. Dichotomies of underlying continuous phenomena (here: magnitude of evidence) always create anomalies and confusion, IMHO.
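As a minimal illustration of the direct-probability statement P(\theta > 0 | \textrm{data}) mentioned above, here is a sketch under a conjugate normal-normal model; the prior and the data summary are made-up numbers, not results from ORBITA or any other study.

```python
# Direct-probability statement P(theta > 0 | data) under a normal-normal model.
# The prior and the data summary are illustrative assumptions only.
import math
from scipy.stats import norm

prior_mean, prior_sd = 0.0, 50.0   # smooth, weakly informative prior (seconds)
est, se = 15.0, 13.0               # hypothetical point estimate and standard error

# Conjugate update: posterior precision is the sum of prior and data precisions
post_prec = 1 / prior_sd**2 + 1 / se**2
post_mean = (prior_mean / prior_sd**2 + est / se**2) / post_prec
post_sd = math.sqrt(1 / post_prec)

p_positive = 1 - norm.cdf(0, loc=post_mean, scale=post_sd)
print(f"P(theta > 0 | data) = {p_positive:.2f}")  # about 0.87 with these made-up numbers
```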

As a separate issue, the 0.8 power (some would argue that power should never be < 0.9 in the design stage) is relevant pre-study but once the data are in we should ignore it. On a slightly related issue, compatibility intervals do not need to respect the null hypothesis or the power calculation.

8 Likes

FYI- ORBITA is discussed further here:

1 Like

For what it’s worth, @Sander and I have written a pair of papers (preprinted on arXiv) that capture many of the recommendations we’ve discussed above.

  1. Semantic and Cognitive Tools to Aid Statistical Inference: Replace Confidence and Significance by Compatibility and Surprise (https://arxiv.org/abs/1909.08579)

  2. To Aid Statistical Inference, Emphasize Unconditional Descriptions of Statistics (https://arxiv.org/abs/1909.08583)

For a basic summary of topics covered:

Paper 1 is a discussion of:

  • P-value issues and their reconciliation with -log transformations, such as S-values (a small numerical illustration follows this list); for what it’s worth, Scientific American recently covered this topic in an article about P-values and statistical significance which involved interviews with Sander, along with several others

  • Testing several alternative hypotheses of interest rather than just the null (we also discuss the issue of multiple comparisons, or at least point to a more in-depth discussion of it)

  • Graphical functions/tables to present alternative results
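As a small numerical illustration of the S-value idea mentioned above: the S-value is simply s = -log2(p), the refutational information against the test hypothesis expressed in bits. The p-values below are arbitrary examples.

```python
# S-value (surprisal): s = -log2(p). A p-value of 0.05 carries only about 4.3 bits
# of information against the test hypothesis, roughly like seeing 4 heads in a row
# from a fair coin. The p-values below are arbitrary examples.
import math

for p in (0.25, 0.05, 0.005):
    s = -math.log2(p)
    print(f"p = {p:<5} ->  S = {s:.1f} bits")
```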

Paper 2 is a discussion of:

  • Why unconditional interpretations of statistics need to be emphasized (especially in applied fields where the assumptions usually taken to be true are completely nonsensical, with no random mechanisms of any sort)

  • Why terminology change is needed for reform

  • How discussion needs to move on to decisions and their costs (loss functions)

We think many who have been a part of this discussion, or who have followed it, will find these resources useful.

6 Likes

Both papers are very valuable but I especially liked the second one, as I am not sure I’ve found anywhere else such a focused and rigorous discussion of what you very appropriately call “conditional vs unconditional” interpretations of statistics.

3 Likes

Part of the difficulty in appraising/interpreting studies for clinical context is the absence of a pre-defined minimal clinically important difference (MCID).

If any effect size (improvement in exercise time) greater than zero is sufficient to justify PCI, then even if PCI actually does nothing, no study will ever be able to completely “rule out” possible benefit.

At some point, we have to say “improvement less than 45/30/15 seconds” means PCI probably does not provide sufficient clinical benefit.

With ORBITA, since the cardiology community never pre-defined the MCID for exercise time in PCI, they can just keep moving the goal posts. Prior to ORBITA, claims suggested 90-second benefits. The ORBITA confidence interval (-8.9 to 42 seconds) suggests the benefit (if it exists) is not more than 42 seconds. So guidelines/others can change the goal posts and say “a 30-second benefit can’t be excluded” and thus ORBITA is not practice changing.

At the end of the day, if folks are asking for a larger RCT without predefining the MCID, then expect the goal posts to continue to change if the results don’t go their way.

As for communicating study results as a journalist, I would always start by asking “what is the MCID?” Is it one value? Do clinicians/patients/etc. have different MCIDs? How do we interpret the confidence interval and results given our clinical expectations? And if the proponents of an intervention can’t decide on an MCID, or keep changing/lowering the MCID, be very suspicious.

6 Likes

Nice thoughts. Bayesian analysis would help, e.g. compute P(effect > MCID) and P(effect > MCID/2).
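A minimal sketch of how those two probabilities can be read off posterior draws; the draws and the 30-second MCID below are simulated/hypothetical, not from a real model fit.

```python
# Posterior probabilities of exceeding an MCID, computed from posterior draws.
# The draws are simulated here for illustration; in practice they would come from
# the fitted Bayesian model (e.g. MCMC samples of the treatment effect).
import numpy as np

rng = np.random.default_rng(1)
effect_draws = rng.normal(loc=16.0, scale=13.0, size=20_000)  # hypothetical draws (seconds)

mcid = 30.0  # hypothetical minimal clinically important difference (seconds)
print(f"P(effect > MCID)   = {np.mean(effect_draws > mcid):.2f}")
print(f"P(effect > MCID/2) = {np.mean(effect_draws > mcid / 2):.2f}")
```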

1 Like

EDIT: I’ll later edit away the parts that overlap with what has already been discussed here.

First some background for the creation of this topic:

The advent of new guidelines on ditching p-values and terms such as “significant”, “non-significant” and “trends for this and that” has created some challenges. This got me thinking because we have a bunch of data from a pilot study that we want to publish. As is common in pilot studies, estimates are unreliable and confidence intervals are usually quite broad (which is what we expect).

Anyway, we do have some findings that would have qualified as “trends” in ancient times, i.e. the CI barely crosses the null.

When writing up the results section and reporting what we consider quite important (yet preliminary) results, I am struggling a bit with the wording. I could make this easy and write “a trend for an effect was observed for x on y (point estimate, CI)”, but I want to do this the new way. As I see it, I have two alternatives:

  1. Write that there was a positive effect of X on Y, give the point estimate and the CI (which crosses the null), and write nothing more, leaving the interpretation to the reader.

  2. Give the point estimate and the CI, and describe in some way that the result is inconclusive because the CI crosses the null. However, I find that writing this is somewhat clunky:

The estimated mean changes in plasma cardiometabolic variables indicated improvements in concentrations of Y, with an increase of 6.7 µmol/L (95% CI: -3.3 to 16.6) compared to controls. The 95% CI showed that the data were compatible with both reductions and increases in the marker.

This is kind of a hassle when we have many inconclusive results. We could be explicit about this in the methods section, but I am not sure editors/reviewers who are not statistically oriented will accept this, nor that they will accept what I wrote in alternative 1.

So my question to you is: do you have any good examples of reporting results according to the new guidelines from the ASA, where terms like significance and trends have been abandoned? I know they were only recently put out, but assembling some examples here would be useful, I think.

PS: This being a pilot study is beside the point, this could just as well be a problem for large-scale studies.

I don’t have an example publication, but a possible solution:
Why don’t you just state that “p < 0.1 is considered significant”? Then all the results with p close to 0.05 will be considered significant and you can interpret the signs of the respective estimates. You can even add a further sentence explaining this (not so usual) choice, like “We decided to use a more liberal significance level than the conventional 0.05 because this is a pilot study where effects may be interpreted with lower confidence standards” (or something similar).

I very strongly recommend not talking about “trends” for estimates that did not reach the desired statistical significance. This is what the whole game is about: benchmark your data against a Null, and if the result is “significant”, you may interpret the sign of (Estimate - Null); otherwise you may not. If the Null is just zero, this means you may then interpret the sign of the estimate. This is just the trend; the significance test does not give you anything more than that! If the test fails to reach your desired level of significance, you deem your data too inconclusive to even discuss the sign (not to mention any more specific effect size). If you say that a test is not significant, you say that your data is insufficient to make any (sufficiently confident) statement about the sign (= trend). If you then make statements about a trend anyway, you wipe away all your efforts to give your conclusions some kind of confidence.

Instead, be more specific regarding your required confidence. If this does not need to be so high, select a more liberal level of significance.

Note: results are not inherently conclusive or inconclusive. Conclusions are made by you (not by the data). We like to draw conclusions with high confidence. We choose(!) a level of significance for tests to conclude trends(!) with a minimum desired confidence.

1 Like

Thank you for your feedback Jochen. I 100% agree that we should avoid words like trends etc at all costs.

Regarding setting the significance level at 0.1: isn’t this just moving the goalposts? Although I can see how this is more favorable (to us), I can see the editor and referees objecting, and it seems awfully convenient to set it at 0.1 after we have done the analyses. The point, at least in my view, is to make it clear to the editor and reader that we consider our effects meaningful for later studies even if p > 0.05 and the CI crosses the null. We have to be able to make this point regardless of the alpha, which admittedly is harder than I thought, and is why I was looking for examples of good practice.

For now, at this very preliminary stage, we have added these sentences to the methods section after consulting amongst ourselves and with a statistician:

Because this is a pilot study, emphasis should be placed on confidence intervals which generally indicate a wide range of possible intervention effects. We have reported p-values, but stress that due to the small sample size they should be interpreted with care and in combination with the confidence intervals. Several of the findings are considered meaningful for the design of a larger study even if p > 0.05.

1 Like

I like everything you said except for using the word significant. I think it’s become a meaningless word, and it will be an arbitrary designation because the cutoff, whatever is used, is arbitrary. Why not use 0.071828?

I used this word because it is used just like that in almost all publications I know. A better term might be the more specific “statistically significant”. I don’t think that this is a meaningless word. It means that interpretations of “trends” are done while keeping a specified level of confidence.

The actual level is arbitrary - it’s a decision made by the researcher. We know that 0.05 is a mere convention without any reasonable justification, and that using the same level for all interpretations is surely inadequate (if not harmful) - a fact that was already stressed by Fisher. Following Fisher, it makes good sense not to make all interpretations at the same level, so in one experiment p = 0.024 might not be considered significant (meaning: one does not like to interpret the “trend”), while in another experiment p = 0.12 might already be sufficient to interpret the “trend”. Thus, different p-values may be considered (statistically) significant, depending on the context. Although I think that this is perfectly ok, I think that not many co-authors, reviewers and editors would agree.

Yes, you lower the level of confidence used for interpretations of your data. This is what you want. By interpreting “trends” where p > 0.05, you just do that, except that you don’t control the level of confidence anymore.

I like the sentence as you wrote it. Sounds very good. Just take care that you don’t interpret “trends” in the manuscript where the CI is predominantly on one side of the Null. Better to stress that one of the limits extends to (biologically, practically) relevant figures, so your data is compatible with relevant (positive or negative) effects. This does not mean that you interpret that there is such a trend (a positive or negative effect) - it just means that the data is worth further investigation. I think this is pretty much exactly the point you want to make?

Ah, now I see what you mean, and this makes sense.

However, I tend (doh) to agree with @f2harrell regarding arbitrariness and would avoid sticking to a cut-off. Glad you liked the formulation; we will probably go for something like that or similar, and I agree with the rest of your points. Cheers!

EDIT: I’ll add to this the following quote from the CONSORT extension to Pilot and Feasibility trials:

Typically, any estimates of effect using participant outcomes as they are likely to be measured in the future definitive RCT would be reported as estimates with 95% confidence intervals without P values—because pilot trials are not powered for testing hypotheses about effectiveness.

A friendly disagreement. I believe that both significant and statistically significant are meaningless and harmful to science.

1 Like

I really struggle with MCIDs because it just feels like another way to dichotomize results, and nothing I have seen does a good job of incorporating uncertainty in the MCID itself. I suppose you could do this in a Bayesian way by placing a distribution on the MCID value as well, calculating the probability of exceeding the MCID at each iteration using a different value of the MCID?

I share much of your concern. The fundamental problem with the MCID is treating it as a single number when in fact different patients and different physicians will have different MCIDs. That’s why presenting a Bayesian analysis with all possible MCIDs (a smooth plot of the posterior probability of being beyond each MCID in terms of efficacy) is preferred.
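A sketch of what such a display could look like, again using simulated posterior draws in place of a real model fit:

```python
# Posterior probability of efficacy beyond each candidate MCID, as a smooth curve.
# The posterior draws are simulated purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
effect_draws = rng.normal(loc=16.0, scale=13.0, size=20_000)  # hypothetical draws (seconds)

mcid_grid = np.linspace(0, 60, 121)                  # candidate MCIDs (seconds)
p_beyond = [(effect_draws > m).mean() for m in mcid_grid]

plt.plot(mcid_grid, p_beyond)
plt.xlabel("Candidate MCID (seconds)")
plt.ylabel("Posterior P(effect > MCID)")
plt.title("Probability of efficacy beyond each possible MCID (illustrative)")
plt.show()
```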

1 Like

I think we agree in practice but not in theory. Without any doubt, the term “(statistically) significant” is misunderstood (and as such meaningless at best and harmful at worst) in practice. People (wrongly) think that significance is a proof of the existence of an effect, if not of a relevant effect. That has led to many bad conclusions and decisions, even to absurd procedures (like formal tests to justify assumptions, or stepwise selection procedures). Seeing what is done in practice leads me to absolutely agree with your point of view that “significance” is meaningless and harmful.

However :slight_smile:, the theoretical concept still sounds sensible to me. The p-value measures the “unexpectedness, under H0 (and the statistical model), of a test statistic more extreme than the one calculated from the observed data”. This lengthy description is condensed in the word “significance”, and in the common way of using language it is ok and understandable, I think, to say that the data “is significant” if its significance is so high that it seems obvious that data + model + H0 don’t go together. Iff (i) the data were collected with the greatest care and (ii) the model is described correctly, then and only then does a high significance indicate that the information in the data considerably discredits H0 (and the information is sufficient to interpret “at which side of H0” the data lies). The practical point is, for sure, that it’s impossible to have a “correct model” (it’s a model!). So, strictly following the first part of George Box’s quote about models (“All models are wrong”), the whole concept is wrong, and along this line one may conclude that it’s meaningless and even harmful. But the second part (“but some are useful”) supports my view: if the model is not too bad, and there is not too much data, the significance (p-value) is a useful measure for judging the information in the data regarding H0.

I am curious how you think about this (in other words: I am happy to learn where I am wrong).

1 Like

I like this idea. It’s very similar to cost-effectiveness acceptability curves (CEACs) used in health economics, where the probability of being cost-effective is plotted as a smooth function of willingness to pay per unit of outcome (typically quality-adjusted life-years). I have a project coming up in the next 6 months or so for which this will be useful; I’ll let you know how it goes!
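For readers who have not met CEACs, a minimal sketch of how one is built from paired samples of incremental cost and incremental effect; the samples and the willingness-to-pay range are invented for illustration.

```python
# Cost-effectiveness acceptability curve (CEAC): probability that the intervention
# is cost-effective, plotted against willingness to pay per QALY. The incremental
# cost and QALY samples are simulated here for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
d_cost = rng.normal(2_000, 800, size=10_000)   # hypothetical incremental cost
d_qaly = rng.normal(0.05, 0.03, size=10_000)   # hypothetical incremental QALYs

wtp_grid = np.linspace(0, 100_000, 201)        # willingness to pay per QALY
# Net monetary benefit: NMB = wtp * dQALY - dCost; cost-effective when NMB > 0
prob_ce = [np.mean(wtp * d_qaly - d_cost > 0) for wtp in wtp_grid]

plt.plot(wtp_grid, prob_ce)
plt.xlabel("Willingness to pay per QALY")
plt.ylabel("Probability cost-effective")
plt.title("Cost-effectiveness acceptability curve (illustrative)")
plt.show()
```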

1 Like

Thanks for the follow-up. I just think that we can cut through a lot of mess by not trying to do inference and instead making statements such as “Drug B probably (0.96) has a better blood pressure response than drug A.”

1 Like