Language for communicating frequentist results about treatment effects

Except for misguided variable selection :frowning:


Very nice and useful!
Thanks


This proposal to transform p values into logs is nearly identical to Fisher’s procedure to combine significance tests in a meta-analysis.

(The Wikipedia link uses natural logs, but Hedges and Olkin express it in logs without clarifying whether or not the base is 2.)

The team of Elena Kulinskaya, Stephan Morgenthaler, and Robert G. Staudte published a book on meta-analysis in 2008 that takes the approach of defining “evidence” as the inverse normal transform of the p value, \Phi^{-1}(1-p). Their techniques look very similar to a weighted Stouffer combined test, which is an average of Z scores:

T_{1..k} = \frac{\sqrt{n_1}\,S_1 + \cdots + \sqrt{n_k}\,S_k}{\sqrt{n_1 + \cdots + n_k}}, where S_i is a variance-stabilizing transform of the statistic used to calculate the p value in study i.

Instead of using the statistic to “test” hypotheses, they point out that this transform can be used to quantify “evidence” for the alternative (without specifying any particular effect). In this framework, evidence is always a random quantity, reported with a standard error of 1 (i.e., as T \pm 1).

‘Weak’ evidence for the (unspecified) alternative is T = \pm 1.64, corresponding to p = 0.05;
‘Moderate’ evidence is T = \pm 3.3, corresponding to p \approx 0.0005;
‘Strong’ evidence is T = \pm 5, corresponding to p \approx 2.9 \times 10^{-7}.
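
For concreteness, here is a minimal sketch (in Python, using the one-sided relation T = \Phi^{-1}(1-p) described above; this is not code from the book) that maps between these cut-offs and their p values.

```python
# Minimal sketch of the inverse-normal "evidence" transform T = Phi^{-1}(1 - p)
# and the weak/moderate/strong cut-offs quoted above.
from scipy.stats import norm

def evidence(p):
    """One-sided p value -> evidence on the normal scale."""
    return norm.ppf(1 - p)

def p_value(T):
    """Evidence back to a one-sided p value."""
    return norm.sf(T)          # sf(T) = 1 - cdf(T)

for name, T in [("weak", 1.645), ("moderate", 3.3), ("strong", 5.0)]:
    print(f"{name:8s} T = {T:.2f}  <->  p = {p_value(T):.2g}")
# weak     T = 1.64  <->  p = 0.05
# moderate T = 3.30  <->  p = 0.00048
# strong   T = 5.00  <->  p = 2.9e-07
```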

In favor of this transform, I notice that some common p value errors can be avoided:

  1. It discourages the “absence of evidence is evidence of absence” fallacy.
  2. It can demonstrate a Simpson’s-paradox-like behavior of p values (i.e. a number of ‘not significant’ results can in aggregate provide evidence for an effect, and vice versa); see the sketch after this list.
  3. It is useful in meta-analysis.
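
As an illustration of point 2, here is a small sketch of the \sqrt{n}-weighted Stouffer combination defined above; the per-study p values and sample sizes are hypothetical, chosen only so that no single study is “significant” on its own.

```python
# Hypothetical illustration: five studies, none with a one-sided p below 0.05,
# combined with the sqrt(n)-weighted Stouffer formula
#   T = sum(sqrt(n_i) * T_i) / sqrt(sum(n_i)).
from math import sqrt
from scipy.stats import norm

p_values = [0.06, 0.07, 0.055, 0.08, 0.06]   # hypothetical one-sided p values
sizes    = [40, 55, 35, 60, 50]              # hypothetical sample sizes

T_i = [norm.ppf(1 - p) for p in p_values]    # per-study evidence
T_combined = sum(sqrt(n) * t for n, t in zip(sizes, T_i)) / sqrt(sum(sizes))

print(f"combined evidence T = {T_combined:.2f}, "
      f"combined one-sided p = {norm.sf(T_combined):.4f}")
# combined evidence T = 3.36, combined one-sided p = 0.0004
```

With these made-up inputs the aggregate lands just above the ‘moderate’ cut-off, even though no individual study crosses the conventional threshold.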

While it doesn’t solve all of the problems with P values, it seems to alleviate some of the most troubling ones.

My confusion stems from the caution expressed in Hedges and Olkin’s Statistical Methods for Meta-Analysis, where they state:

…an investigator may incorrectly conclude that because H_{0} is rejected, the treatment effects are greater than zero … Alternatively, an investigator might incorrectly conclude that the treatment showed consistent effect across studies…

It would seem that the authors above are making interpretations that are warned against by a widely cited text. Or are they both correct, in their own way?

Google preview of book:
https://books.google.com/books?id=uHG7jZ8ZAJ8C

Z. Chen, “Is the weighted z-test the best method for combining probabilities from independent tests?” Journal of Evolutionary Biology (short communication):

Theory has proven that no method is uniformly most powerful (Birnbaum, 1954). However, it is possible that under a certain situation, one particular method may outperform others. For example, through simulation Whitlock (Whitlock, 2005) has shown that when all the studies have the same effect sizes, the weighted z-test outperforms both regular z-test and Fisher method. In this paper, we use simulation to show that under the same situation, the generalized Fisher method is more powerful than the weighted z-test.

Long story short: the jury is still out, and statisticians are still rising to the challenge of rainy afternoons by running simulation studies.


Thanks. I do know that each of these p value combination methods has a scenario where it works best.

I was looking for a transform that could be easily taught that prevents the lengthy list of cognitive errors statisticians have been complaining about for decades.

The Kulinskaya, Morgenthaler, and Staudte book persuaded me that the modified z score test, with weights based on \sqrt{n} from each study, is a very reasonable alternative scale that will prevent the “absence of evidence” fallacy, and can also show how a number of “insignificant” studies can, in aggregate, demonstrate an effect.

I went back and used the original Stouffer procedure on the P values in the Amrhein, Greenland, and McShane article in Nature. Using the evidential interpretation, there was moderate evidence of a positive association even without using the sample sizes.

Now I’d like to get back to single study interpretations, the original point of this discussion topic. Thanks.

A variant is to conclude that, because the study did not achieve its primary aim, there was “no effect”. For example, the GOLD study. What do you think about the interpretation of the result in the abstract? https://www.sciencedirect.com/science/article/pii/S1470204517306824
Is it possible that in these large trials the language is a bad adaptation of the jargon of the regulatory agencies?

“The GOLD study did not meet its primary objective of showing a significant improvement in overall survival with olaparib in the overall or ATM-negative population of Asian patients with advanced gastric cancer.” is not bad, but “significant” doesn’t help.

I don’t know whether the conclusion would have been the opposite with just two more patients; look at the result:
Overall survival did not differ between treatment groups in the overall patient population (median overall survival 8·8 months [95% CI 7·4–9·6] in the olaparib group vs 6·9 months [6·3–7·9] in the placebo group; HR 0·79 [97·5% CI 0·63–1·00]; p=0·026) or in the ATM-negative population (12·0 months [7·8–18·1] vs 10·0 months [6·4–13·3]; 0·73 [0·40–1·34]; p=0·25).

I’ve been following this discussion & it reminded me that 3 years ago I wrote a draft of a paper extending the concept of “fragility” to the derivation of diagnostic algorithms based on statistical metric thresholds. I wasn’t convinced of the value (or indeed correct interpretation) of what I’d done, so I’ve not tried to polish or publish it. Nevertheless, the concept of just how susceptible an algorithm is to the vagaries of chance recruitment of a small number of patients is something that is worth thinking about.
I’ve put the draft I wrote as a pre-print on ResearchGate: https://www.researchgate.net/publication/332767057_The_application_of_fragility_indices_to_diagnostic_studies

That’s horrible!
I guess there is a discussion to be had about whether the survival benefit is worth it (it’s small, drug probably expensive, side effects) but just dismissing this as “no difference” seems to me not to be serving patients well.

Unfortunately, statisticians have allowed and even promoted the use of the p-value threshold as the exclusive determinant of intervention effectiveness. People that rely on these determinations are rightly concerned about results near the threshold, where the determination might have swung in the opposite direction if only one or two study participants had experienced a different outcome. That we now have the “fragility index” is a consequence of the unfortunate dominance of this approach to inference. I agree with @f2harrell that it’s a band-aid.
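
To make the “one or two participants” point concrete, here is a rough sketch of the usual fragility-index calculation (the 2x2 counts are hypothetical, and this is not the diagnostic-algorithm extension from the preprint above): flip outcomes, one patient at a time, in the arm that started with fewer events until the Fisher exact p value crosses 0.05.

```python
# Rough sketch of a fragility-index calculation for a hypothetical 2x2 trial result.
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Minimum number of non-event -> event flips, in the arm with fewer
    events, needed to push the two-sided Fisher exact p value to >= alpha."""
    def p(e_a, e_b):
        return fisher_exact([[e_a, n_a - e_a], [e_b, n_b - e_b]])[1]

    if p(events_a, events_b) >= alpha:
        return 0                       # not "significant" to begin with
    flip_a = events_a < events_b       # flip outcomes in the smaller-event arm
    e_a, e_b, flips = events_a, events_b, 0
    while p(e_a, e_b) < alpha:
        if flip_a:
            e_a += 1
        else:
            e_b += 1
        flips += 1
    return flips

# hypothetical two-arm trial: 12/100 events vs 25/100 events
print(fragility_index(12, 100, 25, 100))   # fragility index for this made-up result
```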

I wasn’t sure where to put a reference to this interesting paper by Michael Lew on the relationship between (one-tailed) P values and likelihood. It discusses some of the complexities of when p values have to be adjusted (i.e., when used in the Neyman-Pearson decision framework, but not in Fisher’s). This seemed to be the best thread.

I learned about it in this thread on Deborah Mayo’s post criticizing Royall’s philosophy of statistical evidence:


Anyone who attempts to counter Richard Royall must come well armed …


I found Royall’s evidential perspective via likelihood very helpful. Michael Lew does as well.

Mayo’s critique is that error control is missing from the analysis, and we need some design information for that. But it seems that error control is a function of the experimental design (a before-data point of view), where we fix the effect at a specific value. It does not seem relevant after the data have been collected, as long as we assume the experiment was minimally informative (1-\beta > \alpha).

A post-data view is interested in how the observed data support a range of effects, above and below the value used in the design phase. This would seem to be related to your post on Bayesian power.

Perhaps it would be better to look at inference as precise estimation, rather than Mayo’s concept of “severe tests.”

I just thought Lew’s paper was a useful link between what people currently use and a better measure (likelihoods), and I hoped to get some expert input on that.

Exactly, and remember that she (and many others) uses the word “error” in a problematic way IMHO. She is speaking of the probability of making an assertion of an effect when the effect is exactly zero. In my view a true error probability is P(effect <= 0 | data) when you act from the data as if effect > 0.


Hi All

As per the advice of Prof. Harrell, I will repost this question here:

As a physician writer, I inquire as to the best way to describe a numerically positive but nonsignificant result.

In a column about European guideline writers’ dismissal of the ORBITA trial (a sham-controlled RCT powered to detect a 30-second improvement in exercise time with PCI), I described the results this way:

"The investigators went to great lengths to blind patients and treating clinicians. In the end, PCI improved exercise time, but the difference did not reach statistical significance."

But a commenter opposed this wording:

“Regarding the ORBITA trial you write “In the end, PCI improved exercise time, but the difference did not reach statistical significance.” Properly, if the difference did not reach statistical significance then the null hypothesis was not disproved and it is therefore illegitimate to deduce the positive result and contradictory (and misleading) to write it this way. That’s the point and purpose of the statistical analysis. Wish everybody would respect this niggling but critical distinction.”

I’ve come to learn that writing ORBITA “was negative” is incorrect. I also think writing “there was no difference” is not correct either.

What say the experts?


You would say that the data were not sufficiently conclusive to interpret the direction of the effect (e.g., a mean difference).

The study was “negative” in the sense that not enough (sufficiently clear) data were gathered to interpret even just the direction of the effect with sufficient confidence. This is not at all the same as stating that “there is no effect”. In fact, there might be a very relevant effect, but the lack of (sufficiently clear) data simply does not allow a confident conclusion.

It may be helpful to have a closer look at the confidence interval. If it extends into regions that are obviously (practically) relevant, the data can still be considered compatible with such relevant effects (which implies no more than that further studies might be worthwhile to get a better idea). If it does not extend into relevant regions, the data are considered not compatible with relevant effects, so there is some indication that, whatever sign the effect may have, it is unlikely to be relevant.

An entirely different approach is to attempt an estimation of the effect, that is, to frame a probability statement about the effect given our understanding of the process and the data gathered in the study. This is called a “Bayesian analysis”, where the model-based likelihood of the data is used to modify a prior probability distribution over the effect of interest. The result is the posterior distribution, reflecting your opinion and your uncertainty about the effect (rather than a simple yes/no answer or a decision about the sign of the effect).
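
As a minimal sketch of that idea (the prior, the estimate, and its standard error below are hypothetical, and the normal-normal approximation is a simplifying assumption, not the only way to do a Bayesian analysis):

```python
# Minimal sketch of a normal-normal Bayesian update for a mean difference.
# All numbers are hypothetical; a real analysis would model the raw data.
from scipy.stats import norm

prior_mean, prior_sd = 0.0, 50.0   # weakly informative prior for the difference
est, se = 20.0, 13.0               # hypothetical estimate and standard error

# conjugate normal-normal update (work with precisions = 1/variance)
post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
post_mean = post_var * (prior_mean / prior_sd**2 + est / se**2)
post_sd = post_var**0.5

print(f"posterior mean = {post_mean:.1f}, posterior sd = {post_sd:.1f}")
print(f"P(difference > 0 | data) = {norm.sf(0, loc=post_mean, scale=post_sd):.2f}")
```

The output is a full posterior distribution for the difference, from which direct probability statements (e.g. that the difference is positive, or larger than some clinically relevant value) can be read off.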


A previous trial of unblinded PCI improved exercise time by 96 seconds. Anti-anginal drugs improve exercise time by about 45 seconds. ORBITA was powered to detect a conservative 30-second increase. The authors write: “We calculated that, from the point of randomisation, a sample size of 100 patients per group had more than 80% power to detect a between-group difference in the increment of exercise duration of 30 seconds, at the 5% significance level, using the two-sample t test of the difference between groups.”

They then found a 16.6-second difference in favor of PCI, with a 95% CI of -8.9 to 42.0 seconds and p = 0.20.
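
A quick back-calculation from those summary numbers (a sketch, treating the t-based interval as approximately normal) shows how the point estimate, the interval, and the p value fit together, and that the interval still reaches well past the 30-second design effect:

```python
# Back-calculation from the reported ORBITA summary; the normal approximation
# to the t-based interval is an assumption made only for this sketch.
from scipy.stats import norm

est, lo, hi = 16.6, -8.9, 42.0            # seconds; point estimate and 95% CI

se = (hi - lo) / (2 * 1.96)               # implied standard error, ~13 s
z = est / se
p_two_sided = 2 * norm.sf(abs(z))

print(f"implied SE = {se:.1f} s, z = {z:.2f}, two-sided p = {p_two_sided:.2f}")
# implied SE = 13.0 s, z = 1.28, two-sided p = 0.20
```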

So do you still think the data are insufficient to draw conclusions? I mean… if we say a trial with this power and these confidence intervals is inconclusive, then, well, I am confused.


It’s even clearer after this explanation. The study was planned to accept one of two statistical alternatives, H0: no change versus H1: a 30-second increase, with a specified confidence in either acceptance (the sample size was set to achieve these confidences). The actual test obtained a test statistic that was obviously in the acceptance region of H0. Thus, to keep the desired confidences, H0 had to be accepted and the decision makers should “act as if H0 were true”.

EDIT:

I noticed that my space key is defective. I have now added some missing spaces. Sorry for that.

Further, I used a bad nomenclature for the two competing hypotheses (H0 and H1), which might cause some confusion. I should have used HA and HB instead, because the logic behind a test based on a priori fixed confidences (i.e., with fixed alpha and beta/power) is to “accept” one of the two alternatives rather than to reject H0. Such tests are therefore also called “acceptance tests” or “A/B tests”.

Rejecting H0 is a different logic that applies when the significance of the observed data is tested, using H0 as a reference point against which to judge that significance. In this logic, H0 is never accepted. If one fails to reject it, the data are inconclusive with respect to H0 and one is left with nothing (except the insight that the amount of data was not sufficient, for whatever reasons). If one rejects H0, one can interpret whether the effect is “above” or “below” H0 (if H0 is that the effect is zero, rejecting it means that one dares to interpret the sign of the estimated effect). Thus, a significance test (trying to reject H0) warrants really very little insight. It actually warrants only the least possible bit of insight, which by itself may not at all be sufficient to make an informed decision about what to do practically (e.g., to actually favour a treatment).

Acceptance tests try to overcome this lack of usefulness by selecting and fixing sensible values for alpha and beta a priori. This way, data can be used to accept one of the alternatives so that the expected net loss/gain of the decision strategy is minimized/maximized (which is considered a feature of a rational strategy). So if we assume that HA and HB as well as alpha and beta were carefully chosen and sensible, then the decision must be to accept HA, because otherwise the decision strategy would not be rational.
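
As a toy illustration of this accept/reject logic, using the numbers already quoted in this thread (the SE is back-calculated from the reported 95% CI, and the normal approximation is a simplifying assumption): with HA = “no change” and HB = “30-second increase”, and alpha fixed at 5% two-sided as in the design, the observed 16.6-second difference falls inside the acceptance region of HA.

```python
# Toy illustration of the acceptance-test logic described above.
# SE back-calculated from the reported 95% CI (normal approximation);
# HA = "no change", HB = "30 s increase" as in the discussion above.
from scipy.stats import norm

se = (42.0 - (-8.9)) / (2 * 1.96)          # implied SE of the difference, ~13 s
alpha = 0.05                               # two-sided, as in the trial design
critical = norm.ppf(1 - alpha / 2) * se    # boundary of the acceptance region for HA

observed = 16.6
decision = "accept HB (30 s increase)" if observed > critical else "accept HA (no change)"
print(f"critical value ~ {critical:.1f} s, observed = {observed} s -> {decision}")
# critical value ~ 25.4 s, observed = 16.6 s -> accept HA (no change)
```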
