While reviewing the specific literature in my office, I came across the following article: https://www.ncbi.nlm.nih.gov/pubmed/28836150

This is a meta-analysis of the risk of intracranial bleeding in patients with brain tumors receiving anticoagulant therapy, an event that can be devastating for patients. The authors conclude that there is no increased risk of bleeding, with an odds ratio of 1.37 (CI 0.86-2.17, p = 0.18). This is a serious safety issue, as these bleeds can be fatal, and the clinical scenario is common. This type of fallacy has been discussed here several times before, but what strikes me in this case is that it is a matter of patient safety, not efficacy, as I have seen on other occasions. Any comments or insights on this recurrent topic?

This is a good example to bring up, and you might merge this topic with the one on interpretation of frequentist results. For safety there is an additional issue: the choice of primary metric, between a relative effect (as you quoted) and an absolute effect such as the absolute risk increase. I lean toward the latter for safety reporting; it is closer to what matters to patients.

That is a great example of the absence of evidence fallacy!

My question to the scholars here (if this is better in a new thread, I will re-post):

Can we prevent this error by teaching p value transforms? I do not have the exact citations in front of me, but I recall at least 2 papers by Sander Greenland advocating the use of p value transformations for thinking about the evidence from a frequentist experiment. I thought it was a very good idea, but did not pursue it at the time.

Citation:

Sander Greenland, Invited Commentary: The Need for Cognitive Science in Methodology, *American Journal of Epidemiology*, Volume 186, Issue 6, 15 September 2017, Pages 639–645, https://doi.org/10.1093/aje/kwx259

My preference would be to teach the inverse normal transform \Phi^{-1}(1-p) as a frequentist evidence statistic. The mathematical maturity needed is minimal, and it can be done in a spreadsheet.

I can see it as a way to embrace, extend, and finally extinguish the hold that NHST has on the cognition of researchers.

In terms of the p value in the abstract (two-tailed p = 0.18, so a one-sided tail of 0.09), we would have an evidence metric of \Phi^{-1}(0.91) = 1.34 \pm 1, giving what I would call “marginal” to “weak” evidence of a positive effect.
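The transform really is spreadsheet-simple. Here is a minimal sketch in Python using only the standard library (`statistics.NormalDist`, Python 3.8+); the function name `evidence` is my own label, not from any of the cited sources:

```python
from statistics import NormalDist

def evidence(p, two_tailed=True):
    """Map a p value to the evidence scale z = Phi^{-1}(1 - p).
    A two-tailed p is halved first to recover the one-sided tail area."""
    if two_tailed:
        p = p / 2
    return NormalDist().inv_cdf(1 - p)

# The meta-analysis abstract's two-tailed p = 0.18:
print(round(evidence(0.18), 2))  # 1.34
```

In a spreadsheet the same quantity is one cell: the standard normal inverse of 1 minus the (one-sided) p value.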

While not as elegant as likelihood and Bayesian evidence measures, these transforms have a lot of practical merit in preventing the cognitive errors so frequently discussed. They also make a good compare-and-contrast with the benefits of specifying 1 model (as in Fisher’s method) vs. 2 models (as in likelihood/Bayesian frameworks).

Much of the theory has been worked out by Kulinskaya, Morgenthaler, and Staudte in their text

Meta Analysis: A Guide to Calibrating and Combining Statistical Evidence

This metric also plays an important role in adaptive trial designs, as described by Mark Chang in

Introductory Adaptive Trial Designs: A Practical Guide with R

About 7 months ago, I was the clinical instructor of someone preparing to enter the field of physical rehabilitation. Looking over her test prep materials, I came across a question on p values that I knew was wrong. It was something along the lines of:

> You have read 3 studies on an intervention. The p values are 0.6, 0.06, and 0.01 in favor of treatment. [I assume they are 2 tailed.] In terms of evidence based practice, what should you do?

The “right” answer was something like “Do not use the intervention until there is better evidence.”

Suffice it to say, I showed her why it was wrong (including by reading the Greenland, Senn, et al. piece on p value misconceptions and the ASA p value statement). At the time, I advised her to look at aggregate effect sizes, condition on the actual data, and focus less on the hypothesis test formulation.

I also told her that this view was still simplistic in that it does not account for publication bias, but as a way of reasoning about evidence it was closer to the correct approach than the book’s guidance.

I didn’t want to get into the lack of power of “vote count” meta-analysis techniques, but maybe I should have.

I wasn’t all that satisfied with my answer; I didn’t think it would be persuasive enough to convince any test prep authors to change the question. But the p value transform technique would be.

In the example I posed, if we were to use the p value combination technique in a meta-analysis, we would compute \frac{\sum_{i=1}^{n}\Phi^{-1}(1 - p_i/2)}{\sqrt{n}} (assuming the reported p values are 2-tailed), as in the Stouffer test without sample sizes. Liptak’s weighted version would be preferable if we had the sample sizes available. The evidence is a random quantity (just like p) with a standard error of \pm 1. It provides evidence of direction even when we do not have the actual effect sizes.

In our naive “hypothesis testing” version, our meta-analysis would “reject” the null of no effect in any study with p = 0.002 when I do this in a spreadsheet. But we can keep the metric as a continuous quantity: Kulinskaya et al. would say there is “weak” to “moderate” evidence of a positive effect, where “weak” evidence is between about 1.64 and 3.3, “moderate” between 3.3 and 5, and “strong” above 5.
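For concreteness, here is a sketch of that unweighted Stouffer combination for the three test-prep p values, again stdlib only; variable names are my own, and this assumes (as I did above) that the reported p values are 2-tailed:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
p_values = [0.6, 0.06, 0.01]  # two-tailed p values from the 3 studies

# Convert each to one-sided evidence z_i = Phi^{-1}(1 - p_i/2),
# then combine via Stouffer's method with equal weights.
z_scores = [nd.inv_cdf(1 - p / 2) for p in p_values]
combined_z = sum(z_scores) / sqrt(len(p_values))
combined_p = 1 - nd.cdf(combined_z)  # one-sided combined p

print(round(combined_z, 2))  # 2.88
print(round(combined_p, 3))  # 0.002
```

The combined evidence of about 2.9 is what lands in the “weak” to “moderate” band, even though a vote-count reading of the three studies would say the evidence is mostly null.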