Language for communicating frequentist results about treatment effects

Sander · November 15, 2018, 5:35am

For the record, I disagree with a lot of what is given here.
To start, I don’t think this statement is perfectly accurate:
“We were unable to find evidence against the hypothesis that A=B (p=0.4) with the current sample size.”
The statement claims zero evidence against the tested hypothesis, whereas I would say instead the p=0.4 constitutes almost no evidence against that hypothesis (by the Shannon surprisal measure I use, p=0.4 supplies s = log2(1/0.4) = 1.3 bits of information against the hypothesis - barely more information than in one coin toss). The sample size addition is unnecessary if harmless, as p combines (some would say confounds) sample size and estimate size and a larger estimate would have likely produced a smaller p at the same sample size (only “likely” because exceptions can occur when the standard error inflates faster than the estimate at a given sample size).

Conversely, I think it is misleading to say “The study found evidence against the hypothesis that A=B (p=0.02)” because there is no magic point at which P-values start to “find evidence” against hypotheses. Instead one could say “The study found log2(1/.02) = 5.6 bits of information against the hypothesis that A=B.” The ultimate goal is to get away from dichotomies like “negative” and “positive” trials - those are all experimental tests of the hypothesis, and their P-values measure only the amount of information each supply against that hypothesis.

I might agree with the sentiment behind “As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B”, but it also seems wrong as stated because

it seems to confuse the pre-specification problem with the fact that P-values do not (in isolation, at least) measure support, period, only degree of conflict (incompatibility) between models and data, and
a frequentist can assess similarity indirectly by specifying a similarity (equivalence) interval and seeing whether there is more than a given amount of evidence against the true difference being outside that interval. Thus I would restate it as something like “As the statistical analysis plan did not specify an interval of equivalence, we did not assess similarity of A and B.”

I also think this statement commits a potentially important if standard mistake of omission:
“Assuming the study’s experimental design and sampling scheme, the probability is 0.4 that another study would yield a test statistic for comparing two means that is more impressive that what we observed in our study, if treatment B had exactly the same true mean as treatment A.” It needs amendment to something like
“Assuming the study’s sampling scheme, experimental design, and analysis protocol, the probability is 0.4 that another study would yield a test statistic for comparing two means that is as or more impressive that what we observed in our study, if treatment B had exactly the same true mean as treatment A and all statistical modeling assumptions used to get p are correct or harmless.” That’s because almost all statistics I see in my areas are derived from regression models (whether for the outcome, treatment, or both). Sorry if the added conditions seem nit-picky, but it is not hard to find examples where their failure has a non-negligible effect.

Next, I think this statement is commonly believed and simply wrong from an information perspective: " we cannot interpret the interval except for its properties in the long run." No: As noted just below that statement, the 95% interval can be interpreted for this single data analysis as the interval of all parameter values for which p>0.05. Thus the interval shows the parameter values for which the data supply less than log2(1/0.05) = 4.3 bits of information against them, given the model used to compute the interval. This interpretation does rely on the repeated-sampling property that, given the model, the random P across studies is uniform at the correct value; this property ensures that the S-value captures the refutational information in the test statistic (note that posterior predictive P-values are not uniform and thus do not sustain this interpretation).

Finally, if not clear from the above, I disagree that P-values are wisely dispensed with in favor of confidence intervals. Confidence intervals invite the distortion of 0.05-level dichotomization. The problem with P-values is that almost no one computes them for more than the null hypothesis. They should instead be given not only for the targeted “null” but also at least for one more hypothesis, including the protocol-specified alternative used to compute power or sample size.