Language for communicating frequentist results about treatment effects

Sander · November 16, 2018, 4:21pm

Frank: This disagreement is central to the topic heading here as well as to its sister topic of communicating Bayesian results. In all this I am surprised by your failure to see the frequentist-Bayes mapping, which is key to proper use and description of both types of statistics. Please read Good (1987 at least; it’s one page).

If we treat the hierarchical model used for your Bayesian odds (prior+conditional data model) as a two-stage (random-parameter) sampling model, we can compute a P-value for the hypothesis that the treatment improves at least 3 of 5 outcomes. This computation can use the same estimating function as in classical Bayesian computation, where the log-prior is treated as a penalty function added to the loglikelihood. The penalized-likelihood-ratio P-values are then Laplace-type approximations to posterior probabilities (and excellent ones in regular GLMs). Even better, the mapping provides checks on standard overconfident Bayesian probabilities such as “the” posterior probability that at least 3 of 5 outcomes are improved by treatment.
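For a single outcome the mapping can be checked numerically. The sketch below assumes a normally distributed estimate b (for example a log odds ratio) with standard error s and a normal N(m, v) prior, so that the log-prior is a quadratic penalty. The function names and numbers are hypothetical. In this normal-normal case the penalized loglikelihood is exactly quadratic, so the one-sided penalized-likelihood-ratio P-value for θ ≤ 0 coincides with the posterior probability that θ ≤ 0, which is computed separately here by grid integration. In GLMs the agreement is only approximate.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def penalized_fit(b, s, m, v):
    """Maximize the penalized loglikelihood
        pl(theta) = -(b - theta)**2 / (2 s**2) - (theta - m)**2 / (2 v),
    i.e. a normal loglikelihood for the estimate b (SE s) plus the log of a
    N(m, v) prior used as a quadratic penalty."""
    info = 1.0 / s**2 + 1.0 / v                # curvature of pl (penalized information)
    theta_hat = (b / s**2 + m / v) / info      # penalized MLE
    return theta_hat, info


def penalized_lr_pvalue(b, s, m, v):
    """One-sided P-value for H: theta <= 0 from the signed root of the
    penalized likelihood-ratio statistic."""
    theta_hat, info = penalized_fit(b, s, m, v)
    # pl is exactly quadratic here, so 2*[pl(theta_hat) - pl(0)] = info * theta_hat**2
    # and the signed root is theta_hat * sqrt(info).
    r = theta_hat * math.sqrt(info)
    return norm_cdf(-r)


def posterior_prob_leq0(b, s, m, v, lo=-10.0, hi=10.0, n=100_001):
    """P(theta <= 0 | b) under the same prior, by brute-force grid integration."""
    step = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        t = lo + i * step
        w = math.exp(-(b - t) ** 2 / (2 * s**2) - (t - m) ** 2 / (2 * v))
        den += w
        if t <= 0:
            num += w
    return num / den


# Hypothetical log odds ratio 0.4 (SE 0.2) with a N(0, 0.5**2) prior.
print(penalized_lr_pvalue(0.4, 0.2, 0.0, 0.25))
print(posterior_prob_leq0(0.4, 0.2, 0.0, 0.25))
```

The two printed numbers should agree up to grid-integration error. Extending this to "at least 3 of 5 outcomes" would need a joint model for the five effects, which this sketch does not attempt.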

First, the mapping provides a cognitive check in the form of a sampling narrative: If treating the prior as a parameter sampler (or its log as a frequentist penalty) looks fishy, we are warned that maybe our prior doesn’t have a good grounding in genuine data. Such narrative checks are essential; without them we should define a Bayesian statistician as a fool who will base analyses on clinical opinions from those whose statistical understanding he wouldn’t trust for a nanosecond and whose beliefs have been warped by misreporting of the type seen in Brown et al. (which was headlined in Medscape as showing “no link”).

Going “full-on Bayes” is dangerous not only when it fails to check its assumptions within a contextual sampling narrative, but also when it misses Box’s point that there are frequentist diagnostics for every Bayesian statement. In your example, posterior odds of at least 3:2 on improvement would call for diagnostics on the (prior+data) model used to make that claim. These would include a P-value for the compatibility of the prior with the likelihood (not a “posterior predictive P-value”, which is junk in frequentist terms).
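One way to compute such a prior-data compatibility P-value is sketched below, in the spirit of Box's prior predictive checks. It assumes a normal prior N(m, v) and an approximately normal estimate b with standard error s. Under the two-stage sampling model the estimate is marginally N(m, v + s²), so a standardized distance between b and the prior mean gives a P-value that uses the prior and the likelihood but not the posterior. The function name and numbers are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prior_data_pvalue(b, s, m, v):
    """Two-sided P-value for the compatibility of a N(m, v) prior with an
    estimate b (SE s). Under the two-stage sampling model
        theta ~ N(m, v),  b | theta ~ N(theta, s**2),
    the marginal (prior predictive) distribution of b is N(m, v + s**2)."""
    z = (b - m) / math.sqrt(v + s**2)
    return 2.0 * norm_cdf(-abs(z))


# Hypothetical estimate: log odds ratio 0.8 with SE 0.2.
print(prior_data_pvalue(0.8, 0.2, 0.0, 0.1**2))   # tight null-centred prior
print(prior_data_pvalue(0.8, 0.2, 0.0, 0.5**2))   # wider null-centred prior
```

With these made-up numbers the tight prior is sharply incompatible with the estimate, while the wider one is not. A posterior probability computed from the first model would therefore deserve little trust.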

Frequentist results are chronically misrepresented and hence miscommunicated. Yes, their Bayesian counterparts can be phrased more straightforwardly, but that’s not always an advantage, because their overconfident descriptions are harder to see as misleading. “The posterior probability” is a prime example: there is no single posterior probability, only a posterior probability computed from a possibly very wrong hierarchical (prior+data) model. Similarly, there is no single P-value for a model or hypothesis.
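The sketch below illustrates that “the” posterior probability of improvement on at least 3 of 5 outcomes depends on the model that produced it, by computing it under two null-centred normal priors. It rests on assumptions flagged here and in the code: the estimates and standard errors are hypothetical, the likelihoods are normal approximations, and the five outcome effects are treated as independent a posteriori, which real, correlated outcomes would violate.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def posterior_prob_benefit(b, s, m, v):
    """P(theta > 0 | b) for estimate b (SE s) under a N(m, v) prior (normal-normal)."""
    post_var = 1.0 / (1.0 / s**2 + 1.0 / v)
    post_mean = post_var * (b / s**2 + m / v)
    return norm_cdf(post_mean / math.sqrt(post_var))


def prob_at_least(probs, k):
    """P(at least k events occur), ASSUMING the events are independent (Poisson-binomial)."""
    dist = [1.0]                              # dist[j] = P(exactly j events so far)
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1.0 - p)           # this event does not occur
            new[j + 1] += q * p               # this event occurs
        dist = new
    return sum(dist[k:])


# Hypothetical log odds ratio estimates and SEs for five outcomes.
ESTIMATES = [0.30, 0.10, 0.25, -0.05, 0.20]
SES = [0.15, 0.20, 0.18, 0.20, 0.15]


def prob_at_least_3_of_5(prior_sd):
    """Posterior P(at least 3 of 5 effects > 0) under a N(0, prior_sd**2) prior on each."""
    probs = [posterior_prob_benefit(b, s, 0.0, prior_sd**2) for b, s in zip(ESTIMATES, SES)]
    return prob_at_least(probs, 3)


print("vague prior (sd 10):     ", prob_at_least_3_of_5(10.0))
print("skeptical prior (sd 0.1):", prob_at_least_3_of_5(0.1))
```

The vague-prior figure comes out noticeably higher than the skeptical-prior one, and neither is “the” posterior probability.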

How is this relevant to proper interpretation and communication? Well, frequentist interpretations would improve if they recognized that the size of a computed P-value may reflect shortcomings of the model used to compute it, other than falsity of the targeted hypothesis. The hierarchical mapping tells us that Bayesian interpretations would likewise improve if they recognized that the size of a posterior probability may reflect only shortcomings of the underlying hierarchical (prior+sampling) model used to compute it, rather than a wise bet about the effect.
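The first point can be illustrated with a hypothetical example. The same estimate yields very different P-values under a data model that assumes independent observations and under one that allows for clustering. Here clustering is handled via the standard design effect 1 + (m - 1)ρ for equal cluster sizes m and intraclass correlation ρ. All numbers are made up. Nothing about the targeted null hypothesis changes between the two calculations; only the model changes.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_sided_p(estimate, se):
    """Two-sided Wald P-value for the hypothesis that the true effect is 0."""
    return 2.0 * norm_cdf(-abs(estimate / se))


# Hypothetical numbers: one estimate, two data models.
estimate = 0.30
naive_se = 0.12                     # SE computed as if all observations were independent
cluster_size, icc = 20, 0.05        # clustering that the naive model ignores
design_effect = 1 + (cluster_size - 1) * icc   # variance inflation for equal-size clusters
cluster_se = naive_se * math.sqrt(design_effect)

p_naive = two_sided_p(estimate, naive_se)
p_cluster = two_sided_p(estimate, cluster_se)
print(f"independence model: P = {p_naive:.3f}")
print(f"clustering allowed: P = {p_cluster:.3f}")
```

With these inputs the independence model gives P ≈ 0.012, while the clustering-aware model gives P ≈ 0.07.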

In my view, providing only a posterior distribution (or P-value function) without such strong conditioning statements is an example of uncertainty laundering (Gelman’s term), just as calling P-values “significance levels” or the intervals summarizing them “confidence intervals” is. And I think it will lead to much Bayesian distortion of the medical and health literature. Even more distortion will follow once researchers learn how to specify priors that better ensure the null ends up with high or low posterior probability (or in or out of the posterior interval). Hoping to report the biggest effect estimate you can? Use a reference prior or Gelman’s silly inflated-t(1)/Cauchy prior on that parameter. Hoping instead to report a “null finding”? Use a null spike or a double exponential/Lasso prior around the null. Via the prior, Bayes opens up massive flexibility for subtly gaming the analysis. This comes in addition to the flexibility in the data model that was already there, which Brown et al. exploited, along with dichotomania, in switching from a Cox model to HDPS to frame their foregone null conclusions.
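A grid-integration sketch makes the prior-shopping point concrete. The prior families are the ones named above: a flat prior stands in for a reference prior, alongside a Cauchy, a narrow null-centred normal that stands in for a null spike, and a double exponential/Lasso prior. The data, every prior scale, and the normal likelihood for the estimate are hypothetical choices for illustration.

```python
import math


def log_prior(kind, t, scale):
    """Unnormalized log prior density at t; every non-flat prior is centred at the null (0)."""
    if kind == "flat":
        return 0.0
    if kind == "normal":
        return -0.5 * (t / scale) ** 2
    if kind == "laplace":                      # double exponential (Lasso-type)
        return -abs(t) / scale
    if kind == "cauchy":                       # t with 1 degree of freedom
        return -math.log1p((t / scale) ** 2)
    raise ValueError(f"unknown prior: {kind}")


def posterior_summary(b, s, kind, scale, lo=-5.0, hi=5.0, n=50_001):
    """Posterior mean and P(theta > 0) by grid integration, normal likelihood for b (SE s)."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    logpost = [-((b - t) ** 2) / (2 * s**2) + log_prior(kind, t, scale) for t in grid]
    top = max(logpost)
    w = [math.exp(lp - top) for lp in logpost]   # subtract the maximum for numerical stability
    total = sum(w)
    mean = sum(t * wi for t, wi in zip(grid, w)) / total
    p_pos = sum(wi for t, wi in zip(grid, w) if t > 0) / total
    return mean, p_pos


# Hypothetical data: log odds ratio 0.5 with SE 0.25 (z = 2); prior scales are arbitrary.
PRIORS = [("flat", 1.0), ("cauchy", 2.5), ("normal", 0.2), ("laplace", 0.1)]
for kind, scale in PRIORS:
    mean, p_pos = posterior_summary(0.5, 0.25, kind, scale)
    print(f"{kind:8s} posterior mean {mean:.3f}   P(theta > 0) {p_pos:.3f}")
```

With the data held fixed, the posterior mean shrinks from the flat prior through the Cauchy and the normal to the double exponential, and P(θ > 0) falls along with it. The choice of prior alone is enough to move the headline result.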