What are credible priors and what are skeptical priors?

prior
interpretation
bayes

#1

A few weeks ago Dan Scharfstein asked a group of colleagues about how to report an odds ratio of 1.70 with 95% confidence limits of 0.96 and 3.02. Back-calculating from these statistics gives a two-sided P of 0.06 or 0.07, corresponding to an S-value (surprisal, log base 2 of P) of about 4 bits of information against the null hypothesis of OR=1. So, not much evidence against the null from the result, but still favoring a positive association over an inverse one, and so thought worthy of reporting as such. The problem was that the journal to which this was submitted was still using the magic 0.05 alpha level as a reporting criterion.
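The back-calculation can be sketched in a few lines, assuming the usual normal approximation on the log odds ratio scale:

```python
import math
from scipy.stats import norm

# Reported OR 95% confidence limits
lo, hi = 0.96, 3.02
beta_hat = (math.log(lo) + math.log(hi)) / 2      # point estimate of ln(OR)
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # standard error of ln(OR)

z = beta_hat / se
p_two_sided = 2 * norm.sf(abs(z))                 # two-sided P for OR = 1
s_value = -math.log2(p_two_sided)                 # surprisal in bits
```

This recovers the OR of 1.70, a two-sided P of about 0.07, and an S-value of about 4 bits.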

The possibility of citing Bayesian alternatives was raised. The log odds ratio β = ln(OR) has no logical bound, so the classical (Laplacian) P-value would be the posterior probability of β ≤ 0 (OR≤1) from an improper uniform prior on β (e.g., see Student 1908), which equals half the usual two-sided P-value for β=0. That’s a bit over 0.03 in this example, so squeaks under 0.05. It’s not clear however that this result would mollify a journal stuck on a 0.05 cutoff, and might be seen as P-hacking. Then too a lot of Bayesians (e.g., Gelman) object to the uniform prior because it assigns higher prior probability to β falling outside any finite interval (−b,b) than to falling inside, no matter how large b; e.g., it appears to say that we think it more probable that OR = exp(β) > 100 or OR < 0.01 than 100>OR>0.01, which is absurd in almost every real application I’ve seen.

Almost as absurd in typical applications are the proper priors that are billed as “reference priors” or “weakly informative priors,” such as those based on rescaled t distributions with few degrees of freedom. For example, the upscaled t1 (Cauchy) prior for β proposed by Gelman et al. (Annals App Stat 2008, 2, 1360–1383) assigns 26% probability to OR>25 or OR<0.04. The Jeffreys prior for β fares little better, assigning 25% probability to the same event. Worse, yet often seen, are normal(0,10^n) priors with n≥2, which assign over 75% probability to OR>25 or OR<0.04. These are huge probabilities for associations so large that no one would believe them in typical “soft-science” applications: Reported associations that large are almost always numerical artefacts, and if real would have been obvious from almost any replication attempt. So these priors and other “weakly informative” or “reference” priors are nowhere near what anyone believes in typical contexts; in other words, none of these priors are credible. Why then should we take seriously a posterior probability generated using one of these contextually incredible priors? At best, it might serve as a weak bound on credible posterior probabilities derived from informative priors, but the ordinary one-sided P-value already fills that role (see Casella and Berger 1987, “Reconciling Bayesian and frequentist evidence in the 1-sided testing problem”, Journal of the American Statistical Association, 82, 106-135).
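A quick way to audit any proposed prior is to compute the probability it assigns to contextually absurd regions. A minimal sketch for the normal(0,10^2) case (the helper name is mine; for n = 2 the probability is already about 75%, and it grows with n):

```python
import math
from scipy.stats import norm

def p_outside(or_bound, prior_sf):
    """Hypothetical helper: prior probability that OR > or_bound or
    OR < 1/or_bound, for a prior on beta = ln(OR) symmetric about 0
    (prior_sf is the prior's survival function)."""
    return 2 * prior_sf(math.log(or_bound))

# normal(0, 10^2) prior on beta, i.e. SD 10 (the n = 2 case above)
p = p_outside(25, lambda b: norm.sf(b, scale=10))   # about 0.75
```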

Frank Harrell suggested instead giving the posterior probability of a positive association from a “skeptical” prior, but did not specify what such a prior would look like. He wrote that “Skeptical in my mind is a prior that on the log scale [here, for β] has mean 0 and variance chosen so that the probability of a large effect (or prob. of an effect less than the reciprocal of that) is say 0.05.” That leaves wide open at least one prior parameter (the variance), and leaves open even more if one allows non-normal priors. Thus, in opting for a contextually credible Bayesian definition of “skeptical” we’ve set ourselves a specification task requiring unfamiliar choices. Furthermore, any proposed default can generate many objections, as just discussed for vague “reference” priors. No wonder then that Bayesian methods haven’t caught on.

Still, I’m going to outline an approach I’ve used to operationalize “large effect” in an epidemiologic context. To start, I want a prior not too skeptical or informative, as I would not want to obscure potentially important relationships or overweight prior information from “experts” (who in my experience are often grossly overconfident about effect sizes based on rather weak previous evidence). On the other hand, I usually want to shrink down the sort of inflated associations that get highlighted by selecting on “significance” (an inflation that gets worse as the cutoff is dropped, as when multiple-comparison adjustments are used). Focusing on the epidemiology of rare (and thus hard to study) diseases like cancers, I have noticed that effects get called “large” when the OR is outside the range of about ¼ to 4. In the rare-disease case I would thus compare the one-sided P-value for OR≤1 to the posterior probability of OR≤1 under some sensible if neutral prior, e.g., “large effect unlikely” represented by using a prior on β = ln(OR) symmetric about 0 that produces a 95% prior interval for the OR=exp(β) of (¼,4). To gauge the impact of the prior, I would also compare the resulting 95% posterior interval to the 95% confidence interval.

That leaves the shape of the prior to be determined. A familiar form that yields the desired 95% prior interval of (¼,4) for OR is the normal(0,½) distribution for β (lognormal for OR). This choice also places about 2:1 prior odds on the OR interval (½,2). Nonetheless, a choice I prefer over the normal for more elegant computation and connection to prior information in logistic regression is the conjugate prior for β, the log-F distribution (F for OR). Equating the numerator and denominator degrees of freedom produces a log-F(m,m) prior for β, which like the normal is unimodal and symmetric around its mode; and the log-F(9,9) prior distribution produces a 95% prior interval for OR of (¼,4).
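Both properties of the normal(0,½) prior are easy to check numerically (a sketch):

```python
import math
from scipy.stats import norm

sd = math.sqrt(0.5)   # normal(0, 1/2) prior on beta = ln(OR)

# Upper 95% prior limit for OR = exp(beta); lower limit is its reciprocal
upper = math.exp(norm.ppf(0.975, scale=sd))   # close to 4

# Prior odds that OR falls in (1/2, 2)
p_mid = norm.cdf(math.log(2), scale=sd) - norm.cdf(math.log(0.5), scale=sd)
prior_odds = p_mid / (1 - p_mid)              # close to 2:1
```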

Since the log-F(m,m) is rather unfamiliar, as an aside I give some details: Although it has heavier tails than the normal, it approaches normality rapidly as m increases, and for m≥9 produces results negligibly different from a normal with the same 95% central interval. The shared degrees of freedom m is the number of independent Bernoulli trials with parameter π = expit(β) = exp(β)/(1+exp(β)) = OR/(1+OR) needed to encode the information about β in the prior. This relation enables the user and reader to grasp the evidential strength of the prior in terms of observing m tosses from a coin-tossing mechanism. For example, specifying a 95% prior interval for OR of (¼,4) is claiming to have prior information on OR = π/(1−π) equivalent to the information on the odds of heads vs. tails obtained from observing 4 or 5 heads in 9 independent tosses. Given the log-F convergence to normality, the same interpretation could be given to the normal(0,½) prior. For more about encoding prior information and computation with log-F priors see Greenland 2007, “Prior data for non-normal priors,” Statistics in Medicine, 26, 3578-3590.
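The (¼,4) interval is simple to verify, since β having a log-F(9,9) prior means exp(β) ~ F(9,9):

```python
from scipy.stats import f

# exp(beta) ~ F(9, 9) when beta = ln(OR) has the log-F(9, 9) prior
lo, hi = f.ppf(0.025, 9, 9), f.ppf(0.975, 9, 9)   # about (1/4, 4)
```

By the symmetry of the F(m,m) distribution under reciprocation, the lower limit is exactly the reciprocal of the upper limit.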

OK, so there’s my working answer to Frank’s suggestion for a skeptical prior, at least for a “rare” disease. Applying the normal(0,½) for β to Dan’s 95% confidence interval for the OR of (0.96, 3.02) yields an approximate posterior mean and 95% interval for OR of 1.58 and (0.93, 2.68), with a posterior probability (“Bayesian P-value”) for OR≤1 of 0.046 (those numbers were computed using inverse-variance-weighting as per Ch. 18 of Modern Epidemiology 3rd ed. 2008 - and no, I did not cook them to keep the probability of OR≤1 below 0.05).
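That inverse-variance weighting can be sketched as follows (normal approximation to the likelihood assumed); it reproduces the quoted numbers:

```python
import math
from scipy.stats import norm

# Likelihood summary from Dan's result: OR 95% CI (0.96, 3.02)
b = (math.log(0.96) + math.log(3.02)) / 2
se = (math.log(3.02) - math.log(0.96)) / (2 * 1.96)

# Normal(0, 1/2) prior on beta = ln(OR), combined by precision weighting
prior_var = 0.5
w_data, w_prior = 1 / se**2, 1 / prior_var
post_var = 1 / (w_data + w_prior)
post_mean = (b * w_data) / (w_data + w_prior)   # prior mean is 0

or_post = math.exp(post_mean)                   # about 1.58
ci = [math.exp(post_mean + s * 1.96 * math.sqrt(post_var)) for s in (-1, 1)]
p_null = norm.cdf(0, loc=post_mean, scale=math.sqrt(post_var))   # about 0.046
```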

However, Dan’s example involved an outcome which may not be rare. For common (and thus more precisely estimated) outcomes like myocardial infarction “large” tends to mean outside the range of about ½ to 2. So I expect a genuinely informed skeptical prior for the OR in his application would need to be much narrower than normal(0,½), although how much narrower depends on contextual details which were not given. Finally, due in part to the counterintuitive behavior of odds ratios for common outcomes (including noncollapsibility), I would not have reported an OR as the final estimate – I might have instead reported standardized risk ratios or differences (“marginal effects”) derived from a logistic model.


#2

Sander this is very clear, well set up, and convincing. All I can add is a bit more context.

In my view we need to be pre-specifying primary study analyses that are Bayesian, and to settle the choice of prior (whether skeptical or previous data- or knowledge-based) before the choice could possibly be influenced by the results, or be manipulated by unscrupulous investigators. That being said, there is great value in Bayesian interpretation of already-completed results. Since for this setting it is not possible to find an already pre-specified prior, a skeptical prior is often fitting. I can think of two modes for choosing such priors:

  • Find the ultimate judges of the research and elicit skeptical priors from them, or
  • Do as Sander wrote and select a reasonable skeptical prior that is likely either to convince most skeptics or just to make large effects unlikely. The latter assumption is very plausible in most areas of research. Even if the observed effect were equivalent to a “cure” (effect ratio of 0.0), the Bayesian posterior median effect would remain very impressive under such a skeptical prior.

The second option is more feasible. As with Sander, I take ‘skeptical’ to mean equal probability of harm and benefit. More specifically, I take the prior probability that the effect ratio is < r to be equal to the prior probability that it is > 1/r. I often simplify this to solving for the variance of a normal prior distribution for the log effect ratio, but Sander advocates a more flexible F-distribution-based approach, which I also like.
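Solving for that variance takes one line; the helper below and the particular tail probabilities are only illustrative assumptions:

```python
import math
from scipy.stats import norm

def skeptical_sd(large_ratio, tail_prob):
    """Hypothetical helper: SD of a mean-zero normal prior on the log
    effect ratio such that P(ratio > large_ratio) = tail_prob (and, by
    symmetry, P(ratio < 1/large_ratio) = tail_prob)."""
    return math.log(large_ratio) / norm.ppf(1 - tail_prob)

sd_rare = skeptical_sd(4, 0.025)     # Sander's rare-disease prior: var near 1/2
sd_common = skeptical_sd(2, 0.025)   # narrower prior for common outcomes
```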


#3

What would you suggest for priors in a situation where an effect is much more likely in one direction? This is fairly common but I don’t recall ever seeing it addressed. Examples might be things like therapist interventions for back pain, where the intervention is pretty unlikely to make the pain worse, but could plausibly do very little. Another example might be effects of antibiotics on infection - they almost certainly won’t make it worse but might not do enough (or have other unintended effects) to make them worthwhile.


#4

Two parts to my reply:

  1. First, your examples show how priors can be controversial and very dangerous. Some physical therapies can unexpectedly worsen injuries and hence long-term pain. For decades physicians prescribed low-fiber diets for bowel problems until evidence accumulated they were doing more harm than good. Antibiotics have often been given for what turned out to be infections by a virus, protist, or resistant bacterium, in which case the antibiotic can worsen illness by destroying competing, nonresistant bacteria and delaying initiation of effective treatment (not to mention producing direct adverse side effects, as with fluoroquinolones). The point is that use of asymmetric priors could delay recognition of costly effects in the unexpected direction.
  2. Still, there are special cases in which asymmetric priors seem reasonable and safe, e.g., where missing the unexpected direction entails no loss. I gave a detailed example in 2003 (“Generalized conjugate priors for Bayesian analysis of risk and survival regressions,” Biometrics 59, 92-99) using a generalized-conjugate prior for logistic regression, with graphs of the prior. That prior reduces to the generalized (location shifted and rescaled) log-F(m,n) in the univariate case covered in my 2007 SIM article. The log-F allows skewness by using unequal degrees of freedom, with ln(m/n) functioning as a natural skewness parameter. See also Appendix 1 of Greenland 2009, “Bayesian perspectives for epidemiologic research. III. Bias analysis via missing-data methods,” International Journal of Epidemiology 38, 1662-1673 (corrigendum in IJE 2010, 39, 1116).
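To see the skewness that unequal degrees of freedom produce, one can look at log-scale quantiles of a log-F(m,n); the (4, 12) choice below is purely illustrative:

```python
import math
from scipy.stats import f

# Quantiles of beta = ln(OR) under a log-F(4, 12) prior (df chosen only
# for illustration): exp(beta) ~ F(4, 12)
q = [math.log(f.ppf(p, 4, 12)) for p in (0.025, 0.5, 0.975)]

left_tail = q[1] - q[0]    # distance from median to lower 2.5% point
right_tail = q[2] - q[1]   # distance from median to upper 97.5% point
# With m < n (so ln(m/n) < 0), the prior is skewed toward OR < 1
```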

#5

Would a meta-epidemiological approach be admissible in this case? Say, analyzing a large dataset of previous RCT results and using this to inform the appropriate prior probability of different effect sizes? Could include covariates for disease area, outcome type, etc.


#6

I would say a meta-epidemiological approach makes some sense to me, but of course subject to all of the challenges that come with quantitative evidence synthesis (study quality, credible designs for causal effects, heterogeneity, etc). I have tried to take this approach in a recent paper (under review) that used prediction intervals from a random-effects meta-analysis as a starting place for thinking about the prior, with of course some sensitivity analyses for other priors or shapes. I’d be curious to hear if folks strongly object to this kind of approach.


#7

Nice post @Sander. Although it is somewhat tangential to the title of this thread, I want to highlight some recent work at

http://metalogdistributions.com/publications.html

that I think will become important to specifying priors, at least in a language like Stan that is not concerned with conjugacy or anything like that.

The essence of this research is to use “quantile parameterized distributions”, which are essentially distributions whose parameters are quantiles. So, if you can specify or elicit a prior median for a parameter and at least two or three other quantiles, then it is possible to construct a probability distribution that has those quantiles. In the case of an unbounded distribution like that for a regression coefficient, there are a couple of quantile parameterized distributions, namely the simple q-normal and the metalog(istic). In the case of a distribution that is bounded from above and / or below, there are some good choices based on the Johnson system.

Anyway, for a regression coefficient, one would often set the prior median to be zero and then would need to set a few more quantiles based on what you think is the minimum value for a large effect. I’ll be talking about this idea and Stan a bit more Saturday at the R/Medicine conference if anyone is interested.
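A minimal sketch of that recipe, using the plain normal as the simplest two-quantile case (the helper is hypothetical, not from any metalog library):

```python
import math
from scipy.stats import norm

def normal_from_quantiles(median, q975):
    """Hypothetical helper: the unique normal pinned down by its median
    and its 97.5th percentile."""
    return median, (q975 - median) / norm.ppf(0.975)

# Prior median OR = 1; take ln(4) as the 97.5th percentile of ln(OR)
mu, sd = normal_from_quantiles(0.0, math.log(4))
```

With these two quantiles the result is essentially the normal(0,½) prior discussed above; a genuinely quantile-parameterized family like the metalog would let a third or fourth elicited quantile bend the shape away from normality.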


#8

Good question, which at the moment I can only answer with another question: If you are going to make such a meta-analytic prior, why not just create a meta-analysis of the entire body of literature including the current study? That might take no more effort and could be a more informative contribution to the literature.

A related notion is the idea that, for reporting, a frequentist analysis should take precedence over a Bayesian analysis by supplying a summary easily integrated into research syntheses and meta-analyses (see for example Stephen Senn’s writings on Bayes). After all, if we were doing a synthesis would we really want to combine Bayesian study results contaminated by and correlated through the priors used by each research group? I wouldn’t want to, with the exception that some reference Bayes results are actually better frequentist results than the usual ones (I am thinking especially of the 2nd-order bias correction created by Jeffreys priors).


#9

Triggered by Dan’s frustration, Sander provides valuable guidance on specifying a Bayesian analysis for the Odds Ratio (I get back to it below). However, my big point isn’t choice among Bayes/frequentist/fiducialist/whateverist, or prior specification; it’s that journals (or other institutions) are not willing to publish or post evidence that isn’t “definitive” by some rather arbitrary definition. Failure to publish/post deprives the science and policy communities of valuable information in its own right and as input to a synthesis. Of course, a study needs to satisfy standards regarding design, conduct and analysis, but if it does so, results should be made available by posting the data or, at minimum, the likelihood based on a clearly communicated model. This proposal is a golden oldie, and for me is the dominant “ask” generated by Dan’s frustration.

Now, back to Sander’s analysis. I support the idea of moving away from default priors, paying attention to the prior probability content of parameter regions, including those which (almost) all would agree are unlikely. Probability content is key: equating low information with big variance can be deceptive. For a frequently-cited example, the probability below 1.0 for a Gamma distribution with mean = 1.0 and variance = 10,000 is 0.999, hardly uninformative. Taking “prior” seriously, we should engage as much as possible in a priori, protocol-based prior elicitation (watch for Tony O’Hagan’s forthcoming article in The American Statistician, “Expert Knowledge Elicitation: Subjective, but Scientific”). Elicitation can be effective for low-dimension parameters, but we likely need to rely on generic advice for some components of high-dimensional parameters, and surely for their full joint distribution. For effective elicitation, transform the parameter space to create a low-dimension subspace of interpretable parameters, likely in prediction space (Sander’s standardized risk ratios or risk differences are good examples), with the likely need to use defaults on the complementary subspace.
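That Gamma example is easy to verify numerically (a sketch using method-of-moments shape and scale):

```python
from scipy.stats import gamma

mean, var = 1.0, 10_000.0
shape = mean**2 / var   # method of moments: shape = mean^2 / var
scale = var / mean      # scale = var / mean

p_below_1 = gamma.cdf(1.0, shape, scale=scale)   # about 0.999
```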

Let’s keep the conversation going!


#10

The idea is to specify a reasonable prior (for active treatment A vs placebo) if no relative effect trial has been done before. If exchangeability can be assumed, then can I use the predictive distribution derived from an appropriate set of historical trials (of active treatments vs placebo) to produce an informative prior for A vs placebo?


#11

Perhaps I am not understanding your question as it raises for me yet another question: What is the basis for assuming exchangeability of effects with (other) active treatments? How is “the appropriate set of historical trials” and the treatments they examine determined? Aspirin for headaches? Penicillin for pneumonia? I don’t see enough detail for justifying a prior in the description so far… It’s hard to create credible priors because they need such details spelled out, along with a derivation of the prior from those details (where presumably the devil dwells). [For those who remember Anthony Quinn’s parting lines to Omar Sharif in Lawrence of Arabia, a paraphrase: “Being a credible Bayesian will be thornier than thou suppose!”].


#12

Thanks for all your input @Sander - really appreciate it.

Maybe more detail would help! I am doing a PhD to develop methods to help research bodies prioritize research proposals. To estimate the value of a proposed RCT I am using a method from decision theory called value of information. This requires a Bayesian prior for the treatment effect for every trial proposal. This is an issue because in many cases there will have been no previous randomized trial with which to inform the uncertainty, and eliciting quantitative expert opinion could be problematic (for a number of reasons).

What I want to do is to use off-the-shelf priors as a best guess on the range of plausible effect sizes from a set of historical studies. The historical set would be the entire set of results from all studies funded by the research funding agency (all good quality and pre-registered); I could also adjust this for disease area, active/passive comparison, etc. Constructing the prior would indeed involve combining previous studies from a wide variety of areas such as aspirin for headaches and penicillin for pneumonia. The idea is that if there is no knowledge about the effect of the current treatment then the best guess would be the effects seen for previous treatments. (There is an issue with selection here if the studies in the historical set were funded because they were expected to work and this expectation was correct.)

There is currently very little quantitative consideration when deciding which studies to prioritize over others so my aim is to provide a quantitative starting point for discussion.

Any thoughts would be very much appreciated.


#13

If you have historical studies, then using one of the Quantile Parameterized Distributions that I alluded to above is just a question of using the empirical CDF of the historical estimates to pin down the quantile parameters.
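For instance (with numbers invented purely for illustration), pinning down three quantile parameters from a vector of historical estimates is a one-liner:

```python
import numpy as np

# Hypothetical historical log effect-ratio estimates (illustrative only)
historical = np.array([-0.4, -0.3, -0.2, -0.1, 0.0, 0.0, 0.1, 0.2, 0.3, 0.5])

# Empirical quantiles to pin down a quantile-parameterized prior
q10, q50, q90 = np.quantile(historical, [0.1, 0.5, 0.9])
```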


#14

OK DGlynn, thanks for the details.
I have to challenge some elements of your description and tentative proposal:

  1. “No knowledge about the effect of the current treatment”: That’s not credible. No one would come forward with an RCT proposal, nor should an IRB seriously entertain one, without enough supporting theory and evidence (biochemical, in vitro, and animal studies, uncontrolled human studies, perhaps even up through phase I trials) that there should be a beneficial effect far outweighing potential harms…
    https://en.wikipedia.org/wiki/Phases_of_clinical_research
    The question is then how to use that information in forming a prior for the effect, which is hard.

  2. “Wide variety of areas” makes no sense to me. I would find nothing credible about drawing analogies about aspirin for headache or penicillin for pneumonia. Effects in other studies would be exchangeable with the proposed study effect only to the extent their treatments and outcome variables resembled those in the proposal. One can attempt to model the degree of exchangeability via a 2nd-stage regression (e.g., as done with foods regressed on nutrients in Witte et al. 2000. Multilevel modeling in epidemiology with GLIMMIX. Epidemiology, 11, 684-688) but this takes immersion in the biochemistry of the exposure and the physiology of the outcome.

  3. As you note, drawing from funded studies would induce an optimistic bias. Going to the broader literature, there is also optimistic publication bias (failure to publish, or even accessibly archive, negative studies).

These 3 considerations are far from sufficient to form a prior, but I think they are necessary to form a credible prior in your example.
They enter into a narrative approach which systematically discusses the evidence that falls under 1-3: the direct background evidence as in 1, the indirect (partially exchangeable) evidence as in 2, and the evidence about biases in the direct and indirect evidence as in 3. That approach is I think an essential prerequisite to forming a contextually credible prior, because it helps us discern what such a prior should look like in vague outline.

In my experience (using considerations like 1-3 when doing Bayesian risk analyses), the predictive improvement over the narrative that one gets from a quantitative analysis is often not enough to justify the labor, reduced transparency, and unavoidable dependence on arbitrary assumptions of the quantitative analysis. Worse, from what I’ve seen the priors can do more harm than good, for example by introducing biases from assumptions of convenience that are contextually absurd but unrecognized as such because of their mathematical formulation (common when no one correctly connected the math to the contextual information). Classic cases include Bayesian analyses that assume prior independence of rates across categories, or that place spikes at the null when there is no evidence supporting the null or indicating it should receive special weight.

Sorry if all that sounds discouraging, but I do believe noncredible Bayesian inferences and decisions can do worse than traditional inferences and decisions (narratives and judgments bolstered by simple frequentist statistics - which of course can be awful if done badly). I think the quality of both cases hinges on the quality of the contextual narrative, including the validity of the connection between the context and the model parameters and priors.


#15

This is very interesting. The quality of this forum is unparalleled.

I have been reading through Statistical Rethinking lately and exploring Bayesian methods.

Skeptical priors seem most defensible and reflect how the medical community already interprets frequentist results. Beyond that I am worried that the contextual narrative is of very questionable validity in much of medicine and anyone trying to defend a prior other than a skeptical prior will see little support for their particular choice.


#16

Excellent thoughts, again. In the spirit of Andrew Gelman’s description of “type S errors” (getting the wrong direction for the treatment effect) I do want to suggest that having priors that merely exclude impossible or highly improbable values (e.g., a previously undiscovered risk factor having an odds ratio > 5) is usually better than putting no restriction on the priors, as done by frequentist inference and by Bayesian methods that use uninformative priors. (This is consistent with your initial post).


#17

That sounds very interesting - do you have a shareable copy that I could see?


#18

Many thanks @Sander, not the response I was hoping for but I really appreciate the time and thought put into the answer.


#19

I don’t know enough about what has already been done here, but it seems timely-- given the growing recognition that p-values are problematic but without popular solutions being put forth-- to reanalyze the results of a couple of hundred trials with a skeptical prior.


#20

Dear Pavel,
I offer the following response from the perspective of an applied statistician whose main tasks have been analyzing data and writing up the results for research on focused problems in health and medical science (as opposed to, say, a data miner at Google):

Contextual narratives I see in areas I can judge are often of very questionable validity. So are most frequentist and Bayesian analyses I see in those areas. Bayesian methods are often touted as a savior, but only because they have been used so infrequently in the past that their defects are not yet as glaring as the defects of the others (except to me and those who have seen horrific Bayesian inferences emanating from leading statisticians). Bayesian methodologies do provide useful analytic tools and valuable perspectives on statistics, but that’s all they do - they don’t prevent or cure the worst problems.

All the methods rely rather naively on the sterling objectivity, good intent, and skills of the producer. Hence none of these approaches have serious safeguards against the main threats to validity such as incompetence and cognitive biases such as overconfidence, confirmation bias, wish bias, bandwagon bias, and oversimplification of complexities - often fueled by conflicts of interest (which are often unstated and unrecognized, as in studies of drug side effects done by those who have prescribed the drug routinely).

To some extent each approach can reveal deficiencies in the other and that’s why I advocate doing them all in tasks with severe enough error consequences to warrant that much labor. I simply hold that it is unlikely one will produce a decent statistical analysis (whether frequentist, Bayesian, or whatever) without first having done a good narrative analysis for oneself - and that means having read the contextual literature for yourself, not just trusting the narrative or elicitations from experts. The latter are not only cognitively biased, but they are often based on taking at face value the conclusions of papers in which those conclusions are not in fact supported by the data. So one needs to get to the point that one could write up a credible introduction and background for a contextual paper (not just a methodologic demonstration, as in a stat journal).

Statistics textbooks I know of don’t cover any of this seriously (I’d like to know of any that do) but instead focus all serious effort on the math. I’m as guilty of that as anyone, and understand it happens because it’s way easier to write and teach about neat math than messy context. What I think is most needed and neglected among general tools for a competent data analyst, and what would remedy that problem without getting very context-specific, is an explicit systematic approach to dealing with human biases at all stages of research (from planning to reviews and reporting), rather than relying on blind trust of “experts” and authors (the “choirboy” assumption). That’s an incredibly tough task however, which is only partially addressed in research audits; those need to include analysis audits.
It’s far harder than anything in math stat. In fact I hold that applied stat is far harder than math stat, and the dominant status afforded the latter in statistics is completely unjustified (especially in light of some of the contextually awful analyses in the health and med literature on which leading math statisticians appear). That’s hardly a new thought: Both Box and Cox expressed that view back in the last century, albeit in a restrained British way, e.g., see Box, Comment, Statistical Science 1990, 5, 448-449.

So as a consequence, I advocate that basic stat training should devote as much time to cognitive bias as to statistical formalisms, e.g., see my article from last year: “The need for cognitive science in methodology,” American Journal of Epidemiology, 186, 639–645, available as a free download at https://doi.org/10.1093/aje/kwx259.
That’s in addition to my previous advice to devote roughly equal time to frequentist and Bayesian perspectives on formal (computational) statistics.