Using Bayesian statistics in observational studies

My understanding is that @f2harrell argues that the utilization of Bayesian statistics for the analysis of observational data would “protect us” against false results. What are the main mechanisms for this? Utilization of skeptical priors? What other advantages does the Bayesian approach bring to the table in the world of observational studies? Are there any papers or books (with convincing examples :slight_smile: ) specifically dedicated to this topic?

Bayesian analysis by itself won’t fix observational data. But it can help:

  • Instead of testing null hypotheses, quantify evidence for non-trivial effects, e.g. OR > 1.2. This cuts out multiplicity problems on its own.
  • Use very skeptical priors because most epi findings don’t replicate.
  • Formally model that a variable is only a surrogate for a real endpoint. This is akin to imputing the real endpoint from a surrogate one. Both that and joint Bayesian modeling will show that your effective sample size is much smaller than the apparent sample size. The Bayesian joint model might involve bringing in a second dataset that captures tendencies for real outcomes to develop given intermediate outcomes and baseline data.
  • Adjust for all measured confounders using shrinkage priors to keep them from overfitting.
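To make the first two bullets concrete, here is a minimal sketch of computing the posterior probability that OR > 1.2 under a skeptical prior. It assumes a normal approximation for the log odds ratio and a normal prior centered on the null; the estimate, standard error, prior limits, and the function name `prob_or_exceeds` are all hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def prob_or_exceeds(log_or_hat, se, prior_sd, threshold_or=1.2):
    """Approximate posterior P(OR > threshold_or), assuming a
    normal(0, prior_sd^2) prior on log(OR) and a normal likelihood."""
    data_precision = 1.0 / se ** 2
    prior_precision = 1.0 / prior_sd ** 2
    post_var = 1.0 / (data_precision + prior_precision)
    # Prior mean is 0, so only the data term contributes to the posterior mean.
    post_mean = post_var * data_precision * log_or_hat
    posterior = NormalDist(post_mean, math.sqrt(post_var))
    return 1.0 - posterior.cdf(math.log(threshold_or))

# Hypothetical study result: OR estimate 1.5 with standard error 0.15 on the log scale.
log_or_hat, se = math.log(1.5), 0.15

# Skeptical prior: 95% prior probability that OR lies between 1/2 and 2.
z = NormalDist().inv_cdf(0.975)
skeptical_sd = math.log(2.0) / z
# Nearly flat prior for comparison.
vague_sd = 10.0

p_skeptical = prob_or_exceeds(log_or_hat, se, skeptical_sd)
p_vague = prob_or_exceeds(log_or_hat, se, vague_sd)
print(f"P(OR > 1.2), skeptical prior: {p_skeptical:.3f}")
print(f"P(OR > 1.2), vague prior:     {p_vague:.3f}")
```

The skeptical prior pulls the posterior toward the null, so the probability of a non-trivial effect comes out lower than under the nearly flat prior. A real analysis would use the full model rather than this normal approximation.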

One use of Bayesian and closely related methods such as penalization is for reduction of small-sample instabilities and artefacts via the introduction of “shrinkage” priors, as reviewed for example in
Greenland S, Mansournia MA, Altman DG (2016). Sparse-data bias: A problem hiding in plain sight. BMJ 353:i1981, 1-6.

Bayesian and related methods are central to probabilistic bias analysis (PBA), especially but not only for observational studies. PBA extends the model for the data-generating mechanism beyond the usual forms (e.g., logistic, log-linear, proportional-hazards, structural) to include models for uncontrolled bias sources such as nonrandom selection, confounding, misclassification and measurement error. The parameters in these model extensions are given prior distributions in order to identify the effect of interest, which can then be estimated using the full expanded model and priors via simulation techniques (which may be of Bayesian form).
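As a rough illustration of the idea, here is a minimal Monte Carlo sketch of a probabilistic bias analysis for a single unmeasured binary confounder. It uses the standard external-adjustment bias factor for a risk ratio; the prior distributions, the observed risk ratio, and the function names are hypothetical assumptions for illustration, not taken from any of the papers cited in this thread.

```python
import math
import random

def bias_factor(rr_cd, p1, p0):
    """Confounding bias factor for a risk ratio.
    rr_cd: confounder-disease risk ratio
    p1, p0: confounder prevalence among the exposed and the unexposed."""
    return (rr_cd * p1 + 1 - p1) / (rr_cd * p0 + 1 - p0)

def simulate_adjusted_rr(rr_obs, se_log_rr, n_draws=20000, seed=1):
    """Draw bias parameters from (hypothetical) priors, remove the implied
    bias from the observed RR, and add conventional random error."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        rr_cd = rng.lognormvariate(math.log(2.0), 0.2)  # assumed prior
        p1 = rng.uniform(0.4, 0.6)                      # assumed prior
        p0 = rng.uniform(0.1, 0.3)                      # assumed prior
        log_rr = math.log(rr_obs) - math.log(bias_factor(rr_cd, p1, p0))
        log_rr += rng.gauss(0.0, se_log_rr)             # random error
        draws.append(math.exp(log_rr))
    draws.sort()
    pick = lambda q: draws[int(q * (n_draws - 1))]
    return pick(0.025), pick(0.5), pick(0.975)

# Hypothetical observed result: RR = 1.8, standard error 0.2 on the log scale.
lower, median, upper = simulate_adjusted_rr(rr_obs=1.8, se_log_rr=0.2)
print(f"bias-adjusted RR: median {median:.2f}, 95% interval {lower:.2f} to {upper:.2f}")
```

The resulting interval reflects both random error and uncertainty about the bias parameters, so it is typically wider than the conventional confidence interval. Every prior here would need contextual justification in a real application.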

There is now a large literature on PBA. Here is a book and some chapters that provide coverage:
Fox MP, MacLehose RF, Lash TL (2021) Applying quantitative bias analysis to epidemiologic data, 2nd edn. Springer, New York
Greenland S, Lash TL (2008) Bias analysis. Ch. 19 in Rothman KJ, Greenland S, Lash TL (eds) Modern epidemiology, 3rd edn. Lippincott-Williams-Wilkins, Philadelphia, 345–380
Greenland S (2014). Sensitivity analysis and bias analysis. Ch. 19 in Ahrens W, Pigeot I, eds. Handbook of Epidemiology, 2nd edn. Springer, New York, 685-706

Here are some introductory overviews on PBA which you may find in JSTOR or I can supply on request:
Greenland S (2005) Multiple-bias modeling for observational studies (with discussion). J R Stat Soc Ser A 168:267–308
Greenland S (2009) Bayesian perspectives for epidemiologic research. III. Bias analysis via missing-data methods. Int J Epidemiol 38:1662–1673, corrigendum (2010) Int J Epidemiol 39:1116

Some guidelines and cautionary discussions for bias analyses:
Lash TL, Fox MP, MacLehose RF, Maldonado G, McCandless LC, Greenland S (2014) Good practices for quantitative bias analysis. Int J Epidemiol 43:1969–1985
Lash TL, Ahern TP, Collin LJ, Fox MP, MacLehose RF (2021) Bias analysis gone bad. Am J Epidemiol 190:1604–1612
Greenland S (2021) Dealing with the inevitable deficiencies of bias analysis – and all analyses. Am J Epidemiol 190:1617–1621
MacLehose RF, Ahern TP, Lash TL, Poole C, Greenland S (2021) The importance of making assumptions in bias analysis. Epidemiology 32:617–624


Wow. Sander as usual you provide days of reading of absolutely essential papers in the area. Thanks!!


Thanks Frank!

I should add that it is quite easy to do Bayesian and penalized-likelihood analyses with ordinary software by coding the priors as simple data records and adding them to the actual-data set. This data-augmentation (DA) method can be automated with simple commands. Evaluated relative to posterior sampling, DA runs fast, converges reliably, and provides accurate results.
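Here is a minimal sketch of the DA idea for a logistic model with one exposure, written in plain Python so the mechanics are visible. It assumes a mean-zero prior on the log odds ratio with variance v, coded as one pseudo-record with 2/v cases and 2/v noncases, exposure set to 1, and the intercept indicator set to 0. The data and function names are hypothetical. Strictly, the pseudo-record induces a log-F prior that is close to normal only when 2/v is not small; the papers cited below cover the rescaling used to improve the approximation.

```python
import math

def prior_record(prior_var):
    """Pseudo-record (const, x, successes, trials) for a mean-zero prior
    on the log odds ratio with approximate variance prior_var."""
    a = 2.0 / prior_var
    return (0.0, 1.0, a, 2.0 * a)  # const = 0 keeps the intercept out of the prior

def fit_logistic(records, n_iter=50):
    """Newton-Raphson for grouped logistic regression with an intercept
    and one covariate. records: iterable of (const, x, successes, trials)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for c, x, y, n in records:
            p = 1.0 / (1.0 + math.exp(-(b0 * c + b1 * x)))
            resid = y - n * p
            w = n * p * (1.0 - p)
            g0 += c * resid
            g1 += x * resid
            h00 += w * c * c
            h01 += w * c * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    # Standard error of b1 from the information matrix at the final iteration.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

# Hypothetical sparse data: 4 cases among 5 exposed, 10 cases among 50 unexposed.
actual = [(1.0, 0.0, 10.0, 50.0), (1.0, 1.0, 4.0, 5.0)]

_, b_ml, se_ml = fit_logistic(actual)
_, b_da, se_da = fit_logistic(actual + [prior_record(prior_var=0.5)])
print(f"ML:        OR = {math.exp(b_ml):.2f}, SE(log OR) = {se_ml:.2f}")
print(f"Augmented: OR = {math.exp(b_da):.2f}, SE(log OR) = {se_da:.2f}")
```

The same trick works in any package that fits grouped or weighted logistic regression: append the pseudo-record to the data set and fit as usual.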

An overview of DA with software illustrations for logistic risk, log-linear rate, and proportional hazards models is
Sullivan SG, Greenland S (2013) Bayesian regression in SAS software. Int J Epidemiol 42:308-17. doi:10.1093/ije/dys213
Important Erratum (2014) Int J Epidemiol 43:1667–1668.

Its use in bias analysis is illustrated in the Greenland 2009 paper cited above.

In my view, an important advantage of DA is this: By translating priors into equivalent data, it shows how much information the prior distribution is adding to the actual-data information. One can then see that many “skeptical” priors proposed and used in the medical literature are equivalent to far more information than is actually available in the application, and how such priors overwhelm the actual-data information. Thus the DA approach alerts one to “tune down” such priors to contextually reasonable levels.
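A minimal sketch of that translation, assuming a normal prior on the log odds ratio specified through its 95% limits, and using the approximation that a prior with variance v is worth about 2/v cases and 2/v noncases. The 2x2 table and the function names are hypothetical; the data information is approximated by the inverse of the usual Woolf variance.

```python
import math
from statistics import NormalDist

def prior_as_data(or_lower, or_upper):
    """Convert 95% prior limits for an odds ratio into the prior variance of
    log(OR) and the approximate number of pseudo-cases (= pseudo-noncases)."""
    z = NormalDist().inv_cdf(0.975)
    prior_var = ((math.log(or_upper) - math.log(or_lower)) / (2.0 * z)) ** 2
    pseudo_cases = 2.0 / prior_var
    return prior_var, pseudo_cases

def table_information(a, b, c, d):
    """Approximate information about log(OR) in a 2x2 table
    (inverse of the Woolf variance 1/a + 1/b + 1/c + 1/d)."""
    return 1.0 / (1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)

# Hypothetical actual data: exposed 4 cases / 1 noncase, unexposed 10 / 40.
data_info = table_information(4, 1, 10, 40)

for limits in [(0.25, 4.0), (0.5, 2.0)]:
    var, cases = prior_as_data(*limits)
    ratio = (1.0 / var) / data_info
    print(f"95% prior limits {limits}: about {cases:.0f} cases and {cases:.0f} noncases, "
          f"prior/data information ratio {ratio:.1f}")
```

With sparse data like this, even a prior that looks modest on the odds-ratio scale can carry several times the information in the actual data, which is the warning sign to look for.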

DA also serves as a cross-check for both ordinary maximum-likelihood and posterior-sampling analyses, which can suffer convergence problems that go undetected by software and users (as illustrated in the Greenland-Mansournia-Altman 2016 paper cited above).


Forgot to add this paper providing logistic data augmentation for Stata:
Discacciati A, Orsini N, Greenland S. Approximate Bayesian logistic regression via penalized likelihood by data augmentation. Stata J 2015;15:712-36

and the Greenland-Mansournia-Altman 2016 paper cited above links to R code for the same.


Since we are discussing papers by @Sander, I’d like to add a commentary he wrote for Statistical Science that discusses the application of both “Frequentist” and “Bayesian” concepts in a reasonable way. A number of sections touch directly upon the question asked in the original post.

Greenland, S. (2010). Comment: The need for syncretism in applied statistics. Statistical Science, 25(2), 158-161.

He discusses how hierarchical models provide a framework for integrating the considerations of prior information that motivate a particular investigation, with the methodological concerns for study designs that have the characteristics that a frequentist would accept (proper coverage of interval estimates or prediction intervals in repeated use).

The topic of hierarchical models is very important; I’d claim all of Sander’s insights and experience have been formalized by the engineers, computer scientists, and logicians who work in the area known as Subjective logic, which is an adaptation of Dempster-Shafer belief functions so they are compatible with the probability calculus. Applications include diverse areas from electronic sensor fusion to intelligence analysis.

Oren, N., Norman, T. J., & Preece, A. (2007). Subjective logic and arguing with evidence. Artificial Intelligence, 171(10-15), 838-854.

From the link on Subjective Logic:

Subjective logic is a calculus for probabilities expressed with degrees of epistemic uncertainty where sources of evidence can have varying degrees of trustworthiness. In general, subjective logic is suitable for modeling and analysing situations where information sources are relatively unreliable and the information they provide is expressed with degrees of uncertainty. For example, it can be used for modeling subjective trust networks and subjective Bayesian networks, and for intelligence analysis.

The entire literature on statistics makes it seem as if there are distinct techniques of “Bayesian” statistics and “Frequentist” statistics. As Sander’s papers demonstrate, this is an illusion for any practical problem.

After much reading on the topic, I’d hope the “Frequentist vs Bayesian” nomenclature disappears and the discussion focuses instead on the real area of disagreement: how much weight to place on prior (often qualitative, imprecise, but critically relevant) information relative to the information obtained from a data collection protocol.

Bayesians and “Frequentists” disagree on how prior information should be used to derive a data collection protocol, with Frequentist designs often formally ignoring prior information.

This difference of opinion between proponents of a model, and skeptics, can be given a game theoretic interpretation, which also happens to be an algorithm to derive a decision procedure that will settle the dispute.

Game Theoretic Probability and Statistics

A complementary method of bridging the gap between Bayesians and “Frequentists” is the Objective Bayes perspective; most of the work I am familiar with has been done by James Berger and Jose Bernardo. Papers can be found under the term Reference Bayes or Reference Analysis.

A reference posterior distribution is an effort to obtain a posterior distribution, using mathematics alone, that has interpretations that would satisfy a Frequentist (who is concerned about the frequency coverage of the interval generated by the data collection procedure) and mostly satisfy a Bayesian (who is concerned about the information contained in the experiment and how it relates to the scientific questions).

A more precise, technical definition of a reference prior (used to produce an Objective Bayes posterior) is:

Reference analysis uses information-theoretical concepts to make precise the idea of an objective prior which should be maximally dominated by the data, in the sense of maximizing the missing information (to be precisely defined later) about the parameter.

Sander has this to say about Reference Bayes:

Accuracy of computation becomes secondary to prior specification, which is too often neglected under the rubric of “objective Bayes” (a.k.a. “please don’t bother me with the science” Bayes).

My question to Sander:
The criticism of Objective Bayes seems to assume that an analysis based upon a reference posterior will naively be interpreted as definitive.

This isn’t necessarily so. Why can’t reference posteriors be used as inputs to the probabilistic bias analysis you described in the numerous papers cited? The cognitive science literature done in the context of training intelligence analysts strongly suggests expressing uncertainty so that it obeys the probability calculus improves reasoning ability. Why can’t there be room for both an “Objective Bayes Posterior” and a “Subjective Bayesian” probabilistic bias analysis? As Senn has noted: “Bayesians are free to disagree with each other.”

Karvetski, C. W., Mandel, D. R., & Irwin, D. (2020). Improving probability judgment in intelligence analysis: From structured analysis to statistical aggregation. Risk Analysis, 40(5), 1040-1057.

As I had pointed out in other threads, it is often useful to see a consensus predictive distribution to determine the value of your own subjective knowledge. This occurs routinely in financial valuation contexts.


I find DA to be intuitive and helpful, but hard to do for more general prior specification. With Stan it is easy to specify priors on advanced things such as how nonlinear a relationship is for a predictor in an interval. This is done by specifying priors on contrasts. That would be hard to translate to data. Examples are here.


Frank: It’s not hard to translate most any credible prior into data if you know the general trick of how to transform a prior distribution into a pseudo-data record. Of course there is no computational need for that if you are fine with (say) Stan. The problem is knowing how much information your prior is adding, and justifying that information: Expressing a prior as data gauges the information the prior is adding in rather stark terms; it often (I find usually) reveals that what seem like intuitively reasonable priors correspond to far more information than is actually justifiable based on previous studies.

But then, when used for individualized predictions, standard sampling models used by almost everyone (including me) add enormous amounts of questionable information about dose-response and interactions. For example, by leaving out second-order (quadratic) terms such as products (as done in most medical reports) and using the resulting first-order model for inferences on individual risks, we are assuming all those omitted terms are zero. That corresponds to a mean-zero, zero-variance joint prior on the omitted terms, which is an infinite amount of information in an inverse-variance (Fisherian) sense, and can lead to oversmoothed (highly biased and overly precise) predictions for individuals not near average in all respects.
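The zero-variance point can be written out in one line. As a sketch, assume a model with retained first-order coefficients \beta and one omitted second-order coefficient \gamma, and give \gamma a normal(0, \tau^2) prior (the symbols are introduced here only for illustration):

```latex
% Penalized log-likelihood with a mean-zero normal prior on the second-order term
\ell_{\mathrm{pen}}(\beta,\gamma) \;=\; \ell(\beta,\gamma) \;-\; \frac{\gamma^{2}}{2\tau^{2}},
\qquad
\text{prior information on } \gamma \;=\; \frac{1}{\tau^{2}} .
% Omitting the term is the limit \tau^2 \to 0:
% the penalty forces \gamma = 0 and the prior information 1/\tau^2 \to \infty.
% Shrinkage uses a small but nonzero \tau^2, i.e. a large but finite amount of information.
```

Seen this way, dropping a product term and shrinking it heavily differ only in the value of \tau^2, and only the second choice admits that the term might not be exactly zero.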

I think such problems are among the reasons why the machine-learning literature favors algorithmic predictive modeling, growing the model from the data using an algorithm unrestricted by standard (semi)parametric forms. As usual though, there is no free lunch: To display their advantages, common algorithmic modeling methods require far more data than most med studies have, and don’t produce a transparent equation relating predictors to predictions. That’s why we fall back on narrow model forms that risk drastically oversmoothing the predictions. There’s been a lot written on dealing with that problem, such as including all second-order (as well as first-order) terms, but with heavy shrinkage of the second-order terms to keep the effective degrees of freedom down. That is easily done with data augmentation, where each second-order term gets its own simple pseudo-record as described on p. 199 of
Greenland S (2007). Bayesian perspectives for epidemiologic research. II. Regression analysis. Int J Epidemiol 36:195-202


Thanks R_cubed for the comments and citations. Subjective logics have been a big area in AI/CS and philosophy for at least the past half century. I think Ramsey and DeFinetti viewed probability as the normative form for subjective logic, which would make Bayesian bias analysis a subfield of subjective logic.

Unlike bias analysis though, the more general formalisms don’t appear to me to have made serious inroads into applied statistics (as opposed to AI/ML systems, where by the 1980s there were things like Pearl’s Bayes nets), perhaps because they don’t seem to lend themselves as easily to familiar software formats.

Not that bias analysis has made huge inroads. Its prototypical forms appeared in the 1940s-1970s (e.g., Berkson, Cornfield, Leamer) - exactly the era in which mindless significance testing instead came to overwhelm research output. And bias analysis still seems far removed from the mass of teaching and research. That is unsurprising given how even today, few researchers seem to grasp why a P-value is not “the probability that chance alone produced the association”, and few evince an accurate understanding of either frequentist or Bayesian methodology - as displayed by their misinterpreting P-values as if they were justified posterior credibilities and CIs as if they were justified credibility intervals.

You remarked that my criticism of reference (“objective”) Bayes seems to assume that an analysis based upon a reference posterior will be naively interpreted as definitive. That’s not my assumption so much as my observation from my experience. The same problem happens with “classical” statistical methods; it’s just that the Bayesian applications I see are no better.

For hypothetical ideal users (and maybe us here, but not most users) Fisherian-frequentist and reference-Bayesian methods each supply summaries of data information filtered through their assumed models, in the form of interval estimates. In typical med research their summaries are almost identical numerically, and reference-Bayes estimates can be viewed as bias-adjusted frequentist estimates (e.g., see Firth, Biometrika 1993).

The problem I see is that reference-Bayes results will get misinterpreted as supplying justified credibilities when they are no such thing in reality. The very existence of medical trials on a relation demonstrates there must be a lot of background (prior) information - enough to have gotten the trials approved and funded. This background voids the reference-Bayes output as supplying a scientific Bayesian inference, revealing it instead to be a model-based study-information summary of Bayesian instead of frequentist probability form. As long as these distinctions are made clear, I’m fine with reference Bayes.


The flaw in this interpretation of an objective Bayes estimate and distribution is easier to demonstrate in a Subjective logic framework, as it has explicit notations for first-order (aleatory) probabilities (i.e., frequencies), and uncertainty about the accuracy of those probabilities (i.e., epistemic probabilities).

If one were to bet a fraction of one’s wealth (at the proportion computed from the Kelly criterion) on a model that underestimates uncertainty, risk of ruin approaches 1 as the number of bets increases. Anyone familiar with advantage gambling or the financial markets knows that no one bets the actual Kelly criterion amount. It is often a small fraction of it for this very reason.
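A small sketch of why overconfidence is so costly here, assuming repeated even-money bets where the Kelly fraction is f* = 2p - 1 and the long-run growth rate is the expected log return per bet. The win probabilities are hypothetical: the true one is 0.52, while the overconfident model claims 0.60.

```python
import math

def kelly_fraction(p_win):
    """Kelly fraction for an even-money bet: f* = 2p - 1 (0 if there is no edge)."""
    return max(0.0, 2.0 * p_win - 1.0)

def log_growth(fraction, p_true):
    """Expected log growth of wealth per even-money bet when a fixed
    fraction of wealth is staked and the true win probability is p_true."""
    return p_true * math.log(1.0 + fraction) + (1.0 - p_true) * math.log(1.0 - fraction)

p_true = 0.52    # hypothetical true win probability
p_model = 0.60   # hypothetical overconfident model

g_correct = log_growth(kelly_fraction(p_true), p_true)
g_overbet = log_growth(kelly_fraction(p_model), p_true)
g_quarter = log_growth(0.25 * kelly_fraction(p_model), p_true)

print(f"growth per bet, Kelly with true p:          {g_correct:+.5f}")
print(f"growth per bet, full Kelly with model p:    {g_overbet:+.5f}")
print(f"growth per bet, quarter Kelly with model p: {g_quarter:+.5f}")
```

Staking the full Kelly fraction implied by the overconfident model gives a negative growth rate, so wealth shrinks toward zero over many bets, while a quarter of that stake still grows. That is one way to read the habit of betting only a fraction of Kelly as an informal allowance for model uncertainty.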

This phenomenon of taking models too literally actually happened in the early stock index option markets. Prices before the 1987 crash closely matched the Black-Scholes model. Afterward, index options had an implied volatility skew, which doesn’t make sense unless you understand this is taking into account model uncertainty as well.

Almeida, C., & Freire, G. (2022). Pricing of index options in incomplete markets. Journal of Financial Economics, 144(1), 174-205.

The heterogeneity of marginal risk-neutral measures is time-varying and provides novel insights into the skew patterns of index options before and after the October 1987 global stock market crash. Although before the crash the implied volatility curve was flat, there was a reverse implied smirk indicating that OTM puts (and ITM calls) were actually cheaper, in relative terms, than ATM and ITM puts (and OTM calls). This is evidence that investors were pricing options using the Black-Scholes formula, even though this was inconsistent with the physical distribution. In fact, if the physical distribution were lognormal, the implied curve would also have been flat. In contrast, physical underlying returns indicated relatively high probabilities of a crash before October 1987, which were not incorporated by investors in the option market.


Thank you all very much for such informative responses! I am reading through the references and picking up a lot of knowledge :grinning: