How to interpret “confidence intervals” in observational studies

@ChristopherTong , I am not going to push back on what you have said, as methodologists have the right to decide what they accept or reject. However, methodologists tend to focus on whether observational studies can produce “true” or “unbiased” estimates, or whether probability theory is rigorously justified in messy data.

My concern is not this. Medicine and public health rarely operate in a vacuum of certainty. Decisions must be made in the presence of imperfect evidence, where waiting for ideal conditions would be actively harmful.

  • The real question is utility: Does the method help us make better decisions?
  • Probability math, causal inference, and approaches like target trial emulation are tools to reduce error, quantify uncertainty, and guide action, even if they do not produce “perfectly unbiased” truth.
  • Focusing on theoretical purity — “We can’t trust observational data because it’s not a randomized trial” — is neither pragmatic nor useful. It leaves clinicians and policymakers paralyzed or forced to rely on anecdote.

John Snow didn’t use probability theory, but he made a decision that saved lives, not because he achieved statistical purity but because he acted on the best available evidence in a structured, thoughtful way. Similarly, modern methods allow us to extract decision-worthy insights from observational data, even when uncertainty remains.

The key point, then, is utility for decision makers. Deciding whether a method is “perfect” is irrelevant if it leads to worse outcomes. The proper standard is “Does this improve decision-making?”, not “Does this satisfy armchair notions of statistical idealism?” That standard need not bother methodologists, but it should matter to decision makers in medicine.

Quote from Erica Thompson: “Models are not irrelevant, the answer is not to throw them away. If you come to it from this sort of skeptical position saying we need to think more carefully how we make inferences from models you could go all the way down the rabbit hole and say oh they are all terrible let us just throw them away. But I think that would be completely unjustified. We have good evidence from one end of the spectrum of relatively simple linear models that are incredibly widely successful and form the foundations of modern life and modern technology. We can work from there to find the limits of more complex models which are looking at making predictive statements in more extrapolatory domains where the underlying conditions are changing and we therefore have less ability to rely on what I call the quantitative route of escape from model land.”

1 Like

This point does not seem to impress anyone, but I keep stressing the importance of expanding the word model to include Bayesian models that contain “design weakness parameters”, which give us the proper degree of caution when interpreting results in the presence of design weaknesses such as lack of randomization or unmeasured confounders. Bias parameters that are not supported by data (e.g., when you are using historical controls with no concurrent controls) will result in maximal conservatism (wide posterior distributions for an exposure parameter). What I’m describing is, in one case, a formalization of simulating unmeasured confounders.

3 Likes

Perhaps because information and probability are complements, even if repeated sampling does not hold. The entire literature on Graphical Bayesian Trust modelling and reputation systems is fundamentally probabilistic.

This notion that probability only applies to observable frequencies is a frequentist hang-up that at one time advanced statistics but now holds scientific thinking back.

1 Like

John Snow did not rely on a single set of data; rather he used what Ehrenberg called “Many Sets of Data”, what Davey Smith would call “triangulation”, and what Freedman called “shoe leather”. This is historically how a great deal of science proceeded before the statistical revolution of the early 20th century. Since that revolution, statistical inference has been used by many to short-circuit this process of scientific evidence-building, though many (Box, Freedman, etc.) tried to remind statisticians otherwise.

I agree that the real question is utility, and taking UIs reported in observational studies too seriously can (as in the cases of medical reversal cited above, or those noted by @ESMD) be itself actively harmful. A good investigator, well read in the history of science (or at least the specific science in which they work) develops a sense of when the quality and quantity of evidence is sufficient to make a decision (as Freedman said, there is no objective rule for this) and there may well be disagreement among sincere investigators examining the same set of evidence. It is our job to make sure that they properly understand the profound limits of UIs especially in the setting of observational studies. One of these limits is the lack of (frequentist) interpretability in this setting. If the causal inference folks have a different perspective, I am desperate to learn about it!

3 Likes

Excellent! Also from Chapter 1 of her book:

Though Model Land is easy to enter, it is not so easy to leave. Having constructed a beautiful, internally consistent model and a set of analysis methods that describe the model in detail, it can be emotionally difficult to acknowledge that the initial assumptions on which the whole thing is built are not literally true. This is why so many reports and academic papers about models either make and forget their assumptions, or test them only in a perfunctory way. Placing a chart of model output next to a picture of real observations and stating that they look very similar in all important respects is a statement of faith, not an evaluation of model performance, and any inferences based on that model are still in Model Land.

One of the points of this thread is that the assumption that a probability model may be relevant has greater plausibility when a mechanism similar to those used in games of chance – random allocation or random selection – is built into the study design. This is missing in observational studies, so the assumption (at least for frequentists) makes interpretation very flimsy.

Pragmatically, I see 3 courses of action, two of which I outlined in post 3, and the third is what @f2harrell proposed. Either don’t report UIs, or caveat the heck out of them when reported (@Pavlos_Msaouel’s formulation of the caveat seems like a good one), or build into the model up front a procedure for injecting a more humble accounting of model uncertainty, which will result in UIs broader than their frequentist counterparts, appropriately so.

4 Likes

Of course there is a 4th option I already mentioned, one I used many times as someone who worked directly with decision makers: we simply don’t know enough to make a decision at this point. Let’s get more and/or different data to help us get there.

4 Likes

Have you ever heard of Gary Klein’s “pre-mortem” method for decision making? Within the group, tell everyone to get out a crystal ball and imagine it’s a year after the decision has been made and it turned out to be a complete disaster. What information or ideas do you have now that indicate there is some probability this decision could go wrong? It allows everyone in the group to voice their reservations and concerns, whether based on available information or the lack of it. Normally they would not speak up, because once a decision gains momentum the critics often feel it’s not safe or proper to voice their concerns. I’ve never used it, but it sounds reasonable. Perhaps a method for de-biasing.

5 Likes

Yup. Every experiment (including clinical trials) designed by our group goes through a premortem deliberation. We strongly believe that more information can potentially be gained when we are wrong – if prepared for it. Less formally, this is also being done in clinical decision-making.

4 Likes

Frank, I am not sure what you mean by design weakness parameters, but surely that is not the domain of Bayesian models alone; I have done bias adjustment in frequentist models of meta-analysis. I will clarify (for clinicians, in clinical language) why I think Bayesian models fare no better than frequentist models with observational data, and you can tell us what you think.

Clinicians are often told that Bayesian analysis offers a fundamentally different approach to interpreting evidence compared to traditional statistics. It is important to understand, however, where the difference truly lies. At their core, p-values, uncertainty intervals (UI), and Bayes factors are all transformations of the same underlying test statistic, usually a z-score derived from the data. A small p-value, a UI that excludes the null, and a large Bayes factor in favor of an effect are all different ways of expressing the same signal in the data. This is why, mathematically, a Bayes factor is often simply a re-expression of what a p-value already tells us: the observed data would be unlikely if there were truly no effect.

Consider a statin trial as an example. Suppose the odds ratio for preventing cardiovascular events is 1.5, with a 95% UI of 1.2 to 1.9. The UI excludes the null, the p-value is very small, and the z-score translates into a Bayes factor of roughly 400:1 in favor of benefit. All three measures—p-value, UI, and Bayes factor—are reporting the same underlying information: the observed effect is far more compatible with a treatment effect than with no effect. In essence, the mathematical content is the same; only the format or scaling differs.
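As a numerical sketch of the claim that these quantities are transformations of the same z-score (my own illustration, using a normal approximation on the log odds ratio scale; exp(z²/2) is the well-known upper bound on the Bayes factor against the point null, which is presumably where a figure near 400 comes from):

```python
from math import erf, exp, log, sqrt

# Hypothetical statin example from the text: OR 1.5, 95% UI 1.2 to 1.9
or_hat, lo, hi = 1.5, 1.2, 1.9

# Normal approximation on the log scale
se = (log(hi) - log(lo)) / (2 * 1.96)  # standard error of log OR
z = log(or_hat) / se                   # z-score, roughly 3.46

# Two-sided p-value from the standard normal CDF
p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Upper bound on the Bayes factor against the null: exp(z**2 / 2)
bf_bound = exp(z ** 2 / 2)

print(f"z = {z:.2f}, p = {p:.5f}, BF bound ≈ {bf_bound:.0f}")
```

Note that p, the UI, and the BF bound are all computed from the same z; only the scaling differs.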

The true value of Bayesian analysis emerges when we incorporate a prior, representing our existing knowledge or beliefs about the effect before seeing the new data. Suppose we believe, based on previous trials and biological plausibility, that statins have a 40% chance of benefit in this population. Multiplying the corresponding prior odds by a Bayes factor of 400 produces a posterior probability near 100%, indicating high confidence in benefit. If, alternatively, we were skeptical and believed there was only a 0.1% chance of benefit, the same Bayes factor shifts our posterior only to around 30%. Crucially, the data, the p-value, the UI, and the Bayes factor are identical in both cases. The difference in conclusion arises entirely from the prior.
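The update just described is simple arithmetic on odds. A minimal sketch, using the hypothetical 40% and 0.1% priors and the BF of 400 from the text:

```python
def posterior_prob(prior_prob: float, bayes_factor: float) -> float:
    """Posterior probability via posterior odds = prior odds * Bayes factor."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1 + post_odds)

bf = 400  # Bayes factor in favor of benefit, as in the text

print(posterior_prob(0.40, bf))   # ≈ 0.996, near 100%
print(posterior_prob(0.001, bf))  # ≈ 0.286, around 30%
```

The same data and the same BF, with the conclusion driven entirely by the prior.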

This leads to an important point about observational studies. Critics like @ChristopherTong often argue that observational study UIs are “not interpretable” because systematic errors and confounding can make the intervals misleading. This critique does not disappear with Bayesian analysis. While Bayes factors and posterior probabilities can quantify uncertainty and combine prior knowledge with observed data, they cannot magically correct for observational study bias that was not dealt with. A misleading prior or an observational study affected by systematic error can produce a precise posterior that is confidently wrong. In other words, Bayesian analysis transforms p-values and UIs into probabilities conditioned on the prior, but it cannot make untrustworthy observational data inherently trustworthy.

Stephen Senn and others emphasize that the strength of Bayesian analysis is not in mathematically transforming test statistics—it is in forcing us to make explicit what we already believe and how we weigh new evidence against that belief. The posterior probability answers the question clinicians care about: given what was known before and what the new study shows, how likely is it that the treatment truly works? But the quality of that answer depends critically on both the prior and the trustworthiness of the data; therefore, if you cannot trust the UI, you cannot trust the Bayes factor derived from it.

NB: The BF of 400 here is an extreme value; it may be smaller depending on how we compute it.

This is an interesting idea. The problem with it is that it’s the patient who makes the final decision about which treatments he will accept and reject, not the clinician. Unfortunately, the observational study hype machine frequently drives a wedge between clinicians and their patients. Even if the clinician knows better than to allow a half-baked study finding to influence his therapeutic recommendations (because his training allows him to weigh, much more accurately than the patient, the medical pros/cons of a therapeutic decision), the patient is perfectly within his rights to allow the study he’s heard about on the news to guide his decision. And the result, potentially, can be disastrous.

Hypothetical Clinical Scenario:

A 50 year old man goes to see his family physician reporting severe acid reflux for the past couple of years. He’s been managing it, to date, with TUMS, with only modest relief. He’s exhausted, waking several times through the night because of his GERD. On history, he has no “red flag” symptoms (e.g., no weight loss/dysphagia/odynophagia/melena). However, he smokes 1/2 pack per day and has 15 alcoholic drinks per week. His family physician recommends that he start taking a PPI (in addition to advising him to reduce his alcohol intake, quit smoking, lose weight, and elevate the head of his bed). He agrees to start taking a PPI and says he’ll “work on” the other recommendations.

His symptoms remain under good control for the next 2-3 years, at which point he returns to see the physician, having just read, through his online news feed, about an “association,” identified in an observational study, between PPIs and dementia. He has neither quit smoking nor reduced his alcohol intake. He has gained, rather than lost, weight. This was the first line of the “Conclusions” portion of the study abstract:

https://jamanetwork.com/journals/jamaneurology/fullarticle/2487379

“Conclusions and Relevance The avoidance of PPI medication may prevent the development of dementia.”

Since he watched his mother struggle with, then pass away from, Alzheimer’s disease at age 85, he is terrified of developing dementia himself. In spite of his physician emphasizing that severe, prolonged untreated reflux can also have serious health consequences (including esophageal stricture, development of Barrett’s esophagus, and potential progression to esophageal cancer- indeed, the physician has lost several young-ish patients over the years to esophageal cancer), the patient decides to stop his PPI.

Five years later, now at age 58, the patient presents with several months of progressive solid food dysphagia and weight loss. Upper endoscopy identifies advanced esophageal cancer. The patient succumbs to his disease 6 months later.

Question: Knowing the inherent limitations of observational studies, how might the physician feel, looking back, about the widely-publicized observational study that caused his patient to abandon his PPI?

1 Like

I’ll have to disagree with all of this. You are envisioning a fixed sample size single look at the data situation. Correspondence between methods is lost once you get to sequential testing or adaptive clinical trials. But even in the simple fixed sample size one look situation the different methods are completely differently calibrated even though they may be monotonically related. This has profound impact on how decisions are made.

I can’t figure out your comment on how frequentist methods can deal with special parameters such as bias. In a Bayesian model, for example, you might have a parameter that is the amount of bias in observational data that you are pooling with experimental data. If you can’t put a somewhat sharp prior on that parameter, the observational data will be properly ignored. The sharpness of the prior in the bias parameter determines the applicability of the observational data. How do you handle that in the frequentist world?
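To make the bias-parameter idea concrete, here is a toy normal-normal sketch (my own illustration, not an actual analysis model: the observational estimate is offset by a bias b ~ N(0, tau²), and integrating b out simply inflates the observational variance by tau², so a vague bias prior makes the observational data carry almost no weight):

```python
def pooled(theta_rct, se_rct, theta_obs, se_obs, tau_bias):
    """Precision-weighted pooling of an RCT estimate with an observational
    estimate whose additive bias parameter has prior SD tau_bias.
    Integrating out the bias inflates the observational variance by tau_bias**2."""
    w_rct = 1 / se_rct**2
    w_obs = 1 / (se_obs**2 + tau_bias**2)
    est = (w_rct * theta_rct + w_obs * theta_obs) / (w_rct + w_obs)
    se = (w_rct + w_obs) ** -0.5
    return est, se, w_obs / (w_rct + w_obs)  # estimate, SE, obs share of weight

# Hypothetical numbers: small RCT (effect 0.2), large observational study (effect 0.5)
for tau in (0.0, 0.1, 0.5, 2.0):
    est, se, share = pooled(0.2, 0.15, 0.5, 0.05, tau)
    print(f"tau={tau:3.1f}  pooled={est:.3f}  se={se:.3f}  obs weight={share:.2f}")
```

With a sharp bias prior (small tau) the precise observational study dominates; with a vague one it is, as I said, properly ignored.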

I was unaware of the specific notion of premortem trial assessment, but it is compatible with the point I’ve made elsewhere that much of what pollutes published research stems from problems with study design and execution, an issue that fails to receive the attention it deserves while the debates over p-values, null hypothesis testing, Frequentist-vs-Bayesian-vs-Likelihoodist, etc., eat up so much oxygen.

3 Likes

Trying to convey the stronger internal validity of RCTs vs. observational studies would be the obvious solution, but this nuance can be lost even when I speak to other (nonclinical) scientists, unless they are patient enough to hear me out and haven’t yet been lost to their own confirmation bias.

Aside from this, I wish these quotes from Clinical Trials: A Methodologic Perspective (by Piantadosi) were true and more widely accepted:

Medicine is a conservative science and behavior usually does not change on the basis of one study.

and

Readers of clinical trials tend to protect themselves by reserving final judgment until findings have been verified independently or assimilated with other knowledge.

5 Likes

I nailed down my spiel for patients many years ago. I use it regularly to explain why they shouldn’t abandon important therapies based on hyped observational studies touting weak harm signals with uncertainty intervals that don’t quite cross the null. If I put in enough time to explain things, most patients will understand the reasoning. But it’s all so unnecessary- researchers should know better.

My own “real-world evidence” re the harms from over-interpreted uncertainty intervals from observational studies:

  • Severe patient anxiety;
  • Patients abandoning therapies that can prevent important disease without first consulting their physician;
  • Morbidity from abandoning therapies which were providing important symptom relief and improved quality of life;
  • Incalculable cost to my publicly-funded healthcare system from medical appointments required to address the latest media scare story.

The harm caused by prematurely publicizing research findings that aren’t ready for clinical “prime time” can only be counteracted by clinicians who have the time and patience to explain the pros/cons of various research methods to their patients. Many likely just throw in the towel after a while, saying “if the patient trusts what he hears from some news outlet more than what I’m advising, then that’s his problem…”

3 Likes

I can give an example. If we start with the existing evidence, say five studies on the same question, a conventional synthesis accounts for random error alone (inverse variance weights). We can bring in systematic error through bias assessment, but since a bias assessment cannot be translated into a direction or magnitude of bias, the logical thing to do is to create a relative ranking of the studies (from 1 downward, where 1 is the top-ranked study); a simple way to rank is by the number of safeguards in the design and execution of each study. Once done, the inverse variance weights can be redistributed across the studies based on rank. The pool of inverse variance weights does not change, but the individual weights do: if a large study is also ranked high, nothing changes, but if it is ranked low, a change happens. Bias adjustment now happens without needing to know the magnitude or direction of the bias induced by the absence of safeguards; the implicit assumption is that random and systematic error are equally important. These redistributed weights cannot, however, be used in error estimation (the UI of the synthesis), so that would need a quasi-likelihood approach.
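In code, one way to implement this redistribution (an illustrative reading of the rule, not a published algorithm: the pool of inverse variance weights is sorted and handed out by rank, so a large study keeps its weight only if it is also highly ranked):

```python
def redistribute_weights(variances, safeguard_ranks):
    """Reassign inverse variance weights by quality rank.

    One reading of the proposal: sort the pool of IV weights in descending
    order and give the largest weight to the study ranked 1 (most safeguards),
    the next largest to rank 2, and so on. The total pool is unchanged; a
    large study keeps its weight only if it is also highly ranked.
    """
    iv_weights = [1 / v for v in variances]
    sorted_pool = sorted(iv_weights, reverse=True)
    # rank r (1 = best) receives the r-th largest weight from the pool
    return [sorted_pool[r - 1] for r in safeguard_ranks]

variances = [0.01, 0.04, 0.09]  # study 1 is the largest (IV weight 100)
ranks     = [3, 1, 2]           # ...but has the fewest safeguards
print(redistribute_weights(variances, ranks))
```

Here the largest study, ranked last on safeguards, surrenders its dominant weight to the top-ranked study while the total pool of weight is preserved.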

Note that random effects models do not achieve bias adjustment, as their redistribution is in one direction only (from larger to smaller studies), and the aim of the originators was to correct error estimation, at which the model fails miserably.

I believe I heard a quote that “is associated with” is the most dangerous phrase in medical research.

1 Like

In my opinion there is nothing appealing about this approach. It is very subjective and may not work when there is only one previous study. This is a far cry from what a Bayesian bias model is able to do IMHO.

I agree there are interpretational and utilitarian benefits of the Bayesian approach but we get into similar subjectivity when we talk about observational data and bias adjustment as we need to make assumptions regardless of which path we choose.

Fair enough, but I observe that the assumptions in a Bayesian model stare you in the face more directly, and in many cases the assumptions are concentrated into one item: the variance of the prior for the bias parameter. As an example, suppose a sponsor wanted to use historical controls to compare with a new treatment, and reviewers are concerned about biased enrollment, biased outcome assessment, secular trends, and covariate shifts in the historical controls. All of those effects can be rolled into a single bias parameter.
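As a toy sketch of how that single parameter behaves (a normal-normal illustration under flat priors, not an actual trial model: the historical control mean is assumed offset by b ~ N(0, tau²), with tau chosen large enough to cover all of those concerns combined):

```python
def effect_posterior_sd(se_treat, se_hist, tau_bias):
    """Posterior SD of (treatment mean - control mean) when the control arm
    is historical and its estimate carries an additive bias b ~ N(0, tau_bias**2).
    Under flat priors the bias simply adds tau_bias**2 to the historical variance."""
    return (se_treat**2 + se_hist**2 + tau_bias**2) ** 0.5

# Hypothetical SEs; tau_bias rolls enrollment, outcome-assessment,
# secular-trend, and covariate-shift concerns into one number
for tau in (0.0, 0.2, 1.0):
    print(f"tau={tau:.1f}  posterior SD of effect = {effect_posterior_sd(0.1, 0.1, tau):.3f}")
```

The less sharply reviewers can bound the combined bias, the wider the posterior for the treatment effect, which is exactly the humility the design deserves.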

3 Likes