How to interpret “confidence intervals” in observational studies

Hi Huw

Thanks for your input and for trying to translate these very tricky ideas into layman’s terms.

After all this back-and-forth, I’m starting to think that the only way to make sense of observational study “confidence intervals” is by focusing on what Chris seems to be saying in post #10. Here’s the view I’ve arrived at (maybe wrong, but it’s the best I can do):

  • These intervals, meant to convey the degree of “random error” inherent in a study’s result, are “false” in the sense that their specific boundaries and width hinge on layers of assumptions that are usually not justifiable, given that neither true random sampling from the underlying population NOR any random allocation contributes to their generation;
  • If we are being extremely charitable, we might find some value in these intervals if we consider them to be “best-case scenarios” with regard to the degree of uncertainty. In other words, we can view them as a crude representation of the “minimum” degree of uncertainty that might apply if we had actually been able to perform random sampling from a target population in order to generate them;
  • The main reason these intervals are so immensely problematic and pernicious is that, for many decades, researchers, journalists, and the public have NOT been viewing them as described in bullet point #2 above. And now it’s too late; nobody seems to be able to put the horse back in the barn;
  • So how did we get here? At some point in history, these intervals, and whether or not they contained the “null,” came to be used as a filter to decide whether or not a study deserved publication. Studies with intervals that crossed the null (results that did NOT achieve “statistical significance,” i.e., p>0.05) were less likely to get published than those with intervals that excluded the null (i.e., achieved p<0.05). In turn, this filtering practice had two disastrous effects: 1) it caused researchers to bend over backwards, often in highly damaging ways, to generate intervals that don’t cross the null (e.g., multiple testing/garden of forking paths/HARKing); and 2) it caused researchers to propagate (loudly, via the media) the idea that intervals that don’t cross the null represent important scientific “discoveries.”
  • I’m not sure about the extent to which this horrible “inversion” in the interpretation of uncertainty intervals was rooted in widespread ignorance about their inferential limitations (it seems like there are many layers of assumptions involved in their generation, which many seem all too happy to simply ignore) or simple laziness (the desire to substitute “easy” work like dredging administrative databases for “hard” work involving painstaking triangulation of multiple lines of evidence) or perverse incentives (for publication and therefore career advancement). Arguably, the etiology of our current-day mess involves some combination of these three factors. The result of this 180-degree-distortion was the gradual but ultimately pervasive loss, over time, of researchers’ and research consumers’ understanding of the limitations inherent in the interpretation of uncertainty intervals;
  • To summarize, there is a widespread 180-degree-distorted interpretation, among researchers and research consumers, of the uncertainty intervals presented in observational studies. The distortion began many decades ago and has been constantly reinforced over the years, by various forces, to the point where, possibly, only a small minority of researchers active today appreciate their limitations. Instead of interpreting them in a very crude and maximally conservative way (their just-barely-defensible interpretation), most researchers and readers now interpret them in a falsely precise and maximally liberal way (a completely indefensible interpretation). Specifically, instead of viewing them as a crude representation of the bare minimum uncertainty in a study’s result (a “conservative” assessment/underestimate of the degree of uncertainty, which would be the best possible result IF we were to completely suspend our judgment regarding multiple unrealistic underlying assumptions), EVERYONE started viewing these intervals as an indicator of the maximum uncertainty around a result (a “liberal” assessment of the degree of uncertainty). As a result, we have elevated the role of the single study in scientific decision-making far beyond its actual value with regard to scientific inference.
3 Likes

What do you think the possible sources of untrustworthiness might be (e.g., inaccurate methods description?) leading to that bias, and at what point in the process should one discuss it, try to identify it, and perhaps attempt to compensate for it?

My understanding is that there are many types of randomness. The random sampling model is applied to biological measurements that are subject to technical and biological variation and can be described as a likelihood or probability distribution. This applies to observational studies and measurements of diagnostic tests, RCTs, etc. Applying the random selection model depends on consistency of the measuring methods so that the only source of variation is stochastic. Failure to be consistent will introduce a source of bias.

I understand ‘randomisation’ in an RCT to be different: it is an active process designed to create exchangeable groups to avoid bias from factors other than the effects of the ‘treatment’ compared to control. In an observational study, such bias has to be shown to be improbable by some unreliable rationale. Another issue is ‘random’ selection of trial subjects from the population to create a ‘representative’ group, which is difficult if not impossible to achieve. An inability to do this does not prevent assessment of treatment efficacy or of the influence of disease severity on absolute risk reduction from treatment.

Yes, this is my understanding too. But again, I don’t think that this is the crux of the matter being discussed. Rather, the key question is the extent to which an observational study’s uncertainty interval, as usually constructed, is interpretable or not.

Suhail seems to think that my skepticism is founded in a failure to thoroughly understand terminology or the impact of the different types of error (e.g., “model-based,” “probability”…). But, while he is probably correct to infer that I have a limited understanding of the content, I think he’s off base in concluding that this misunderstanding is the reason for my skepticism. Stipulating multiple possible failures of understanding on my part, I don’t think that it’s unreasonable for me, as a novice, to question any researcher’s ongoing justification of the validity of these intervals with regard to statistical inference, given that other researchers (e.g., Chris) seem to put so little stock in them…

My hazily-formed impression is that it is researchers’ ongoing promotion of the mistaken notion that there is both an interpretable and precise interpretation of these intervals (an interpretation which ends up promoting the most liberal estimation of uncertainty) that is THE WHOLE PROBLEM and why we’re in this mess in the first place.

1 Like

From enrolling the wrong persons to biased assessment of risk factors to biased unblinded assessments of outcomes, you name it. Confounding by indication is one of the bigger problems but there are often worse problems present. Think of the early digitalis studies, where dig was thought to harm patients because analysts did not account for dig being given to failing patients. In more modern times we see over and over again epidemiologists claiming some food hurts or helps you cancer-wise, failing to control for socioeconomic factors that have to do with both which foods you can afford and what kind of health care you have access to. Observational research is an industry, and tens of thousands of researchers in the US have gotten promotions doing bad research that no one will ever replicate.

3 Likes

Frank (@f2harrell) has very nicely stated the crux of this whole discussion: “trustworthiness of the center of the interval”, by which he means the target of estimation. I can therefore restate a layman’s summary as follows:

The center of the interval is the study’s best guess of the population value, and the UI (uncertainty interval) shows how much play there might be just from chance. But even if the way we got this guess is biased, there is still utility in the interval. For example:

Imagine a continuous glucose monitor in a patient with type 1 diabetes that is mis-calibrated and consistently reads 20 mg/dL too low — that’s large bias. Even so, the readings fluctuate continuously due to sensor noise — that’s random error. Knowing the amount of random variability is essential: it tells the patient or clinician how much confidence to place in a single reading or trend. For example, if the sensor shows a rise of 10 mg/dL, understanding the usual random fluctuation helps decide whether that rise is meaningful or just noise. Even with the baseline bias, assessing random error guides safe insulin adjustments and prevents over- or under-treatment.
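The glucose-monitor analogy can be made concrete with a small simulation. This is just a sketch of the two kinds of error; the 150 mg/dL true value, the 20 mg/dL bias, and the 5 mg/dL noise SD are all hypothetical numbers chosen for illustration:

```python
import random

random.seed(0)

TRUE_GLUCOSE = 150.0   # hypothetical true blood glucose, mg/dL
BIAS = -20.0           # systematic error: sensor reads 20 mg/dL too low
NOISE_SD = 5.0         # assumed random sensor noise, mg/dL

# Each reading = truth + fixed bias + fresh random noise
readings = [TRUE_GLUCOSE + BIAS + random.gauss(0, NOISE_SD) for _ in range(1000)]

mean_reading = sum(readings) / len(readings)
spread = (sum((r - mean_reading) ** 2 for r in readings) / len(readings)) ** 0.5

# The bias shifts every reading by the same amount (mean near 130, not 150),
# while the spread of the readings reflects only the random error.
print(f"mean reading: {mean_reading:.1f} mg/dL (true value {TRUE_GLUCOSE})")
print(f"spread of readings: {spread:.1f} mg/dL")
```

Note that nothing in the spread of the readings reveals the 20 mg/dL bias: that is exactly why an interval can describe random error well while saying nothing about systematic error.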

My understanding was that if the distribution from an observational study straddled zero difference between ‘treatment’ and control, then a confidence interval would be useful and reliable in discounting any causal effect and so not pursuing the idea further. Do you agree? (Perhaps not if there was a strong hypothesis for a cause and bias had been away from a positive result!) However, if there was an apparent difference between treatment and control in an observational study (e.g. the 95% confidence interval excluded zero difference) then I said that “you have a problem”. This would entail trying to reason away the issues that you summarise in the above quote and detail in your post. I suggested that the way forward would be to do a carefully designed RCT that minimised the risk of bias etc., so that the mean and SEM of the distribution of differences between treatment and control from the RCT is helpful. I’m sorry if I did not make that clear.

I think the divergence of views here is because too much importance is being placed on systematic error alone. If I have an unbiased study with huge random error, it would be as bad as a grossly biased study with little random error. There has to be a bias-variance trade-off, which is what meta-analysis (MA) aims to achieve.

Suppose a study is designed perfectly with no systematic error, so on average it estimates the true population blood sugar correctly in a group of diabetics. But the sample is small and the measurements vary a lot, so the study result jumps around — one study might suggest the average is 120 mg/dL, another might suggest 200 mg/dL. Even though both are unbiased, the huge random error makes the study result unreliable for making decisions. However, I do not know this unless I have a UI.
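This scenario is easy to simulate. In the sketch below, every study is unbiased by construction; the true mean (160 mg/dL), the between-measurement SD (60 mg/dL), and the sample size (10) are hypothetical numbers picked to make the point visible:

```python
import random

random.seed(1)

TRUE_MEAN = 160.0   # hypothetical true population mean glucose, mg/dL
SD = 60.0           # large biological/technical variability
N = 10              # small study

def one_study():
    """Run one unbiased but noisy study; return (mean, SEM)."""
    sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    mean = sum(sample) / N
    sem = (sum((x - mean) ** 2 for x in sample) / (N - 1)) ** 0.5 / N ** 0.5
    return mean, sem

# Five replications of the same unbiased study design:
# the point estimates jump around substantially.
print([round(one_study()[0]) for _ in range(5)])

# Only the uncertainty interval reveals how unreliable a single study is.
mean, sem = one_study()
print(f"single-study 95% UI: {mean - 1.96 * sem:.0f} to {mean + 1.96 * sem:.0f}")
```

The wide UI is doing its job here: it is the only output that warns you a single point estimate from this design cannot be trusted on its own.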

A better way to have framed the original question might have been, “How to interpret the absence of a formal multiple-bias analysis from an observational study report?” I would say, like an estimate from the mechanic that itemizes parts but excludes labor.


So, interpret the CI as you would any other half-truth told by a self-interested party.

2 Likes

The problem with bias is that it can work in all directions. It may cause you to miss a useful treatment effect. So there is not really any role for retrospective observational studies other than establishing predictors of outcomes for future covariate adjustment. And there is limited use for prospective observational studies in comparing treatments. One of the many problems is that “word gets out”, causing physicians to participate less in randomized studies.

2 Likes

I suppose my feeling was that the prior expectation of an observational study providing a reliable prediction of some definitive RCT result is low to begin with, so that a narrow confidence interval around zero difference would lower the probability further and be decisively discouraging (although there would also be a possibility that some later RCT could show some useful result as you say).

I find this interesting. In the absence of evidence of some strong bias in a particular direction, would it be reasonable to assume that all these various ‘random’ biases in all directions would cancel each other out and be unlikely to affect the estimated mean? However, would these unknown biasing factors in combination determine the width of the distribution about the mean?

Unfortunately as suggested in the previous post by @davidcnorrismd, one strong bias in a hidden direction could be caused by an author covertly distorting a study to get some desired result or some other unforeseeable bias. This would be a daunting ‘joker in the pack’ to catch one out despite the above considerations.

I think this is directly relevant here because Carlin and colleagues say that the true statistical model (a model that closely approximates the true data-generating process) is a myth and that we should let go of this “true model myth”. What we should be chasing is a good model, but the latter depends directly on the type of question that we are seeking to answer - descriptive, predictive or causal. “For example in the estimation of causal effects, we should aim to develop models and methods that focus first and foremost on reducing potential bias, with a key example being the need to control for confounding in observational studies. In this light, generic concerns about “good” models as reflected in traditional model-checking “diagnostics” may be a distraction, encouraging thinking that mirrors the true model myth rather than focusing on potential sources of bias with respect to the study aims.”

1 Like

People often dismiss observational studies by saying, “we can’t trust them because they’ll never give us the true model.” But this is the wrong way to think about it. The real issue isn’t whether a model is perfectly true — no model ever is — but whether we can reduce bias enough to produce evidence that’s good enough to guide decisions.

Think of it this way:

  • In causal research, the key is to lower bias as much as possible (for example, by adjusting for confounders). Even if some bias remains, reducing it substantially can turn weak, unreliable evidence into strong, actionable evidence.
  • In prediction, the question isn’t whether the model reflects reality exactly, but whether lowering bias in the inputs and assumptions makes predictions accurate enough to inform real-world choices.
  • In description, bias reduction means clearer, more trustworthy summaries of what’s happening in the data.

The obsession with the “true model” sets an impossible standard and makes people throw away valuable observational evidence. What really matters is whether the study design and analysis bring us closer to the truth than we would be otherwise — close enough to guide doctors, policymakers, or patients in making better decisions.

Example: The link between smoking and lung cancer was first demonstrated through observational studies, not randomized trials. Those studies were never free of bias — smokers and non-smokers differ in many ways. But by carefully adjusting for factors like age and occupation, researchers were able to reduce bias enough to show that the association was far too strong to be explained away. This bias reduction, not bias elimination, provided decision-worthy evidence that ultimately led to major public health action against smoking.
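The "reduce bias by adjusting" step in this example can be sketched with a toy simulation. Everything here is hypothetical (a single binary confounder, made-up risks); the point is only to show that a crude comparison mixes the exposure effect with the confounder, while stratified adjustment recovers something much closer to the truth:

```python
import random

random.seed(4)

def simulate_person():
    """Toy model: older age raises both smoking probability and cancer risk."""
    old = random.random() < 0.5
    smoker = random.random() < (0.6 if old else 0.3)
    # Multiplicative risks, so the true smoking relative risk is 3.0 in each stratum
    p_cancer = 0.02 * (2.5 if old else 1.0) * (3.0 if smoker else 1.0)
    cancer = random.random() < p_cancer
    return old, smoker, cancer

people = [simulate_person() for _ in range(200_000)]

def risk(group):
    return sum(cancer for _, _, cancer in group) / len(group)

# Crude comparison: mixes up the effects of smoking and age.
crude_rr = risk([p for p in people if p[1]]) / risk([p for p in people if not p[1]])

# Age-adjusted: compare smokers vs non-smokers within each age stratum.
stratum_rrs = []
for old in (True, False):
    smokers = [p for p in people if p[1] and p[0] == old]
    nonsmokers = [p for p in people if not p[1] and p[0] == old]
    stratum_rrs.append(risk(smokers) / risk(nonsmokers))
adjusted_rr = sum(stratum_rrs) / len(stratum_rrs)

print(f"crude relative risk:        {crude_rr:.2f}")
print(f"age-adjusted relative risk: {adjusted_rr:.2f} (true value: 3.0)")
```

The adjusted estimate is still only as good as the model: any confounder left out of the stratification (occupation, socioeconomic status, …) leaves residual bias, which is exactly why the historical smoking work needed effects "far too strong to be explained away."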

In short, the utility of observational research lies not in achieving zero bias (which is impossible), but in achieving less bias than before. That reduction, even if imperfect, is what makes the evidence “decision-worthy.”

1 Like

I think triangulation of evidence is an important concept. https://youtu.be/QhnmXORhzmM

3 Likes

Below are some salient excerpts from this classic 2016 paper (bolding is mine).

“As with the P value, the confidence interval is computed from many assumptions, the violation of which may have led to the results. Thus it is the combination of the data with the assumptions, along with the arbitrary 95 % criterion, that are needed to declare an effect size outside the interval is in some way incompatible with the observations.”

“…the 95% refers only to how often 95% confidence intervals computed from very many studies would contain the true effect if all the assumptions used to compute the intervals were correct.”

“When the model is correct, precision of statistical estimation is measured directly by confidence interval width (measured on the appropriate scale).”

“As with P values, further cautions are needed to avoid misinterpreting confidence intervals as providing sharp answers when none are warranted.”

“The above list could be expanded by reviewing the research literature. We will however now turn to direct discussion of an issue that has been receiving more attention of late, yet is still widely overlooked or interpreted too narrowly in statistical teaching and presentations: That the statistical model used to obtain the results is correct.”

“Too often, the full statistical model is treated as a simple regression or structural equation in which effects are represented by parameters denoted by Greek letters. ‘‘Model checking’’ is then limited to tests of fit or testing additional terms for the model. Yet these tests of fit themselves make further assumptions that should be seen as part of the full model. For example, all common tests and confidence intervals depend on assumptions of random selection for observation or treatment and random loss or missingness within levels of controlled covariates. These assumptions have gradually come under scrutiny via sensitivity and bias analysis [98], but such methods remain far removed from the basic statistical training given to most researchers.”

“In response, it has been argued that some misinterpretations are harmless in tightly controlled experiments on well-understood systems, where the test hypothesis may have special support from established theories (e.g., Mendelian genetics) and in which every other assumption (such as random allocation) is forced to hold by careful design and execution of the study. But it has long been asserted that the harms of statistical testing in more uncontrollable and amorphous research settings (such as social-science, health, and medical fields) have far outweighed its benefits, leading to calls for banning such tests in research reports— again with one journal banning P values as well as confidence intervals [2].”

“We further caution that confidence intervals provide only a best-case measure of the uncertainty or ambiguity left by the data, insofar as they depend on an uncertain statistical model.”
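The coverage statement in the second excerpt ("how often 95% confidence intervals ... would contain the true effect if all the assumptions ... were correct") can be checked by simulation. In the sketch below (all numbers hypothetical), intervals behave as advertised when every assumption holds by construction, and lose their advertised coverage as soon as a modest unmodelled bias is introduced:

```python
import random

random.seed(2)

TRUE_EFFECT = 0.5   # hypothetical true mean difference
SD, N = 2.0, 50     # per-observation noise and study size
SIMS = 10_000

def interval(biased):
    """Compute an approximate 95% interval from one simulated study."""
    shift = 0.5 if biased else 0.0   # assumed systematic error when biased
    sample = [random.gauss(TRUE_EFFECT + shift, SD) for _ in range(N)]
    mean = sum(sample) / N
    sem = (sum((x - mean) ** 2 for x in sample) / (N - 1)) ** 0.5 / N ** 0.5
    return mean - 1.96 * sem, mean + 1.96 * sem

def coverage_rate(biased):
    hits = sum(lo <= TRUE_EFFECT <= hi
               for lo, hi in (interval(biased) for _ in range(SIMS)))
    return hits / SIMS

coverage = coverage_rate(biased=False)
biased_coverage = coverage_rate(biased=True)

# With all assumptions satisfied, coverage is close to the nominal 95%;
# a bias comparable to the SEM destroys the advertised coverage.
print(f"coverage, assumptions hold: {coverage:.3f}")
print(f"coverage, with hidden bias: {biased_coverage:.3f}")
```

This is the "best-case measure" point in miniature: the 95% label is a property of the idealized model, not of any particular biased study.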

Goodhart’s Law when applied to uncertainty intervals: When a measure (uncertainty interval) becomes a target (excluding the null/achieving p<0.05), it ceases to be a good measure. By making p<0.05 the bar for publication, we have encouraged researchers to hunt for ways to downplay the uncertainty in their results and overstate the importance/actionability of any intervals they can generate which don’t cross the null.
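The distorting effect of this significance filter is easy to demonstrate with a toy simulation (all numbers hypothetical): when a true effect is small and only "significant" results get published, the published estimates are systematically exaggerated:

```python
import random

random.seed(3)

TRUE_EFFECT = 0.10   # small true effect
SEM = 0.15           # standard error of a typical underpowered study
STUDIES = 100_000

all_estimates, published = [], []
for _ in range(STUDIES):
    estimate = random.gauss(TRUE_EFFECT, SEM)
    all_estimates.append(estimate)
    # "Publishable" if the 95% interval excludes the null (p < 0.05)
    if abs(estimate) > 1.96 * SEM:
        published.append(estimate)

avg_all = sum(all_estimates) / len(all_estimates)
avg_pub = sum(published) / len(published)

# Conditioning on significance inflates the apparent effect severalfold.
print(f"average estimate, all studies:        {avg_all:.2f}")
print(f"average estimate, 'significant' only: {avg_pub:.2f}")
```

No individual researcher needs to cheat for this to happen; the filter alone guarantees that the published literature overstates the effect.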

Using David’s analogy, I’m not going to hire a mechanic if he tries to gaslight me into believing that a pre-job estimate including only the cost of replacement parts will translate to my final invoice (parts + labour). So why in the world are observational studies showing weak effects, with confidence intervals that happen to exclude the null, still plastered all over clinical journals, as though clinicians should somehow allow them to influence clinical decision-making? This is not good, since some clinicians, unfamiliar with the limitations of such research, will potentially allow such findings, prematurely, to affect their practice.

Observational studies have, at times, played key roles in the advancement of medicine and science (e.g., establishing the link between smoking and lung cancer or cholera and contaminated water). But causal effects this dramatic and incontrovertible are the exception, rather than the rule. In most cases, when smaller effects are being considered, it would, arguably, be much more appropriate to publish observational work related to disease etiology in basic science journals, allowing researchers, over time, to contribute collectively to a larger effort geared toward triangulation of many lines of evidence. Only THEN, following laborious triangulation and serious thought about the definitiveness of the findings and the impact of publication on patients and clinicians, should researchers consider submitting their findings to clinical journals (and/or media outlets).

3 Likes

I agree with everything statistical / epidemiological in the paper which is what we have already stated and clarified many times in this thread above (see points #2 and #3).

However the conclusion (last paragraph you quoted) is overstated. This framing risks setting an unrealistic threshold for the use of evidence in medicine. While it is true that observational studies rarely provide the kind of dramatic and incontrovertible findings that the smoking–lung cancer or cholera–water examples illustrate, it does not follow that their value should be relegated to basic science journals until decades of triangulation accumulate. In many areas of clinical and public health decision-making, observational evidence is the only form of evidence we can obtain—whether because randomized trials would be unethical, infeasible, or prohibitively costly. To dismiss such evidence until “definitiveness” is reached leaves practitioners and patients with no guidance at all, which is not a neutral position but itself a decision with consequences.

The principle of evidence-based medicine has never been “only act on incontrovertible evidence,” but rather to make the best possible decisions with the best available evidence. Observational studies, when conducted rigorously and interpreted with transparency about their limitations, provide essential information on risks, benefits, and trade-offs. Bias reduction strategies, causal inference methods, and careful triangulation can strengthen confidence, but waiting for certainty risks inaction in contexts where timely decisions save lives.

In short, clinical journals serve not only to disseminate perfect knowledge but to guide practice in the face of uncertainty. Excluding well-designed observational work until an undefined threshold of “definitiveness” is achieved would create an evidence vacuum that clinicians would be forced to fill with anecdote, opinion, or commercial interest. The alternative—publishing and critically appraising observational studies as part of the evolving evidence base—is both more realistic and more aligned with the needs of patients and practitioners.

1 Like

That seems to assume that reporting some results is better than reporting nothing. I question that assumption about half the time.

2 Likes

“In short, clinical journals serve not only to disseminate perfect knowledge but to guide practice in the face of uncertainty. Excluding well-designed observational work until an undefined threshold of “definitiveness” is achieved would create an evidence vacuum that clinicians would be forced to fill with anecdote, opinion, or commercial interest. The alternative—publishing and critically appraising observational studies as part of the evolving evidence base—is both more realistic and more aligned with the needs of patients and practitioners.”

I disagree with this view, except in specific clinical contexts. Post #13 of the Table 2 Fallacy thread (Table 2 Fallacy?: Association of 5α-Reductase Inhibitors With Dementia, Depression, and Suicide) describes situations in which I think it can be reasonable for physicians to allow less-than-definitive observational evidence of treatment harm to influence their clinical decision-making. A high-profile example of a situation where clinicians should act on observational evidence of treatment efficacy is masking to reduce the spread of airborne disease (but efficacy here was established primarily by aerosol scientists and also involved experimentation).

Putting aside the niche contexts described above, the critical question is whether, in most cases, patients will be better off if their physician allows observational evidence to guide clinical decision-making. In the case of studies showing weak harm signals, which overstate the certainty of their findings, I feel strongly that the answer is “no,” provided that the clinician is making a concerted effort to apply treatments with well-established efficacy.

I could rattle off an arm’s-length list of potential “harms” of medications I prescribe every day, as identified through observational studies (often by researchers who have fashioned themselves cushy careers by dredging the same administrative database over and over). But if I did that in front of my patients, none of them would want to take statins, PPIs, antidepressants, vaccines, BPH medications, or any type of painkiller. And if this were the case, then I expect that the morbidity and mortality among my patients from MI, CHF, ulcer, GI bleed, infectious disease, urinary retention, acute kidney injury, self-harm, unemployment, and family dysfunction would be much higher.

Clinical journals that hype studies showing weak harm signals are effectively suggesting to clinicians that they should routinely allow highly uncertain evidence to guide their clinical decision-making. But this stance implicitly assumes that physicians don’t have a good rationale for the prescriptions they write - an assumption that’s totally off-base in most cases. Are some medications prescribed inappropriately? Yes! Should all physicians constantly strive to improve their prescribing practices? Yes! But physicians who routinely prescribe inappropriately are NOT likely to be the ones who follow the medical literature (so attempts to “scare” them into doing better by publishing lists of potentially catastrophic consequences of their prescribing are likely to fall on deaf ears anyway…).

I use UpToDate every single day in the office, during the care of my patients. It’s a great resource which summarizes the evidence base for most common medical treatment decisions. I can honestly say that the observational studies described in UpToDate almost never meaningfully influence my practice. I do acknowledge, however, that they might occasionally dissuade me from certain types of “off-label” prescribing. Maybe this approach makes me a bad doctor (?) I don’t know…

I’m not “anti”-observational evidence; far from it. I think observational evidence is indispensable for the purpose of disease surveillance, describing populations, and for finding important, unanticipated strong signals that there can be adverse long-term effects of certain exposures (e.g., vaginal cancer and in utero exposure to diethylstilbestrol). Many people can die if we don’t have good descriptive epidemiologic evidence. I don’t know anything about developing prediction models, so I can’t comment on this application. But I feel strongly that observational studies with obviously causal aims (though this goal is often not stated explicitly) which involve small effects (in relative terms) simply have too much inherent uncertainty, in most (but not all) contexts, to (safely) influence clinical decision-making.

Pavlos’ terrific triangulation of the evidence for a link between vigorous exercise and development of renal medullary cancer among patients with sickle cell trait is an example of the type of observational evidence that I think clinicians should perhaps act on, because, in this specific clinical context, there’s arguably little to no downside to doing so (other than to limit, somewhat, the spectrum of exercise options we recommend for people with sickle cell trait). So too is the painstakingly-triangulated evidence used to show the relationship between EBV infection and subsequent risk of developing MS. These are fantastic examples of the value of observational evidence. But it’s the amount of work that went into establishing these relationships that renders them orders of magnitude more compelling, from a clinical standpoint, than the vast majority of observational studies we see being hyped today in clinical journals.

3 Likes

For me the crux of the discussion is: why would we even expect probability math (the mathematics of games of chance) to have any relevance to thinking about the results of an observational study? It seems like we have no protection from knowing if we have committed the Ludic Fallacy when we use “statistical inference” on such data. John Snow didn’t use any probability or statistics when he successfully figured out the cholera-water problem.

(Addendum: Snow didn’t use “statistics” as defined by what is in a standard modern textbook of statistical inference. He did use “statistics” in the sense that word was used in the 19th century, not the 21st. See my paper on Andrew Carnegie for more elaboration of this point.)

2 Likes

Sometimes the best decision to be made following an assessment of data is that we need more and better data before we can begin thinking about a model at all. Other times, we tentatively make a model but remind ourselves that statistical inferences from that model (including UIs and p-values) are statements about the model, not about the real world. See “Escape from Model Land” by Erica Thompson (Basic Books, 2023).

2 Likes