How to interpret “confidence intervals” in observational studies

I did that already - there were initially three columns, and I edited the post to make them four (justification = Design features that support target plausibility). The big red question mark is already there, in words.

My problem is trying to convince people that basing diagnoses and decisions on ‘abnormal ranges’ and average absolute risk reductions is not a good idea, and that it is important to consider the effect of covariates, especially disease severity. There is great concern about overdiagnosis and overtreatment, but a refusal to consider that this failure might be the cause.

One eminent trialist argued that the confidence intervals of adjusted odds ratios were so wide that they were not worth applying to the baseline risks of individual patients, and that we should stick to average absolute risk reductions from the whole trial. These wide confidence intervals also apply, of course, to other methods such as estimating the crude odds ratio or using the Mantel–Haenszel method. Ideally there should be a way of calibrating the outcome probabilities and absolute risk reductions, with their confidence intervals, arising from all of these models, which of course make many assumptions.

I agree that the adjusted odds ratio is the standard approach, expected by the FDA etc., and able to accommodate multiple covariates. However, in the meantime, I find it simpler and easier to explain the principles by using the raw odds ratio applied to the dominant covariate (e.g. disease severity), or by fitting logistic regression functions to the placebo and treatment limbs to look for treatment heterogeneity.
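For anyone wanting to see what the second approach looks like in code, here is a minimal sketch with simulated data and hypothetical column names (Python/statsmodels; not the poster’s own analysis): fit a logistic curve to each limb separately, or equivalently check a severity-by-treatment interaction in one model.

```python
# Minimal sketch: look for treatment heterogeneity across disease severity
# by fitting logistic regressions to each limb (and, equivalently, by
# testing a severity x treatment interaction in a single model).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
severity = rng.uniform(0, 1, n)                  # hypothetical severity score
treatment = rng.integers(0, 2, n)                # 1 = active, 0 = placebo
eta = -2.0 + 2.5 * severity - (0.4 + 0.8 * severity) * treatment
outcome = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
df = pd.DataFrame({"outcome": outcome, "severity": severity, "treatment": treatment})

# Separate fits to the placebo and treatment limbs
fit_placebo = smf.logit("outcome ~ severity", data=df[df.treatment == 0]).fit(disp=0)
fit_active = smf.logit("outcome ~ severity", data=df[df.treatment == 1]).fit(disp=0)
print(fit_placebo.params, fit_active.params, sep="\n")

# Single-model equivalent: the interaction coefficient indexes heterogeneity
fit_both = smf.logit("outcome ~ severity * treatment", data=df).fit(disp=0)
print(fit_both.summary())
```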

On one point: absolute risk reductions that are averaged, or simply computed marginally, may not apply to anyone in or out of the trial. Adjusted ORs and patient-type-specific absolute risk reduction estimates are more appropriate for individual decision making. When uncertainty intervals are too wide for these, we should not necessarily jump to an inapplicable measure.
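A tiny numerical illustration of that point (my own numbers, assuming for simplicity a constant odds ratio across patients): applying the same OR to different patient-specific baseline risks yields very different absolute risk reductions, so a single marginal ARR may describe no actual patient.

```python
# Illustration: one fixed odds ratio, three different baseline risks,
# three very different absolute risk reductions.
def risk_under_treatment(baseline_risk: float, odds_ratio: float) -> float:
    odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)

OR = 0.6  # assumed constant treatment odds ratio
for baseline in (0.05, 0.20, 0.50):
    arr = baseline - risk_under_treatment(baseline, OR)
    print(f"baseline risk {baseline:.2f} -> ARR {arr:.3f}")
```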

1 Like

I tried, unsuccessfully, to determine who published the first observational study that included a 95% confidence interval around its point estimate and provided a justification for this decision. I did, however, come across a useful textbook written by Alvan Feinstein, published in 1977 (available free online):

Several sections seem to get to the heart of the issues discussed in this thread (most were found after a keyword search of the text for the term “confidence interval”); bolding is mine:

(Page 22)

Statistical principles of experimental design are not pertinent for the many observational surveys, therapeutic surveys, and methodologic explorations that provide the fundamental data of clinical science. Since these surveys and explorations are neither designed nor conducted as “experiments,” their intellectual construction is not considered in statistical descriptions of “experimental design.”

Chapter 2- Random Sampling and Medical Reality (page 170)

…The two main applications of sampling. The data obtained in a sample are usually used for either parametric estimation or analytic contrast. In the estimation procedure, we simply want to know about the magnitude of the target variable. From what was found in the sample, we infer the true value or “parameter” in the parent population…

In most situations of clinical and epidemiologic research, the sample is used for analytic contrast, rather than for estimation. The goal is not to estimate the magnitude of the target variable in the parent population, but to contrast the results found in two different groups, which are defined according to some underlying characteristic. The implication of an analytic contrast is that the difference in results is “caused” by the underlying characteristic…

…Thus, in the samples obtained for estimation, the main questions to be answered about the target variable are how many or how much. In the samples obtained for contrast, the main questions to be answered are how big is the difference and is the difference caused by the underlying variables…

…The mathematical interpretations used for estimating a parameter from a sample are seldom valid when the sample is used for analytic contrast, because the underlying variable is no longer a “passive” term of description.

…Consequently, when a sample is considered for purposes of analytic contrast, rather than for parametric estimation alone, there are two questions of representativeness, not one. The first question deals with choosing the people in the sample; the second deals with allocating the contrasted maneuvers to those people. The problems of allocating an investigated maneuver are substantially different from those of sampling a population,…

SECTION FOUR- MATHEMATICAL MYSTIQUES AND STATISTICAL STRATEGIES (page 285)

CHAPTER 23- Problems in the summary and display of statistical data (page 335)

(Page 342)

In clinical and epidemiologic research, however, the groups under investigation are almost never selected as random samples; and the investigator’s main interest is in contrasting results for two or more groups, not in estimating parameters for a single sample. In these forms of research, neither the standard error nor the confidence interval can supply suitable answers to the fundamental challenges of communication and probabilistic contrast…

…The standard error and confidence interval are not summaries of evidence; they are acts of inference. They do not describe the data. They provide, at most, a guess about the location of the mean of a mythical population in which the investigator may have no interest: and to which, even if interested, he cannot relate his results because the investigated group was not randomly sampled from that population…

…(As long as statistical consultants maintain the belief that any selected groups can be arbitrarily regarded as random samples, the consultants will persist in giving unsuitable advice. It is true that parametric techniques of inference can be, and are, regularly applied for probabilistic contrasts of groups that were not selected randomly. The reason the parametric techniques are effective, however, is not because the contrasted groups can arbitrarily be regarded as random samples, but because the P values that emerge from the indirect parametric calculations are often quite similar to those accurately obtained by direct permutation methods. The persistence of the arbitrary fiction about random sampling during a probabilistic contrast of groups allows the attention of the consultants and their clients to be diverted into thinking about abstract statistical inference, rather than searching for realistic sources of bias in the selection of the groups. The pervasive neglect of this problem is probably the single greatest cause of malpractice* in academic biostatistical consultation today.)…

The author seems to consider the presentation of confidence intervals for studies focused on analytic contrasts between groups, rather than simple description, to be questionable, since parametric methods were not originally developed for this purpose (?) Rather, parametric methods were developed to aid in estimating population parameters on the basis of single random samples, at a time when permutation-based methods of estimation weren’t feasible due to computational constraints (?) And since Feinstein seems only very grudgingly (?) to accept that parametric methods can be defensible for calculating p values (and, presumably, confidence intervals) in studies comparing experimental groups (which at least involve random allocation), I infer that he would not have seen any way to justify their presentation around observational study point estimates…

Understanding all this stuff is a real struggle for someone without much formal training. I tried my best. If anyone has a different interpretation of what Feinstein is saying, I’d be glad to hear it. But, so far, my understanding is that parametric methods were a convenient way, before the time of computers, to estimate population parameters from random sampling conducted in several fields outside medicine.

At some point (?1970s–80s), clinicians started to do more formal/structured medical research and asked statisticians for help in analyzing their data for the purpose of publishing in clinical journals. Most of these studies involved analytic contrasts between groups (not estimation of population parameters from single random samples). I found several articles published in the 1980s that provided “guidelines” for how statistics should be presented in medical journals; inclusion of confidence intervals was encouraged. At that point, whether out of habit (they had been working with random sampling up until then) or because powerful computers still weren’t available to do permutation-based analyses easily (or both), statisticians (and/or epidemiologists) decided to extrapolate parametric methods from their previous use in estimation from single random samples to inference based on contrasts between two non-random samples. And, in an act of even more profound methodologic distortion, someone must have eventually extrapolated these methods even further, from their application in experimental settings (where permutation methods founded on random allocation might still be defensible) to observational settings (which involve neither random sampling nor random allocation).

This “inflection point” in the history of application of statistical methods to clinical data seems to have been triggered by the need to agree on publication standards for medical research (which was exploding at the time). Unfortunately, these very questionable statistical capitulations were so complicated in their derivation/attempted justification that most researchers simply accepted them, relatively unchallenged. And, since these methods persist even today, we can assume that any criticism of this perversion of parametric methods has simply been deflected with hand-waving by several subsequent generations of researchers who have decided it’s easier to just “go with the flow.” And here we are, 40 years later, still dealing with the fallout of an (arguably) highly destructive historical decision…And news headlines underpinned by this capitulation are affecting patients’ decisions about which medications to start and stop…

Please tell me if I’ve got this all wrong (I really hope this is the case; otherwise…gulp…).

4 Likes

On a related note, I just finished David Spiegelhalter’s latest book, “The Art of Uncertainty”. Highly recommended, with multiple pearls and historical context on how to communicate uncertainty.

Spiegelhalter’s statistical work is highly influential and part of the inferential DNA of many applied data scientists, especially Bayesians. He is an integral part of my inferential DNA. But he is also an excellent communicator, with decades of experience in popularizing data analysis results.

4 Likes

That book has made it to my reading table. Looks like a good one.

The quotes from Alvan Feinstein, whom I knew pretty well and who is considered to be the father of clinical epidemiology, are really perceptive and hit several points precisely. One of them explains my disdain for confidence intervals on individual Kaplan-Meier curves for non-sampled data (even RCTs).

3 Likes

This is peripheral to the thread, but since I mentioned it here in post 31, I have now posted (in a temporary location) my reply to Gang (John) Xie’s op-ed in last December’s issue of Amstat News. However my reply adds nothing to the thread.

1 Like

These quotes from Feinstein are quite eye-opening. They inspired me to look at K. J. Rothman, 2012: Epidemiology: An Introduction, 2nd ed. (Oxford University Press). On page 150, he writes:

The confidence interval is defined statistically as follows: If the level of confidence is set to 95%, it means that if the data collection and analysis could be replicated many times and the study were free of bias, the confidence interval would include within it the correct value of the measure 95% of the time. This definition presumes that the only thing that would differ in these hypothetical replications of the study would be the statistical, or chance, element in the data. It also presumes that the variability in the data can be described adequately by a statistical model and that biases such as confounding are nonexistent or completely controlled. These unrealistic conditions are typically not met even in carefully designed and conducted randomized trials. In nonexperimental epidemiologic studies, the formal definition of a confidence interval is a fiction that at best provides a rough estimate of the statistical variability in a set of data. It is better not to consider a confidence interval to be a literal measure of statistical variability but rather a general guide to the amount of random error in the data.
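As a toy illustration of that repeated-replication definition, under exactly the idealized conditions Rothman describes (correct model, no bias) - my own sketch, not from the book:

```python
# Toy illustration of the repeated-replication definition of a 95% CI:
# under a correct model with no bias, the interval covers the true mean
# in roughly 95% of hypothetical replications of the study.
import numpy as np

rng = np.random.default_rng(1)
true_mean, n, reps = 1.0, 50, 10_000
covered = 0
for _ in range(reps):
    x = rng.normal(true_mean, 1.0, n)
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half_width) <= true_mean <= (x.mean() + half_width)
print(covered / reps)  # close to 0.95 only because every assumption holds by construction
```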

7 Likes

I would agree with that - I wouldn’t draw them on KM curves either, but I do report IQRs for survival estimates at fixed time points.

I also knew Alvan since my medical school days and have communicated with him through postal letters (well before the internet appeared on the scene). He sadly passed away in 2001.

2 Likes

The good quote from Rothman is to me the nail in the coffin of trying to consider study repetitions or sampling distributions and should push us all the way to Bayes which tries to uncover the hidden truths in the data generating process whatever that process is. That is what makes Bayes apply to one-time events. Coupled with the ease with which bias parameters can be added to Bayesian regression models, why are we still pretending that confidence intervals apply to one-time studies?

4 Likes

I agree fully that these intervals were not created in the 1930s (or thereabouts) to apply to one-time studies, and that their interpretation with respect to the realized one-time study has only recently been clarified. However, they are not going to go away, and therefore it is as important to clarify them for frequentist use as it is to advocate the Bayesian alternative. So we have to do both together.

Btw, just for Erin’s attention, this is what Feinstein said in his book (Principles of Medical Statistics):

In contrast to the process of parametric sampling, the observed group in most medical research is not chosen as a random sample, and a parameter is not being estimated. Nevertheless, because the group is a sample of something, the central index might have numerical variations arising merely from the few or many items that happen to be included. For example, a particular treatment might have a success rate of 50% in the “long run” of 200 patients, but might be successful in 75% of a “short run” of 4 patients. By making provision for numerical caprices that might occur in a group of size n, the confidence interval serves to warn or anticipate what could happen in the longer run.
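Feinstein’s short-run/long-run point is easy to check numerically (my own illustration, using SciPy): with a true success rate of 50%, observing 3 successes in 4 patients is unremarkable, and the exact 95% confidence interval around the observed 75% is correspondingly enormous.

```python
# Feinstein's "short run" example in numbers: with a true success rate of 50%,
# 3 successes in 4 patients is quite likely, and the exact 95% CI around the
# observed 75% is extremely wide.
from scipy.stats import binom, binomtest

print(binom.pmf(3, 4, 0.5))   # P(exactly 3 of 4 successes) = 0.25
print(binom.sf(2, 4, 0.5))    # P(3 or more of 4)           = 0.3125
ci = binomtest(3, 4).proportion_ci(confidence_level=0.95, method="exact")
print(ci)                     # roughly 0.19 to 0.99
```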

This 2006 publication (Greenland, S. Bayesian perspectives for epidemiological research: I. Foundations and basic methods) also speaks directly to the issues in this thread (again, bolding is mine):

In the randomized trial and random-sample survey in which they were developed, these frequentist techniques appear to be highly effective tools. As the use of the methods spread from designed surveys and experiments to observational studies, however, an increasing number of statisticians questioned the objectivity and realism of hypothetical infinite sequences.1–7 They argued that a subjective-Bayesian approach better represented situations in which the mechanisms generating study samples and exposure status were heavily non-random and poorly understood. In those settings, which typify most of epidemiology, the personal judgements of the investigators play an unavoidable and crucial role in making inferences and often over-ride technical considerations that dominate statistical analyses (as perhaps they should 8).

I argue that observational researchers (not just statisticians) need training in subjective Bayesianism 1,2,16 to serve as a counterweight to the alleged objectivity of frequentist methods. For this purpose, neither so-called ‘objective’ Bayesian methods 17 nor ‘pure likelihood’ methods 18 will do, because they largely replicate the pretence of objectivity that render frequentist methods so misleading in observational settings.

Subjective Bayesian methods are distinguished by their use of informative prior distributions; hence, their proper use requires a sound understanding of the meaning and limitations of those distributions, rather than a false sense of precision. In observational studies neither Bayesian nor other methods require precise computation, especially in light of the huge uncertainties about the processes generating observational data (represented by the likelihood function), as well as uncertainty about prior information.

A crucial parallel between frequentist and Bayesian methods is their dependence on the model chosen for the data probability P(data | parameters). Statistical results are as sensitive to this choice as they are to choice of priors. Yet, this choice is almost always a default built into statistical software, based on assumptions of random sampling or random treatment assignment (which are rarely credible in observational epidemiology), plus additivity assumptions. Worse, the data models are often selected by mechanical algorithms oblivious to background information, and as a result often conflict with contextual information. These problems afflict the majority of epidemiological analyses today in the form of models (such as the logistic, Poisson, and proportional-hazards models) that make interaction and dose–response assumptions, which are rarely if ever justified. These models are never known to be correct and in fact cannot hold exactly, especially when one considers possible study biases.27,28

It is important to recognize that the subjective-Bayesian interpretation is much less ambitious (and less confident) than the frequentist interpretation, insofar as it treats the models and the analysis results as systems of personal judgements, possibly poor ones, rather than as some sort of objective reality. Probabilities are nothing more than expressions of opinions, as in common phrasings like ‘it will probably rain tomorrow’. Reasonable opinions are based heavily on frequencies in past experience, but they are never as precise as results from statistical computations.

Frequentist fantasy vs observational reality

For Bayesian methods, there seems no dispute that the results should be presented with reference to the priors as well as to the data models and the data. For example, a posterior interval should be presented as ‘Given these priors, models, and data, we would be 95% certain that the parameter is in this interval’.

A parallel directive should be applied to frequentist presentations. For example, 95% confidence intervals are usually presented as if they account for random error, without regard for what that random error is supposed to represent. For observational research, one of the many problems with frequentist (repeated-sampling) interpretations is that it is not clear what is ‘random’ when no random sampling or randomization has been done. Although ‘random variation’ may be present even when it has not been introduced by the investigator, in observational studies there is no basis for claiming it follows the distributions that frequentist methods assume, or any known distribution.27 At best, those distributions refer only to thought experiments in which one asks, ‘if data were repeatedly produced by the assumed random-sampling process, the statistics would have their stated properties (e.g. 95% coverage) across those repetitions’. They do not refer to what happens under the distributions actually operating, for the latter are unknown. Thus, what they do say is extremely hypothetical, so much so that to understand them fully is to doubt their relevance for observational research.4

Frequentist results are hypothetical whenever one cannot be certain that the assumed data model holds, as when uncontrolled sources of bias (such as confounding, selection bias, and measurement error) are present. In light of such problems, claims that frequentist methods are ‘objective’ in an observational setting seem like propaganda or self-delusion.4–6,26,28 At best, frequentist methods in epidemiology represent a dubious social convention that mandates treating observational data as if they arose from a dream or fantasy of a tightly designed and controlled randomized experiment on a random sample (that is, as if a thought experiment were reality). Like many entrenched conventions, they provoke defences that claim utility12,33 without any comparative empirical evidence that the conventions serve observational research better than would alternatives. Other defences treat the frequentist thought experiments as if they were real—an example of what has been called the mind-projection fallacy.34

Were we to apply the same truth-in-packaging standard to frequentists as to Bayesians, a ‘significant’ frequentist result would be riddled with caveats like ‘If these data had been generated from a randomized trial with no drop-out or measurement error, these results would be very improbable were the null true; but because they were not so generated we can say little of their actual statistical significance’. Such brutal honesty is of course rare in observational epidemiology because emphasizing frequentist premises undermines the force of the presentation.

That last line really hits home. It highlights that it is the perverse incentives for fame/publication/promotion- the ability to shout “look what I’ve discovered!” and/or “look where I published my results”- which have perpetuated the egregious over-interpretation of clinical observational evidence. IMHO, this needs to stop.

5 Likes

These words of Sander Greenland are letter perfect and are in direct conflict with the opinion of @s_doi that we need to deal with frequentist interpretations because they are still used. THIS HAS TO STOP.

3 Likes

Frank, I agree completely with Sander’s comments and there is no conflict between us. I said this clearly before and in the table above that if the target of estimation is wrong the frequentist measures of precision do not matter. That is exactly what Sander is saying too albeit with much more emphasis and emotion, which only results in frequentists becoming more entrenched because they are offended.

For example, Sander says: “Frequentist results are hypothetical whenever one cannot be certain that the assumed data model holds, as when uncontrolled sources of bias (such as confounding, selection bias, and measurement error) are present. In light of such problems, claims that frequentist methods are ‘objective’ in an observational setting seem like propaganda or self-delusion.”

This is exactly what I have been saying, but more politely, and what the table says, and anyone (including myself) who “claims that frequentist methods are ‘objective’ in an observational setting” is indeed making a gross error (I prefer to use this term rather than Sander’s choice of “is involved in propaganda or self-delusion”).

2 Likes

…which only results in frequentists becoming more entrenched because they are offended.

That Greenland article was published almost 20 years ago. As far as I can tell, not much has changed since then with regard to the way observational evidence is promoted in the media or in clinical journals. But I don’t think the reason nothing has changed is that observational researchers who use frequentist methods are easily “offended” by having their bad habits called out. Rather, I think they simply have too much to lose by changing their ways. This is why repeated, blunt criticism from top statistical experts is needed, as opposed to vague/obfuscating language that could end up being interpreted, by many, as an endorsement of the status quo.

To date, researchers who perpetuate these practices have only been rewarded for their approach. The faster they churn out studies, the more publications they accrue, and the greater their fame and chance for promotion. Including confidence intervals around point estimates, then claiming “discovery” when said intervals don’t cross the null, has worked very well for them until now, so what’s the incentive for them to learn a whole new approach to study design and analysis? It feels like it’s long past the time to implement a deterrent disguised as an incentive.

Maybe journals should start requiring subjective Bayesian- rather than frequentist- presentations of observational evidence (?) At the very least, this practice would compel the worst actors (i.e., researchers with careers historically dependent on the dredging of administrative databases, followed by HARK-ing) to openly address the elephant in the room- namely, that their “results” might constitute, AT BEST, a highly tentative and non-actionable first step in exploring a question of importance to patients. Mandating the statement of an explicit “prior” would force many to admit that they had little or no reason to suspect a causal association between two variables before they dredged that database. This forced admission could cause many to change their ways. Of course, journals would also need an incentive to overhaul their editorial practices and become familiar with subjective Bayesian methods.

Until such time as subjective Bayesian methods can be understood by the broader research community, journals would risk a dramatic drop in their submissions and in their pool of qualified reviewers. In turn, publishers with multiple billions of dollars at stake would risk a dramatic drop in their profits. And, given the likelihood that these changes would curb the phenomenon of hyper-prolific authors, publication rates would plausibly also drop more permanently. This would be a very good outcome for science, but publishers would likely fight the change every step of the way, given the amount of money at stake.

Sigh…Clearly, these changes are not coming any time soon. But, equally clearly, they do need to come. If history is any guide, it’s clear that such a major change will only “take root” in the research ecosystem through a concerted, likely decades-long effort to expose up-and-coming new statisticians and epidemiologists to Bayesian methods during their training.

3 Likes

I agree; if we can make a wholesale move to the Bayesian approach then, as Frank has said, we have the assumptions staring us in the face, so there is less chance of what Sander calls propaganda and/or self-delusion.

I am being pragmatic: it will not happen in our lifetime, so why do we need to have that struggle? Let’s fix what we have and also advocate for change. Complaining and name-calling gets us nowhere, and that is why Sander concluded that methodological development and training should go beyond coverage of mechanistic biases (e.g., confounding, selection bias, measurement error) to cover distortions of conclusions produced by statistical methods and psychosocial forces.

Addendum: We have begun covering distortions of conclusions produced by statistical methods in our medical school as part of the curriculum. We must make sure that frequentist approaches are understood as part of the core clinical curriculum for a simple reason: the board exams (we use IFOM, which is akin to USMLE) support this, and sometimes the questions themselves are full of cognitive biases - what do we do about this?

2 Likes

This reasoning doesn’t make sense to me. And this view is exactly why nothing has changed for the past 50 years! Very few stats/epi professors teaching today are likely to have been educated in Bayesian methods. These methods are difficult to grasp, and it has become very clear that many are simply unwilling to invest the time and effort to learn them; they are complacent and stuck in their ways. And if professors don’t learn these methods, then students will never learn them. Something has to give…

Science doesn’t progress if its leaders always take the easy way out. Physicians have to keep up-to-date on therapeutic advances for the benefit of their patients. We no longer lobotomize patients with psychosis. Pavlos learned about DAGs (a heroic effort!) so he could apply this knowledge to design of oncology RCTs. So why shouldn’t stats and epi professors also be required to learn and teach newer methods as they arise? Who holds their feet to the fire in this regard?

For the same reason, this view and its circular logic don’t make any sense to me: “The board exams are founded on faulty epi/stats concepts, so we have to keep teaching the faulty methods to help our students pass their exams” (??) I appreciate your frustration, but the solution is not capitulation; it is a very loud expression of the importance of changing the system.

Finally, viewing this problem not simply as an “academic” issue, but also as a massive economic one helps to put the urgency of change into better perspective. Physicians can easily disregard inappropriately-presented observational evidence and not allow it to influence their practice. But think about all the money that governments (and therefore taxpayers) and other funding bodies spend to pay for research that has no discernible impact because of poor design and analysis. And about the cost of poor medical decisions made by patients after reading badly-designed and presented research (or hearing about it on the news). And about the cost to payers for all the medical appointments that are needed to help address patients’ questions and provide reassurance and appropriate context. And about how many other ways talented researchers might be able to make important contributions to science if they’re able to get off the publish-or-perish hamster wheel.

3 Likes

Pragmatism is not working, but practical Bayesian solutions and “just saying no” when we are reviewers or editors will.

Many have said that a blanket move to Bayes will just introduce new problems. I strongly disagree with this, for the following reasons:

  • just doing something a different way causes everyone involved to think more
  • Bayes does not allow the “absence of evidence is not evidence of absence” error to occur
  • Bayesian uncertainty intervals are exact if all assumptions hold whereas frequentist intervals and p-values are often incorrect even under ideal situations due to sampling distributions being much different from Gaussian
  • perhaps most importantly, Bayesian approaches better expose the assumptions being made, including an explicit prior rather than a highly subjective shifting of evidence about data (p-values) into evidence about effects. You can easily do stupid things with Bayes, but it’s also easier to catch perpetrators. Related to this, I’ve seen reviewers be on their toes when critiquing applications of Bayes but completely at ease with major assumption violations when familiar frequentist methods are used.
  • the ease of incorporating unmeasured confounders and other model departures into Bayesian models by adding catch-all bias parameters (a rough sketch follows this list). If an author only gets a “definitive” result when the bias parameter has extremely small variation about zero, you know that the study is not capable of providing a conclusion.
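For concreteness, here is a rough sketch of one way such a catch-all bias parameter can be coded (illustrative names, priors, and simulated data, written with PyMC; not necessarily how the poster would set it up). The data identify only the sum of the treatment coefficient and the bias term, so the prior placed on the bias term propagates non-random-error uncertainty directly into the posterior for the treatment effect.

```python
# Rough, illustrative sketch of a catch-all bias parameter in a Bayesian
# logistic model (simulated data; names and priors are hypothetical).
# The data identify only (theta + bias); the prior on bias propagates
# non-random-error uncertainty into the posterior for theta.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(2)
n = 1000
trt = rng.integers(0, 2, n)                      # hypothetical exposure indicator
sev = rng.uniform(0, 1, n)                       # hypothetical measured covariate
eta_true = -1.0 + 1.5 * sev + 0.5 * trt          # simulated data-generating process
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta_true)))

sigma_bias = 0.3   # prior SD for residual bias on the log-odds scale (analyst's judgement)

with pm.Model():
    alpha = pm.Normal("alpha", 0.0, 2.0)
    beta_sev = pm.Normal("beta_sev", 0.0, 2.0)
    theta = pm.Normal("theta", 0.0, 2.0)         # causal log odds ratio of interest
    bias = pm.Normal("bias", 0.0, sigma_bias)    # catch-all bias parameter
    eta = alpha + beta_sev * sev + (theta + bias) * trt
    pm.Bernoulli("y", logit_p=eta, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(az.summary(idata, var_names=["theta", "bias"]))
```

Re-running the sketch with sigma_bias shrunk toward zero is the sensitivity check described in the bullet above: if the result looks “definitive” only in that case, the data alone cannot support the conclusion.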

As a slight aside, most observational data analyses have missing data. The frequentist approximate ways of handling this are ultimately going to be found disappointing. Bayes formalizes the problem and, by making the models explicit, provides methods that are easier to criticize.

7 Likes

I agree Frank, and I am all for embracing change - when it happens. The push to replace frequentist with Bayesian inference is unlikely to overcome poor design, weak incentives, and cognitive biases, which will persist regardless of statistical framework. For example, bans on p-values and confidence intervals, such as at BASP, have not triggered a widespread embrace of Bayesian methods, because culture, training, and practical inertia weigh more heavily than practical arguments. A more pragmatic stance is to recognize that Bayesian tools can offer clearer interpretability and cumulative learning when applied to the right problems, but that wholesale replacement is neither realistic nor necessary. Change will come less from bans or advocacy (Sander has been doing this for decades, and the 2016 ASA statement did not make a dent) than from demonstrating practical advantages in real scientific applications. The latter is what should start now if we are to see change happen sooner rather than later.

Bayesian uncertainty intervals are exact if all assumptions hold

“…if all assumptions hold…” is the same caveat that the frequentist methods have.

the ease of incorporating unmeasured confounders and other model departures into Bayesian models by adding catch-all bias parameters

Is there anything in the data that helps estimate these parameters, or are they left unchanged after the model is fit?