How to interpret “confidence intervals” in observational studies

James_Stanley · August 13, 2025, 9:23pm

Here is the paper I noted above from Brillinger, back in 1986, that includes a laid out argument for considering variation in whole-of-population measures. It also includes invited commentary from a lot of statisticians which adds a lot of value both as a historical document and as a collated piece of work.

Brillinger DR. The natural variability of vital rates and associated statistics. Biometrics. 1986 Dec;42(4):693-734. PMID: 3814721.

Here are the first couple of paragraphs, just to give a feel for what is covered.

Vital statistics are data on the fundamental events of human lives-events such as birth, death, marriage, and the like. They usually take the form of counts or rates and are often collected via censuses and legally required registrations. They are used for summarization, comparison, forecasting, detection of change, hypothesis generation, surveillance, and studying public health generally.

A continuing presence is a wish to make comparisons-comparisons between regions, (comparisons between) time periods, (comparisons between) social groups, and so on. Now, in many circumstances the data are virtually complete so that it is a fact that death rates differ for two counties or two years or two races. What is more likely of concern then is: do two death rates differ by more than some level of natural fluctuations? The purpose of this study is the stochastic conceptualization and formalization of the natural variability of vital statistics to support their use in comparisons and other analyses. The particular cases of mortality and of groups specifically delineated in age and time will be emphasized; however, results for a broad variety of other cases should be apparent.

s_doi · August 14, 2025, 2:11am

NB have edited the picture to add a legend to explain your first two questions

When I talk about infinite membership I mean I can repeat the same study with the same sample size and selection rules but include a completely different set of people and can do this (at least theoretically) infinitely. Statistical inference is focused on the variability across these studies keeping everything else constant.

Probability models assume that the populations they model follow the rules in which our study subjects were recruited/selected for study.

ChristopherTong · August 14, 2025, 5:44am

why do Miguel Hernan and colleagues go to great lengths to emulate a target trial using observational data and then proceed to provide statistical inferences based on such designs?

I’m not sure which paper you are referring to, and without reading it, I would not know the answer. I would be happy to learn if the causal inference people have a different perspective than mine on this topic, if they specifically address @ESMD 's original question.

For a more general discussion, more eloquent that I can give, see the Freedman 2008 reference I cited earlier. In my opinion that paper is required reading for anyone interested in observational studies. I failed to provide a link earlier; hopefully this one works: https://projecteuclid.org/journals/statistical-science/volume-14/issue-3/From-association-to-causation--some-remarks-on-the-history/10.1214/ss/1009212409.full

why are we mixing up model uncertainty with random error based statistical inference?

We aren’t mixing them up. Model uncertainty is almost always completely ignored, including in the case under discussion. In this case, all the uncertainty is model uncertainty, and although “random error based statistical inference” is being reported, it has no apparent justification/interpretation. In other words, the uncertainty being conveyed is phony, while the real uncertainty is being hidden (and can’t be quantified anyway).

As I’ve noted elsewhere we should be wary of committing the Ludic Fallacy. Why should the mathematics of games of chance (probability theory) be expected to have any relevance to observational studies at all? See Harry Crane’s Naive Probabilism as well as the Kay and King (2020) book I cited previously. Link for Crane: https://researchers.one/articles/20.03.00003

s_doi · August 14, 2025, 6:10am

Good point, people tend to forget that we are quantifying random error based uncertainty and the true uncertainty will always be greater. However we work from the assumption that bias has been addressed before the UI was created which of course is not true. I cannot see anyway around this and Frank @f2harrell has also alluded to this in post 15. I believe those that ban P values and UIs (e.g. the editors of BASP) have not really achieved anything as its better to have an approximate metric than no metric at all. Keep in mind this has nothing to do with the frequentist - Bayesian divide as all methods be they UIs or credible intervals suffer from similar issues across the random error / systematic error divide.

Addendum: Actually I will have to reconsider the above thoughts as Sander once said in a paper that “...errors due to bias are not necessarily more costly than errors due to chance, and are certainly not infinitely more costly (as is implicitly assumed by limiting one’s attention to unbiased estimators” “https://doi.org/10.1093/biostatistics/2.4.463”

f2harrell · August 14, 2025, 10:47am

I do think that errors due to bias are often larger than errors due to chance, my Iraq example coming to mind.

ESMD · August 14, 2025, 2:17pm

I just read somewhere that “random error” includes sampling errors. But if this is true, then is there an error term in the statistical models that we use to interpret RCTs and observational studies that reflects sampling error?

s_doi · August 14, 2025, 2:21pm

Frank, are you in favor of unbiased estimators? Wouldn’t a biased estimator with a smaller MSE be preferable? For example in MA we only do it once for any question and we can trade off variance against bias to such an extent that the MSE drops compared to the unbiased arithmetic mean estimator. Isn’t that why we do not use the arithmetic mean and instead opt for a biased weighted average in MA?

f2harrell · August 14, 2025, 5:13pm

Don’t know where you got that idea. I don’t look at bias by itself, but RMSE, mean absolute error, median absolute error, etc. If estimating an exposure effect we want P(estimate close to truth) to be large and that is highly related to RMSE.

s_doi · August 14, 2025, 9:09pm

Okay we are on the same page here - must have misunderstood one of your posts

I am talking about probability models in this thread, not statistical models. A probability model is one single, fully-specified distribution. A statistical model is a set of possible probability models, from which one is assumed to have generated the data, but with unknown parameters.

ESMD · August 14, 2025, 9:59pm

Blah- I surrender. My brain hurts. Will stick to my day job. This is all just far too opaque…

s_doi · August 14, 2025, 11:32pm

Worth persisting on this as you could then pass this down the chain to other clinicians - this paper might help

ChristopherTong · August 15, 2025, 4:06am

If it’s any consolation, there are others who share your concerns about the meaning of statistical inference in such scenarios. See this commentary from Amstat News last year, that is particularly pertinent to the case of observational studies:

https://magazine.amstat.org/blog/2024/12/02/a-practicing-statisticians-plea/

(I have drafted a lengthy response to him, since I think he missed a couple nuances, but haven’t found a suitable place to post it. It’s too long to leave as a comment at the above site.)

s_doi · August 15, 2025, 11:02am

I think this author is mixing up random sampling, randomization and random error. To make it clear for clinicians perhaps this table can clarify this. The interval is based only on random error

Concept	What it involves	Goal	If absent

Random sampling	Selecting units from a population so every member has a known, nonzero chance of being chosen.	Achieve a representative sample so inferences generalize to the population.	Sample may be biased — can’t claim population representativeness.


Randomization (random allocation)	Assigning participants in a study to treatment or control groups by chance.	Balance both known and unknown confounders across groups for valid causal inference.	Groups may differ systematically → confounding risk.
Random error	Unpredictable variation in data due to chance factors (measurement noise, biological variability, sampling fluctuation).	Quantifies precision (via SE, UI, CI).	Estimates may be imprecise — more spread between repeated samples.

f2harrell · August 15, 2025, 11:19am

I see that @James_Stanley has started a new topic to discuss the uses and misuses of regression paper.

ESMD · August 15, 2025, 10:39pm

Thanks very much Chris, for all this great reading material! I read the following articles that are cited in the AmStat letter you mention:

Hubbard et al: https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1464947#abstract

Lots of good stuff here, though some of the statements about medical RCTs miss the mark (e.g., the assertion that subgroup analysis is rarely performed…).

And then there’s your article from 2019:

https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1518264#abstract

What a tour de force ! It reminds me of Pavlos’ magnum opus (“Interpreting Randomized Controlled Trials”). I really appreciate comprehensive, simply written articles like this, which draw on the history of a field. The writing is so clear. It’s only when you read papers like this that you realize how utterly bewildering the other 99.9% of stats/epi publications are.

The overwhelming feeling I get when I try to understand most publications in these fields is of being gaslighted. In one ear, some experts are whispering “everything is just fine/interpretable, we understand our subject is hard to grasp, but you’ll understand some day; you just need to read a few hundred more hieroglyphics- filled articles/texts…,” while in the other ear other experts are whispering “forget about all the hieroglyphics- our field is in an advanced state of decay because we’ve all been sweeping major conceptual distortions under the rug for decades.”

I’m not going to spend any more time agonizing over the meaning of “95% confidence intervals” in observational studies. I just don’t see the point. Plainly, there are many experts who do consider them to be (somehow) interpretable and valuable- otherwise, we wouldn’t see them all over the epi literature- right (?) But, given their historically exalted status with regard to decision-making, it’s utterly mortifying that, at the same time, someone like you (and, evidently other expert statisticians and possibly other epidemiologists), does NOT understand their meaning or consider them to be interpretable. I used this analogy somewhere else, but this divide is analogous to neurosurgeons disagreeing with neurologists about neuroanatomy (!) How is such a fundamental divide even possible among experts, and what does this say about the state of stats/epi education and the prospects for students (and clinicians) to understand these concepts ??? This seems like a very sad state of affairs …

davidcnorrismd · August 16, 2025, 9:52am

I remember a Family Medicine instructor once noting that the proliferation of OTC products in the cough-and-cold aisle at the pharmacy tells us that none of them can possibly be very effective.

f2harrell · August 16, 2025, 11:50am

Do note that @ChristopherTong ‘s wonderful paper is mostly related to a frequentist perspective. Bayesian models can take model uncertainty into account, when practitioners are diligent to add the extra parameters needed.

R_cubed · August 16, 2025, 12:25pm

This was going to be my criticism. I’ll only add that your Bayesian solution will not be satisfactory to the dogmatic frequentists because it ends up being “subjective”.

Taken literally, the recommendations in that paper cited by Christopher Tong would require scientists to so severely discount any source of data that he or she did not personally collect, it might as well be treated as having no credibility. That would include meta-analysis of RCT’s, since meta-analyses of RCTs are also merely “observational”.

Berlin JA, Golub RM. Meta-analysis as Evidence: Building a Better Pyramid. JAMA. 2014;312(6):603–606 link

JAMA considers meta-analysis to represent an observational design, with measures that should be interpreted as associations rather than causal effects.

I thought Efron’s old paper “Why isn’t everyone Bayesian” settled the dispute on foundations, where commentators agreed that the subjectivists had the best arguments, but practice dictated use of other procedures that were close approximations.

Efron, B. (1986). Why isn’t everyone a Bayesian?. The American Statistician, 40(1), 1-5. link

If the frequentists are going to complain about the Bayesian prior, why should the Bayesian meekly accept that a study that claims to use randomization is inherently more credible than any other type of formal model for the data collection procedure, when other claims from that same agent are treated with doubt, if not strong skepticism?

The Bayesians in recent times (with an important exception) have been too charitable in conditioning on the reports of others that claim to use randomization. Rethinking this naive acceptance of reports that claim to use randomization can lead to an impasse – how does a community of scientists collect data and conduct experiments to decide questions of fact that leads to a convergence of opinion, which seems to be the explicit goal of scientific inquiry?

The field of mechanism design provides some answers, but none of this appears to be on the radar of statisticians. It would unify the frequentist and Bayesian perspectives on the design of experiments by providing a common language to discuss goals and methods using game theoretic concepts, which have already been used in theoretical statistics.

The key idea of mechanism design is identifying goals first and then attempting to design a system that achieves those goals. In other words, at the beginning of the process, the goals are given, and the ideal mechanism is the unknown. This contrasts with “positive” or predictive economics, which studies the actual or likely outcomes of a given system. In that case, the system is the given, and the outcomes are the unknowns.

The closest paper that comes to approaching this Frequentist/Bayesian dispute from a mechanism design point of view (without realizing it) is a paper from 2014 by Atkinson, that discusses biased coin designs, which I mentioned in this thread:

Atkinson AC. (2014). Selecting a Biased-Coin Design, Statistical Science, Statist. Sci. 29(1), 144-163 link

Biased-coin designs are used in clinical trials to allocate treatments with some randomness while maintaining approximately equal allocation. More recent rules are compared with Efron’s [Biometrika 58 (1971) 403–417] biased-coin rule and extended to allow balance over covariates. The main properties are loss of information, due to imbalance, and selection bias. Theoretical results, mostly large sample, are assembled and assessed by small-sample simulations. The properties of the rules fall into three clear categories. A Bayesian rule is shown to have appealing properties; at the cost of slight imbalance, bias is virtually eliminated for large samples.

Recommended Reading

Briggs is a bit extreme in his critique, but his fundamental point is valid, and his argument that “randomization” is a religious ritual that “blesses” a data set can explain the rise of “non-comparative randomized trials” discussed in this other thread:

The following should be studied together, as it presents a coherent way to do probabilistic bias analysis that @Sander has advised for decades, based on Bayesian Trust modelling (which is used in information security contexts):

Josang, A., Hayward, R., & Pope, S. (2006). Trust network analysis with subjective logic. In Conference Proceedings of the Twenty-Ninth Australasian Computer Science Conference (ACSW 2006) (pp. 85-94). Australian Computer Society. link

Of course this 2005 paper by @Sander_Greenland remains relevant for this thread. Note he also mentions the importance of bias analysis in the context of randomized studies.

Greenland, S. (2005). Multiple-bias modelling for analysis of observational data. Journal of the Royal Statistical Society Series A: Statistics in Society, 168(2), 267-306. PDF

This paper by Philip Stark is also worth re-examination (updated for the actual published citation):

Stark, P. B. (2022). Pay no attention to the model behind the curtain. Pure and Applied Geophysics, 179(11), 4121-4145. link

s_doi · August 16, 2025, 12:53pm

The paper by @christophertong argues that formal statistical inference “should play no role” in most science, reserving it mainly for fully prespecified confirmatory settings. I think that is too restrictive and suggest the following as a guide for clinical users of research evidence:

1. Focus on effect size, not just divergence.

Look at the range of the UI to see what effect sizes are plausible.

Example: A drug lowers blood pressure by –8 mmHg (95% UI –12 to –4) is clinically meaningful.

If the UI is –1 to +7, the effect is uncertain: it could even be harmful or trivial.

2. Treat UIs as a measure of precision , not absolute truth.

A narrow UI means the estimate is more precise.

A wide UI means the result is noisy, regardless of whether it excludes the null.

Remember: precision ≠ accuracy — a narrow UI around a biased estimate is still misleading.

3. Remember model assumptions.

UIs assume the model is correctly specified (variables chosen, distribution assumptions, etc.).

If the analysis was tweaked after seeing the data, the UI is often too narrow (overconfident).

4. Use UIs for comparison across studies.

Overlapping UIs across independent studies suggest results are consistent.

If different studies give UIs pointing in opposite directions, results are not reliable yet.

5. Don’t confuse UI width with clinical importance.

A very narrow UI around a tiny effect (e.g., –0.3 to –0.1 mmHg) is precise, but trivial for practice.

Always ask: is the range of plausible effects clinically meaningful?

6. Be alert in exploratory research.

In early, flexible analyses, treat UIs as descriptive: they show variability in the data you have, not a guaranteed measure of uncertainty.

Confirmation should come from replication, not just a UI that looks convincing in one study.

R_cubed · August 16, 2025, 5:31pm

It just so happens @Sander_Greenland has published a recent paper of great relevance to this thread.

@ESMD: the bolded portion of this portion of the abstract is the best description I’ve seen on how to practically interpret the statistics from medical research papers. Hopefully this answers your question.

The study of associations and their causal explanations is a central research activity whose methodology varies tremendously across fields. Even within specialized subfields, comparisons across textbooks and journals reveals that the basics are subject to considerable variation and controversy. This variation is often obscured by the singular viewpoints presented within textbooks and journal guidelines, which may be deceptively written as if the norms they adopt are unchallenged. Furthermore, human limitations and the vastness within fields imply that no one can have expertise across all subfields and that interpretations will be severely constrained by the limitations of studies of human populations. The present chapter outlines an approach to statistical methods that attempts to recognize these problems from the start, rather than assume they are absent as in the claims of ‘statistical significance’ and ‘confidence’ ordinarily attached to statistical tests and interval estimates. It does so by grounding models and statistics in data description, and treating inferences from them as speculations based on assumptions that cannot be fully validated or checked using the analysis data.

Also relevant:

Rovetta, Alessandro et al. (2025) p-Values and confidence intervals as compatibility measures: guidelines for interpreting statistical studies in clinical research The Lancet Regional Health - Southeast Asia, Volume 33, 100534 link