To Frank:
That feels a bit cryptic to me. The advice is from William G. Cochran and has been around for a long time. If this is your answer, though, I’ll tell you that the advice revolves around randomization and the goal of an experiment: mimic the result of randomizing treatment and try to answer a causal question.
That is a shame, because the many observational studies I read about in, say, nutrition do perform sensitivity analyses. Many of them posit alternative explanations for their results. They note several limitations and mention that we should perform more studies.
That sounds pretty counterintuitive to me. Drop the course that gives young statisticians like myself a formal framework for reasoning with data in favor of a study of the history of science? Not sure about that one. As an enjoyer of history, though, I think supplementing statistics courses with Stephen Stigler’s books is a great idea. I find I remember things more easily when I know a little about the history. A book that Frank recommends, Statistical Rethinking, has some history in it, which I very much appreciated.
Yes, though his Wikipedia page says his PhD was in physics, while his research interests were in biophysics and molecular biology. The NY Times obituary cited there is headlined “Scientific Scholar in Many Disciplines”.
My comment was deliberately counterintuitive. The framework you’re talking about has often been used to short-circuit the actual process of scientific discovery; that is what this entire thread is about. Statisticians of all ages are generally not exposed to the long history of scientific knowledge-building that occurred before Fisher’s 1925 paper, and that still continues in some enlightened fields. If statistics is the “science of learning from data”, statisticians have strangely ignored how scientists have actually learned from data for centuries. A few exceptions include David Freedman and George Box, who were both interested in the history of science, engineering, and medicine (as distinct from the kind of history Stigler wrote about).
Please allow me to suggest some reading that would give you a more eloquent perspective than I can.
D. Freedman, 1999: From association to causation: some remarks on the history of statistics. Statistical Science, 14(3): 243-258. https://www.jstor.org/stable/2676760
G. Gigerenzer and J. Marewski, 2015: Surrogate science: the idol of a universal method for scientific inference. Journal of Management, 41(2): 421-440. https://doi.org/10.1177/0149206314547522
“I once suggested in discussion at a statistical meeting that it might be well if statisticians looked to see how data was actually analyzed by many sorts of people. A very eminent and senior statistician rose at once to say that this was a novel idea, that it might have merit, but that young statisticians should be careful not to indulge in it too much, since it might distort their ideas. The ideas of data analysis ought to survive a look at how data is analyzed….If data analysis is to be well done, much of it must be a matter of judgment, and ‘theory’, whether statistical or non-statistical, will have to guide, not command.”
– John W. Tukey, 1962: The future of data analysis. Annals of Mathematical Statistics, 33: 1-67.
EDIT: The following quotes from Philip B. Stark, distinguished professor of statistics at UC Berkeley, are also relevant to this thread:
“A statistician’s most powerful tool is randomness—real, not supposed.”
“Beware hypothesis tests and confidence intervals in situations with no real randomness.”
I think what Stark has mixed up here is external versus internal validity, terms well described in the epidemiological literature. To decide whether a target of estimation for a comparison corresponds to the true population value of interest (i.e. is on-point or off-point), we need to assess comparative plausibility: whether the comparison between groups is sufficiently free of systematic differences, such as confounding, selection bias, or differential measurement, that the observed difference can reasonably be attributed to the intervention rather than to group differences that introduce bias.

Comparative plausibility is a property of the comparison rather than of the individual groups, and it is very important to note that it does not require each group, or the overall study population, to be representative of the target population. The groups or the overall study population may both be off-point relative to the real-world population yet still yield a plausible comparison if the groups are similarly off-target; this similarity is what we call internal validity in epidemiology.

A very common misconception is that random sampling or random allocation is required for such comparative plausibility, or that the two are the same thing. In reality, random sampling pertains to external validity, i.e. how well the study population represents the broader population, whereas random allocation promotes internal validity by supporting balance between groups. Well-designed observational studies can achieve comparative plausibility without randomization, while poorly executed randomized trials can fail to do so. Note that we are talking here about analytical studies, not descriptive studies, where inference back to a population (whatever that is) is required.
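To make the sampling/allocation distinction concrete, here is a minimal simulation (my own sketch, not from the post above; the scenario and all numbers are invented): random allocation balances the treatment groups on a covariate even when the sample itself is badly unrepresentative of the population.

```python
# A minimal simulation: random *allocation* balances groups on a covariate
# (internal validity) even when the sample is unrepresentative of the
# population (external validity). All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Population: covariate "age".
pop_age = rng.normal(50, 15, size=100_000)

# Convenience sample: volunteers skew older, so the sample is "off-point".
volunteer_prob = 1 / (1 + np.exp(-(pop_age - 60) / 10))
sampled = rng.random(100_000) < volunteer_prob
sample_age = pop_age[sampled]

# Random allocation within the (unrepresentative) sample.
treat = rng.random(sample_age.size) < 0.5

print(f"population mean age: {pop_age.mean():.1f}")
print(f"sample mean age:     {sample_age.mean():.1f}  (external validity problem)")
print(f"treated mean age:    {sample_age[treat].mean():.1f}")
print(f"control mean age:    {sample_age[~treat].mean():.1f}  (groups still balanced)")
```

The treated and control means match each other even though both differ from the population mean, which is exactly the distinction drawn above.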
As I have said repeatedly above, precision information derived from the uncertainty interval (UI) must be interpreted in the context of the comparative plausibility of the study’s target of estimation (i.e. whether the target the study is estimating is aligned with the real-world quantity of interest). I agree fully that in practice we never know whether the target is perfectly on-point or somewhat off-point. What matters is that imprecise intervals rarely change decisions, because the wide range of plausible values already makes the evidence weak. But when an interval is narrow, its clinical impact depends critically on whether the target is comparatively plausible (internally valid): if so, it provides a higher level of evidence because it estimates the right thing; if not, it provides a low level of evidence because it risks giving the false impression of certainty about the wrong thing. For this reason, a single precise interval has conventionally not been treated as decisive for patient care. The proper role of all research evidence is as a contribution to the accumulation of evidence on the question being asked, where multiple studies, examined together, clarify whether the estimate consistently points in one direction before being used to guide clinical decisions.
What’s difficult about understanding that attempting to mimic something doesn’t make it that thing?
It would indeed be interesting to find a literature review that determines the frequency of use of certain methods in nutritional epidemiology. My experience does not match yours. When I look at papers in that field I do see many mentioning alternative explanations, as you said, but I do not see a majority of them do anything about it. Words are pretty cheap.
These are true in general, but the probability of being close to a meaningful truth is strongly related to the design, including blinding. Perhaps more importantly, we speak of observational studies as if they were of one type. There is a massive distinction between a designed prospective OS with protocoled data collection and a retrospective OS, which almost always uses data collected for another purpose (EHR data are collected for billing). Non-designed OS are very, very seldom appropriate for therapeutic comparisons or risk factor determination.
Indeed, that seems to me the crux of the problem with observational studies. Take three time-related biases: immortal time bias, lead time bias, and survivorship bias. They can skew results, as we saw with the WHI studies. I agree with you that a designed prospective OS with protocoled data collection is the gold standard. However, we have masses of accumulated retrospective data, and that is why Hernán and colleagues have suggested the “emulation of a target trial” methodology to try to improve the design of OS built from retrospective data.
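As an aside, immortal time bias is easy to demonstrate with a toy simulation (my own sketch with invented numbers; not from the WHI data or Hernán’s work): a treatment with no effect at all appears strongly protective when “ever treated” status is defined naively.

```python
# Toy simulation of immortal time bias: a treatment with NO effect looks
# protective when "treated" is defined by whether the prescription was ever
# filled, because patients must survive long enough to fill it.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Everyone has the same exponential survival time (treatment truly does nothing).
survival = rng.exponential(scale=365, size=n)   # days
rx_delay = rng.exponential(scale=90, size=n)    # days until prescription filled

# Naive "ever treated" classification: only survivors of the delay get treated.
ever_treated = survival > rx_delay

print(f"median survival, 'treated':   {np.median(survival[ever_treated]):.0f} days")
print(f"median survival, 'untreated': {np.median(survival[~ever_treated]):.0f} days")
# The 'treated' group appears to live far longer purely because the time
# before treatment is immortal: dying in that window puts you in 'untreated'.
```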
In what way is the framework I’m talking about short-circuiting the actual process of scientific discovery? This is getting a bit philosophical, because you’re acting as if there were only one way to do scientific discovery. It is also a bit tangential to the OP, which was about interpreting CIs without the use of randomization.
How are statisticians ignoring how scientists have learned from data for centuries? If anything, statisticians have improved on how scientists learn from data. You can see several accounts of this in Fisher’s Design of Experiments.
I’ve read that paper by Freedman. I own three of Freedman’s books and have read several of his papers, along with several by George Box. I have not read the second paper you linked, so I’ll gladly read it.
I think both scientists and statisticians ought to read more philosophy (no excuse for statisticians, since we are already polymaths by default).
What’s unclear is why you brought up frequentist statistics; I didn’t understand why that was necessary. I was wondering what you took “design an observational study as if it were a randomized experiment” to mean.
Paul Rosenbaum and Donald Rubin don’t make circular arguments in their proofs. No one is claiming that A is B because they say it is so.
What do you mean when you say the authors don’t do anything about it? Like, what are you expecting exactly?
This is misleading in the way it is being used, because it doesn’t help with confounding. The problem started with the choice of words: “emulate” means to copy, and most of the important features of RCTs are not being copied.
@DylanArmbruster
You are ignoring several things I wrote earlier in this thread and I’m not criticizing designing an OS to be like an RCT. I’m criticizing pretending to do so in undesigned studies.
While I strongly agree with @f2harrell in his support for subjective Bayesian foundations and acknowledge that a Bayesian may even have a valid reason to employ frequentist procedures such as randomization, I suspect I’m even more radical.
My complaint is that frequentist statistical practice engages in cognitive misdirection by deluding readers into believing they know the data generation mechanism, when the only one who can truly know it is the experimenter.
The rest of us place trust in the honesty of the report, but the frequentist aversion to epistemic probability does not permit modelling that. So we end up with everyone de facto using “spike” priors on parts of the model of the data generation process, and having false confidence in the results. A formal model of false confidence (which both Bayesian and frequentist methods can fall into) is discussed in detail in a paper from the Royal Society.
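Here is a toy numerical sketch of the spike-prior point (my own illustration, not the construction in that paper; the corruption model and numbers are invented): treating a report as certainly honest is equivalent to a prior probability of zero on corruption, and relaxing that spike even slightly can widen the interval dramatically.

```python
# Trusting a report completely amounts to a prior of P(report corrupted) = 0.
# Allowing even a small corruption probability eps widens the posterior on p.
import numpy as np
from scipy.stats import binom

n, k = 1000, 900                                 # reported: 900 successes in 1000
p_grid = np.linspace(0.001, 0.999, 999)
prior_p = np.ones_like(p_grid) / p_grid.size     # flat prior on p

def posterior(eps):
    """Posterior on p with prior probability eps that the report is uninformative."""
    like_honest = binom.pmf(k, n, p_grid)
    like_corrupt = np.full_like(p_grid, 1.0 / (n + 1))  # a fabricated k tells us nothing
    like = (1 - eps) * like_honest + eps * like_corrupt
    post = like * prior_p
    return post / post.sum()

for eps in (0.0, 0.05):
    post = posterior(eps)
    cdf = np.cumsum(post)
    lo, hi = p_grid[np.searchsorted(cdf, [0.025, 0.975])]
    print(f"eps={eps:.2f}: 95% credible interval for p = ({lo:.3f}, {hi:.3f})")
```

With eps = 0 the interval is reassuringly narrow; with eps = 0.05 it stretches far to the left, reflecting that the narrowness was bought entirely by the unexamined spike.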
From a computational perspective used in cryptography, this tacit trust placed in reports of randomization procedures done outside of your own control puts the scientific community at risk of being misled by a rogue scientist; I am currently collecting a number of case studies of this.
The Bayesians since Gosset were correct in calling attention to the major limitations of randomization, a point that appears to have been all but forgotten.
Yes, the design does not help with confounding, but in some analytic scenarios we can adjust for confounders drawn from a minimal adjustment set based on a DAG. Hernán and colleagues have no need for that, however, as they use the clone-censor-weight approach, which may have its own problems relating to the choice of grace period and perhaps the computation of the weights.
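For readers unfamiliar with what a minimal adjustment set buys you, here is a minimal sketch (invented numbers, plain OLS rather than the clone-censor-weight machinery, which is more involved): in a DAG with C → T, C → Y, and T → Y, the crude comparison is biased and conditioning on C recovers the effect.

```python
# Adjusting for a confounder C in the DAG  C -> T,  C -> Y,  T -> Y:
# the crude comparison is biased; conditioning on C recovers the true effect.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

c = rng.normal(size=n)                                         # confounder
t = (rng.random(n) < 1 / (1 + np.exp(-2 * c))).astype(float)   # C -> T
y = 1.0 * t + 2.0 * c + rng.normal(size=n)                     # true effect of T on Y is 1.0

# Crude (confounded) estimate: difference in means.
print(f"crude estimate:    {y[t == 1].mean() - y[t == 0].mean():.2f}")

# Adjusted estimate: OLS of Y on T and C ({C} is the minimal adjustment set).
X = np.column_stack([np.ones(n), t, c])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"adjusted estimate: {beta[1]:.2f}")                     # close to the true 1.0
```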
It was in reply to your general defense of Fisher as the be-all and end-all of experimental design. There are crucial distinctions between when randomization is a reasonable method and when it is not, which are now clearer in light of the mathematics of secure communication channels than they might have been when Fisher was alive.
Fisher is not, nor did I claim he was, the be-all and end-all of experimental design, so you’re shadow-boxing. I read modern texts as well, like John Matthews’s book on clinical trials. Modern experimentalists still cite Fisher’s Design of Experiments as a must-read.
I would prefer to stick to the OP’s topic.
To be frank, Suhail, your explanations of internal and external validity just aren’t resonating with me. They don’t address the key problem: namely, that the generation of confidence intervals must hinge on an underlying random process, and there is no randomness anywhere in an observational study. My overwhelming impression at this point is that this is a statistical practice that has been egregiously misapplied for many decades.
As a clinician, not a statistician or epidemiologist, I pleaded ignorance right back at the start of this thread. But your assertion that this is all a “misunderstanding” by expert contributors to the thread doesn’t jibe with the fact that citations from multiple other eminent experts over the past 50 years have clearly flagged the same problem. It’s just that nobody, evidently, listened to them, likely because 1) what they were saying was too threatening to the status quo; and 2) nobody could be bothered to learn or propose another way of doing things…
I imagine that it would be very difficult to acknowledge that the way a profession has been doing things for decades might be fatally flawed. But such admissions are necessary for progress to occur. The ways in which certain medical procedures were performed 30 years ago seem mortifying to today’s students and residents. Medicine advanced and got better over time, and patients are safer for these changes. Somebody, somewhere, seems to have been the first person to mesmerize the medical community into believing that confidence intervals can be applied in settings that involve no hint of randomness. I’m still trying to find out who that was. I feel like I’ve gotten close, several times, in my hunt, but still haven’t found an outright original defence of their use in observational studies. Is it possible that nobody ever explicitly said they could be used for observational studies (?!) The mere thought seems crazy, doesn’t it (?) Is it possible that the multiple “guidances” published in the 1980s, which recommended how statistics should be presented in medical journals, simply sidestepped the issue completely (???) And that observational researchers at that time therefore simply assumed that they could start extrapolating the use of confidence intervals to the observational context in order to get their work published (??) And, since nobody called them out on the practice, it just continued unfettered for the next 40 years (??) If anyone knows of an explicit defence of this practice, please add a link to this thread. Unless a reasonable defence has been published somewhere, I’ll be left with the impression that observational research has even bigger problems than I thought…