How to interpret “confidence intervals” in observational studies

To ESMD:

True, randomization of treatment is absent in observational studies. But with methods like matching, propensity scores, and weighting you can construct a hypothetical randomized experiment. The challenge for the observational researcher is whether they were successful in creating groups that really are comparable, not just groups that look comparable but actually aren't because of some unmeasured covariate Z. You can defend yourself against that possibility with sensitivity analysis. You should check out Paul Rosenbaum's books.
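For readers less familiar with these methods, here is a minimal sketch of 1:1 nearest-neighbour propensity-score matching on simulated data. It is purely illustrative and not taken from any paper in this thread: the data, the covariate effects, and the use of scikit-learn are all my own assumptions.

```python
# Minimal sketch of 1:1 nearest-neighbour propensity-score matching
# (hypothetical simulated data; illustration only, no caliper, no diagnostics).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 3))                                  # measured covariates
p_treat = 1 / (1 + np.exp(-(-1 + 0.8 * X[:, 0] - 0.5 * X[:, 1])))
t = rng.binomial(1, p_treat)                                  # non-randomized "treatment"
y = 0.3 * t + X[:, 0] + rng.normal(size=n)                    # outcome with true effect 0.3

# Step 1: estimate the propensity score from the measured covariates only.
ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

# Step 2: greedy 1:1 matching of each treated unit to the nearest unused control.
treated = np.where(t == 1)[0]
controls = list(np.where(t == 0)[0])
pairs = []
for i in treated:
    j = min(controls, key=lambda c: abs(ps[c] - ps[i]))
    pairs.append((i, j))
    controls.remove(j)

# Step 3: estimate the treatment effect within matched pairs.
diffs = [y[i] - y[j] for i, j in pairs]
print(f"matched-pair estimate: {np.mean(diffs):.3f}")
```

Even a perfect match on the estimated score only balances what was measured; a matched pair can still differ on an unmeasured Z, which is exactly what Rosenbaum's sensitivity analysis is designed to probe.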

I tried to look for papers where authors specifically address this topic but I couldn’t find any even with the use of LLMs.

Dylan- thanks for participating in the thread. I have pleaded ignorance relative to experts on this site, but not complete ignorance of epi and stats. Research methods have been a personal interest since my medical training in the early 1990s (the “dawn of EBM”). I worked part-time for a decade in a regulatory agency job that required very frequent medical causality assessment and detailed reviews of observational evidence, including assessments of how/when/whether it should affect clinical practice. Finally, I’ve been practising medicine for 25 years, referring daily to various types of medical evidence in making therapeutic decisions. So I’m quite aware of observational research methods that are available. Again though, these methods are not the topic of this thread.

Suggestion: If you’re a grad student, question what you’re being taught. As is true in any field that deals with complex subject matter, not everything you’re taught is still going to be considered good practice by the later stages of your career. Ask yourself why you’re having so much trouble finding a publication, written by a statistical expert, which explicitly defends the use of confidence intervals in observational studies…To be clear- I’m not talking about articles that advise constructing confidence intervals for observational studies, but rather a detailed statistical defence of the practice…

3 Likes

To ESMD:

Thank you for the background information. If the ideas I have above are controversial or wrong, you may be happy to hear that none of this is being taught at my school. I don't think most schools have a class on Design of Observational Studies, only Design of Experiments. Most of what I wrote above is borrowed from material I've read on my own. I'm having trouble finding publications for either side, actually. I believe there is one paper where Sander Greenland talks about this very briefly; that's it. It could be the case that papers exist and I'm just unable to find them with the tools I have, or with search criteria that miss them.

Actually, Dr. Greenland seems to have some pretty clear opinions on this topic- see his article cited in post #132 above. So we have AT LEAST Drs. Greenland, Harrell, Tong, and Feinstein (whose text is cited above), and other experts whose papers are cited in earlier posts in this thread, who all seem to have major concerns with the use, and therefore interpretation, of confidence intervals in observational studies. And yet their use continues unchecked to this day. A bit unnerving- no?

1 Like

To ESMD:

I'll read that paper, but I was looking more for papers where that was the paper's main aim. I need to convince myself that the concerns are valid; otherwise, I'd have a new opinion every day.

I think we can agree that statisticians have improved methods of learning from data on the Design of Experiments side, including the work of R.A. Fisher, Austin Bradford Hill, and others, including the stuff about propensity scores etc that you mention.

However, this thread is about the statistical inference side, which is the part that enables the short-circuiting. As you know from your reading, Freedman and Box repeatedly emphasized the need for “shoe leather” (Freedman), an “iterative learning cycle” (Box), or generally the process of building evidence with “many sets of data” (Ehrenberg) along different lines of evidence (“triangulation”, per Munafo & Davey Smith) and avoiding “The Cult of the Isolated Study” (Nelder). I discuss all of these and more in a paper cited in Post 34 above; see especially Sec. 4 of my paper. Much of the history of science follows this path, especially prior to 1925. However, the use of UIs and p-values has allowed many to “interpret” the findings of a single study as if we are ready to use the findings for clinical decision making. It’s even worse when the statistical inference is calculated for an observational study (for reasons explored in this thread), especially as many lack the careful design features that you mentioned. (I attempted to dissect one such study on PubPeer: PubPeer - Effect of environmental pollutants PM-2.5, carbon monoxide,... )

Statisticians are not generally taught any history of science, and thus lack perspective on how science is (or was) done in practice. I would argue that philosophy has had a negative influence on statistics, because the “hypothetico-deductive” approach of the philosophers has led to an obsession with null hypothesis testing in statistics, which has in turn distorted scientific publishing standards for everyone else. See Dr. Greenland’s comments in https://doi.org/10.1093/aje/kwx259

I would further recommend looking at the papers by Nelder (esp. his Sec. 4) and Ehrenberg that I cited in Sec. 4 of my paper. Sorry to keep adding to your pile of reading, but these are subtle issues and the people I cite have explained them more eloquently than I have.

3 Likes

Some quotes from Freedman and Box that you may have seen before. Boldface is my emphasis.

“Statisticians generally prefer to make causal inferences from randomized controlled experiments, using the techniques developed by [Ronald A.] Fisher and [Jerzy] Neyman. In many situations, of course, experiments are impractical or unethical. Most of what we know about causation in such contexts is derived from observational studies. Sometimes, these are analyzed by regression models; sometimes, these are treated as natural experiments, perhaps after conditioning on covariates. Delicate judgments are required to assess the probable impact of confounders (measured and unmeasured), other sources of bias, and the adequacy of the statistical models used to make adjustments. There is much room for error in this enterprise, and much room for legitimate disagreement.

”[John] Snow’s work on cholera, among other examples, shows that sound causal inferences can be drawn from nonexperimental data. On the one hand, no mechanical rules can be laid down for making such inferences. Since [David] Hume’s day, that is almost a truism. On the other hand, an enormous investment of skill, intelligence and hard work seems to be a requirement. Many convergent lines of evidence must be developed. Natural variation needs to be identified and exploited. Data must be collected. Confounders need to be considered. Alternative explanations have to be exhaustively tested. Above all, the right question needs to be framed.

“Naturally, there is a strong desire to substitute intellectual capital for labor. That is why investigators often try to base causal inference on statistical models. With this approach, P-values play a crucial role. The technology is relatively easy to use and promises to open a wide variety of questions to the research effort. However, the appearance of methodological rigor can be deceptive. Like confidence intervals, P-values generally deal with the problem of sampling error not the problem of bias. Even with sampling error, artifactual results are likely if there is any kind of search over possible specifications for a model, or different definitions of exposure and disease. Models may be used in efforts to adjust for confounding and other sources of bias, but many somewhat arbitrary choices are made. Which variables to enter in the equation? What functional form to use? What assumptions to make about error terms? These choices are seldom dictated either by data or prior scientific knowledge. That is why judgment is so critical, the opportunity for error so large and the number of successful applications so limited.”

Source: Excerpt from the final (“Summary and Conclusions”) section of David Freedman (1999), “From Association to Causation: Some Remarks on the History of Statistics”, Statistical Science, 14 (3): 243-258.

“Statistics has no reason for existence except as a catalyst for scientific enquiry in which only the last stage, when all the creative work has already been done, is concerned with a final fixed model and a rigorous test of conclusions. The main part of such an investigation involves an inductive-deductive iteration with input coming from the subject-matter specialist at every stage. This requires a continuously developing model in which the identity of the measured responses, the factors considered, the structure of the mathematical model, the number and nature of its parameters and even the objective of the study change. With its present access to enormous computer power and provocative and thought-provoking graphical display, modern statistics could make enormous contributions to this – the main body of scientific endeavour. But most of the time it does not.”

Source: Excerpt from G. E. P. Box’s Discussion of David Draper (1995), “Assessment and Propagation of Model Uncertainty” (with discussion), Journal of the Royal Statistical Society, Series B, 57 (1): 45–97.

“Statistics is, or should be, about scientific investigation and how to do it better, but many statisticians believe it is a branch of mathematics….

“So I think it ludicrous to suppose that anyone who has no experience of real scientific inquiry is qualified to teach or research statistics. Unhappily, many of the people in our most prestigious universities who are teaching future statisticians and conducting research in statistics are precisely in this category….”

“When I talk to engineers and physical scientists whom I am hoping to persuade to give statistical methods a chance, I have come to dread the comment ‘Yes, I once took a course in statistics,’ because I know it usually means that instead of starting from scratch, I now must start with a severe handicap….”

“Much philosophical argumentation about the nature of statistical inference is, I believe, irrelevant because it contains the hidden but profound assumption of a one-shot approach, in spite of the fact that the majority of scientific investigations follow an iterative and adaptive sequence. Since the inductive-deductive iteration which is scientific method cannot be readily fitted into a purely mathematical model, we too often agree to concentrate on the deductive bit we can study mathematically and pretend that the rest does not exist. This cuts the investigatory process in two and kills it. Students absorb the impression that models are true and data are wrong instead of the other way around. The assumption of normality is stressed but the devastating effect of dependence in space and time in our essentially nonstationary world is not. The dominant role of the design of experiments and surveys, as compared with their analysis, is not understood…”

“It seems a pity that while we statisticians have an opportunity to rate as first-class scientists we should settle for the rather dreary role of second-class mathematicians. So I am not surprised by the situation we are in. I believe scientists and engineers have little time for statistics because they judge much of it to be irrelevant to what they are doing, and they are right.”

Source: Excerpts from G. E. P. Box’s Commentary on A. Bruce Hoadley and J. R. Kettenring (1990), “Communications Between Statisticians and Engineers/Physical Scientists” (with discussion), Technometrics, 32 (3): 243-274.

3 Likes

Erin, the problem is that the term randomness is very vague the way you are using it. You need to be more precise. If you mean “random sampling” from a target population of interest or “random allocation” of interventions, then neither is mandatory for interpretation of the UI in an analytical study. There is no controversy from anyone on this thread, neither Chris, nor Frank, nor anyone else. What they are saying is that there is a huge tendency for misinterpretation, and there is also the problem of the statistical model being wrong, so we have to be careful in interpretations and perhaps use analyses that make assumptions more transparent.

1 Like

As far as I can tell, they are mandatory…

As noted by Dr. Greenland in the randomization vs random sampling thread linked in the first post:

  1. The random-sampling and randomization models are isomorphic (we can translate from one to the other) as can be seen by considering finite-population sampling (which is all we ever really do): Random sampling is random allocation to be in or out of the sample; randomization is random sampling from the total selected experimental group to determine those in the group who will receive treatment. Sampling methods thus immediately translate into allocation methods and vice-versa, although some methods may look odd or infeasible after translation (which may explain why the isomorphism is often overlooked).

Where is the “isomorphism” between random sampling and observational studies?

This is a fascinating interpretation of this thread. There is, indeed, very little difference between the positions of Drs. Harrell and Tong. But your position is very different from theirs and from the authors cited in multiple posts. If you can cite another expert (besides yourself) who explicitly and substantively defends the rationale behind extrapolation of confidence intervals to observational research, please provide it and put me out of my misery.

1 Like

To oversimplify, an undesigned study is one in which data have already been collected before the statistical analysis plan is written. A designed study must have these attributes:

  1. a protocol has been written and the protocol dictates how and when data are collected
  2. the protocol especially lays out how outcomes are collected/measured
  3. many disinterested clinical experts have been polled to determine their list of variables used in clinical decisions (treatment selection) and all of these variables are collected reliably with minimal missing data, and adjusted for in any OS treatment comparison
  4. outcomes are assessed by persons who are blinded to treatment choice

This is a fantastic analogy. My personal take on the situation is that the statistical community should be deeply embarrassed by what we have done in the past 100 years, starting perhaps when Fisher condemned Bayes. We have been unable to get out of this hole and have instead clung to the status quo, with major assistance from statistical program curriculum designers. The medical equivalent would be continuing to use a 100 year old treatment for a condition in which newer treatments are known to work much better. Then there is the reward system that benefits sloppy observational researchers who grab headlines with “findings”.

2 Likes

This way of thinking (exclusive of the sensitivity analysis part) is, in my humble opinion, causing a lot of misunderstanding and false comfort in OS. The problem with sensitivity analysis is that so few epidemiologists actually do it. I have pleaded with my own local colleagues to do this for decades with no success. It might cut into their ability to report that some food consumption prevents some disease.

To deal with one part of the problem, it is misleading to call propensity score analysis a causal inference method. It is nothing of the sort. It is a “confounder focusing method”, i.e., a data reduction technique that merely projects a series of measured potential confounders into a single number, which then misleads researchers into ignoring within-group outcome heterogeneity so their final odds ratios are attenuated towards 1.0.
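To make the “projects into a single number” point concrete, here is a tiny hypothetical sketch (simulated data, scikit-learn assumed, all names made up): twenty measured covariates are compressed into one scalar per subject, and everything downstream conditions on that scalar alone.

```python
# Hypothetical sketch: a propensity score compresses many measured covariates
# into a single scalar per subject (simulated data; illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))                     # 20 measured potential confounders
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # non-randomized treatment

ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
print(X.shape, "->", ps.shape)                      # (1000, 20) -> (1000,)
# All subsequent matching/stratification conditions on this one number;
# any unmeasured confounder is, by construction, not in X at all.
```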

4 Likes

The cold, hard truth is that clinicians have been “onto” the sloppiness in much (but not all) modern observational research (especially retrospective/database studies) for a very long time. Constant, loud assertions by researchers of “major discoveries” from this type of data are completely implausible, so we tuned them out long ago. So, if clinicians, with our (mostly) limited understanding of research methods, have already independently decided to disregard large swathes of the medical literature, what impact can those studies possibly have, other than personal gain for the researchers involved?

If everyone is happy with personal gain, rather than scientific advance, as the main contribution of much observational research to clinical medicine, then, by all means, carry on… Otherwise, it’s time to roll up some sleeves, start learning and teaching some new methods of study design and analysis, launch some serious cross-disciplinary collaboration, and delete the number for the university press office.

2 Likes

To ChristopherTong:

I appreciate how polite you have been to me. I think we could chat somewhere else about some of this stuff.

I actually did think you were talking about statistical inference, and that is exactly what I’m talking about as well. I do not agree that it is enabling the short-circuiting. I will read your paper after reading the other one you sent me. Those cases you mention, where people use inferential statistics the way you would a math proof, are wrong, and I can assure you we are not taught this at the undergrad or grad level. Though, in how inferential statistics is taught, the history and philosophy are absent, so many young statisticians will need to learn that on their own, as I have. The cases you bring up are of untrained individuals. If you notice many people using a hammer wrong, do we blame the hammer? I knew many students in my undergrad who hadn’t taken any analysis classes and had only a very poor understanding of what an integral was. Do we stop teaching integrals? They could compute an integral, of course, but the interpretation and understanding were not there. They needed the historical and deeper understanding that is not offered in our calculus series, only in real analysis.

My school at least is a more applied school. It might just not fit into the curriculum. I wouldn’t be opposed to it personally, as I already read about it on my own. Which philosophers are you thinking of when you say this? I actually don’t know too many that like NHST, though I have seen some defenses of it. But those are mainly defenses against bad arguments, not so much advocacy for it. There is a great paper, in my opinion, that has a defense of NHST, written as a response to Jacob Cohen’s famous paper (“In Praise of the Null Hypothesis Statistical Test”).

Of all the people you’ve mentioned, I’ve probably read more Sander Greenland papers (there are so many!). I’ve read this one already.

You’re fine. I appreciate the links. I think we could discuss this stuff in a different thread though.

1 Like

To Frank:

I’ll have to think about that then. I don’t know how common item 4 is in the papers I’m thinking of. Do you have an example of an OS that satisfied all four? You said “must,” so I assume these are all required.

I’m just getting used to commenting more on your website here. Sorry if anything has been hard to follow because I’m not adding in the block quotes.

What about what I said is a misunderstanding of OSs? I would hope this isn’t the case, because a lot of what I’m saying comes from a book that is literally about the design of OSs, by probably one of the biggest authors in that domain. I don’t know how you can say something like “the problem with hammers is that no one uses them.” The papers I’ve seen use them very often, though at times the analyses are tucked away in the supplemental data section (because of journal word count requirements). Most of the papers I’m thinking about are nutritional epi studies where they track the exposure with FFQs. So that’s interesting that you have that experience.

I don’t know where I called propensity scores a causal inference method. It’s a tool that can be used for that, though.
Propensity scores do not project away information in a way that necessarily biases toward the null (are you talking about when a match is not found? There are many matching techniques). The balancing property guarantees that, within strata of the propensity score, the distribution of measured covariates is the same between treatment groups. If implemented properly (with good overlap, diagnostics, and sensitivity analysis), treatment effect estimates are not automatically attenuated toward 1.0. Attenuation can happen if the model is misspecified or if poor overlap exists, but that is not inherent to the method. You can estimate average treatment effects, conditional treatment effects, or even explore heterogeneity after matching or weighting. Propensity scores do not force you to ignore outcome heterogeneity; they simply address imbalance in treatment assignment due to observed covariates.
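For concreteness, here is a minimal sketch of the kind of balance diagnostic referred to above: standardized mean differences for each measured covariate, crude and within propensity-score quintiles. The data are simulated and the 0.1 threshold is just a common rule of thumb, not something claimed in this thread.

```python
# Hedged sketch of a propensity-score balance diagnostic on simulated data:
# standardized mean differences (SMDs), overall and within PS quintiles.
# An absolute SMD below ~0.1 is often taken as acceptable balance.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 4))                                   # measured covariates
t = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))))
ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

def smd(x, t):
    """Standardized mean difference between treated and control for one covariate."""
    x1, x0 = x[t == 1], x[t == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

quintile = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))
for k in range(X.shape[1]):
    crude = smd(X[:, k], t)
    within = np.mean([smd(X[quintile == q, k], t[quintile == q]) for q in range(5)])
    print(f"covariate {k}: crude SMD = {crude:+.2f}, mean within-quintile SMD = {within:+.2f}")
```

Of course, such a check can only be run on measured covariates, which is precisely where the disagreement in the surrounding posts lies.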

I appreciate the engagement here.

First of all, average treatment effects are not very relevant, as I’ve discussed in multiple blog articles on fharrell.com. The key word in your discussion of propensity scores is measured. It’s not a causal inference tool, because it is only causal inference if the set of measured confounders equals the set of relevant confounders, which we have no way of knowing is the case when treatment is not randomized.

The reason I hopped on about misunderstandings is that the methods you listed are too quickly put to use, which leads researchers to be too quick in concluding that they do the job. Many, many OS researchers are too quick to assume they have created comparable groups, temporarily forgetting the omitted confounders. The net result of all this is that analytical methods have been de facto window dressing to change the subject away from what’s needed for causal inference. And PS leads researchers to completely ignore outcome heterogeneity (not that PS forces this). I’ve seen this countless times.

All the examples I’ve seen are purposeful, tightly designed, protocol-driven prospective patient registries for specific diseases or treatments. For example, in cardiology there are large registries centered around coronary artery disease interventions. These registries have two things in common: they are well designed before any data collection begins, and they are not cheap. Attempts to do things with “free” data are at the root of a lot of BS going on today in OS.

2 Likes

I don’t recognize this definition of causal inference tools. Where are you pulling it from?

To me that doesn’t make sense if they understood what PSs are. Nothing inherent to them, as I understand them, would lead you to forget about the possibility of an unmeasured confounder. There’s a great section in Paul Rosenbaum’s book where he talks about the idea of people who look comparable but may differ, which is where he introduces sensitivity analysis. If you haven’t read his book yet, I highly recommend it. He just published a new book, actually, which I have not read yet. I think the authors you’re talking about should read the book. It’s worth the money.
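For readers who haven’t seen Rosenbaum’s approach, here is a minimal sketch of a sensitivity analysis for matched pairs with a binary outcome (the counts are made up for illustration). The parameter Γ bounds how much an unmeasured covariate could shift the within-pair odds of receiving treatment; Γ = 1 reduces to the ordinary McNemar test.

```python
# Minimal sketch of a Rosenbaum-style sensitivity analysis for matched pairs
# with a binary outcome (hypothetical counts; illustration only).
# Under sensitivity parameter gamma, the worst-case probability that the
# treated member of a discordant pair is the one with the event is gamma/(1+gamma).
from scipy.stats import binom

discordant = 100       # pairs in which exactly one member had the event
treated_worse = 65     # discordant pairs where the treated member had the event

for gamma in (1.0, 1.5, 2.0, 3.0):
    p_upper = gamma / (1 + gamma)                              # worst-case per-pair probability
    p_value_bound = binom.sf(treated_worse - 1, discordant, p_upper)   # P(X >= treated_worse)
    print(f"Gamma = {gamma:.1f}: upper bound on one-sided p-value = {p_value_bound:.4f}")
```

The analysis answers how large Γ would have to be before an unmeasured covariate like Z could plausibly explain away the association; it does not tell you whether such a covariate exists.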

Do you have a specific paper in mind? I’ve got a collection of Observational studies that I think were worth the money spent so I would appreciate more.

This, again, as I have said many times in this thread, is a misunderstanding in translation. Sander Greenland is pointing out that random sampling and randomization are essentially two versions of the same process: both involve making a random decision about who ends up where. In random sampling, individuals are randomly chosen to be included or excluded from the study, while in randomization, individuals already in the study are randomly assigned to treatment or control. In other words, sampling is like random allocation into the study or not, and randomization is like random allocation into treatment or control within the study. The two are mathematically equivalent, but in practice they are not interchangeable, because they serve completely different purposes: random sampling primarily targets external validity by making the study sample representative of the broader population, whereas randomization primarily targets internal validity by ensuring fair comparisons between treatment groups within the study.
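The equivalence can be seen in a few lines of simulation (purely illustrative): drawing a simple random sample of n units from N is the same random mechanism as randomly allocating n of the N units to treatment; only the meaning attached to the resulting 0/1 indicator differs.

```python
# Illustrative sketch of the sampling/allocation isomorphism described above:
# choosing a random subset of size n out of N units is the same random mechanism
# whether we call the chosen units "sampled" or "assigned to treatment".
import random

N, n = 10, 4
units = list(range(N))

random.seed(0)
sampled = set(random.sample(units, n))           # "random sampling": in vs. out of the sample
in_sample = [1 if u in sampled else 0 for u in units]

random.seed(0)
treated = set(random.sample(units, n))           # "randomization": treated vs. control
assignment = [1 if u in treated else 0 for u in units]

print(in_sample)
print(assignment)                                 # identical indicator, different labels
assert in_sample == assignment
```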

1 Like

No one you mention has claimed in this thread that random sampling is required to interpret a UI, regardless of whether this is observational (analytical) or experimental research. In addition, no one you mention has claimed that random sampling equals random allocation. We only disagree on what to do to avoid UI misinterpretation or misuse, e.g., when the statistical model has serious violations yet one misuses the UI by thinking that it is still interpretable. There are many suggestions for dealing with such misuse in this thread.

1 Like