I don’t think you and I are going to reach consensus here, Suhail, so I don’t really have anything else to add. You can have the last word if you want. Only one of us has a (very large) horse in this race as a producer of research… I’ll wait for another expert to state explicitly that he/she agrees with your position; let’s see what happens over the next few days/weeks. You seem to feel that agreement with your position already exists within this thread, but, unless my reading comprehension is abysmal, it clearly doesn’t. I trust that Drs. Harrell and Tong will correct me if I’ve misrepresented their positions or the positions of authors I’ve cited (and I will accept their corrections readily).
If it’s achieved nothing else, this thread has certainly thrown into sharp relief the reason why the presentation of clinical observational research findings hasn’t changed for 40 years.
Which one is more convincing to you? I would imagine that just because three people agree with a proof, that wouldn’t be enough on its own for you to believe the proof is correct. Even if you don’t have the domain knowledge (I assume you do) to register strong agreement or disagreement, I’m sure you have your own sense of which one is more convincing.
Not sure what you’re driving at here. I think I’ve been very clear on which position I feel is most defensible. I started this thread in the hope that someone would provide a convincing argument that confidence intervals are somehow interpretable for observational research. I have not yet seen a convincing argument.
I’m responding to you saying “…I’ll wait for another expert to state explicitly that he/she agrees with your position- let’s see what happens over the next few days/weeks. You seem to feel that agreement with your position already exists within this thread, but, unless my reading comprehension is abysmal, it clearly doesn’t.” I took that to mean you’re making the decision based on the consensus in the thread. I apologize if I misunderstood you. If I had to guess, I think your view is that inferential statistics like CIs can’t be justified without random sampling / randomization of treatment.
I take it that what I and others have said are not convincing.
Dylan- this is an argument for well-established experts to have among themselves, not a grad student and a physician (neither of us being experts). Given the incredible complexity of statistics, it’s not realistic to expect that either of us will come up with any startling insights here. The experts on this forum, and whose papers have been cited, have decades of applied methods and teaching experience. But they don’t always agree with each other- and it’s the implications of their deepest disagreements that give me such pause.
As a non-expert whose practice nonetheless depends heavily on clinical research, I try to avoid arguing the nuts/bolts of these issues with experts. Instead, I stick to asking methods questions that have been nagging at me for a long time and/or for which I perceive profound disagreement among experts- disagreements which can profoundly affect clinical practice. It’s important for clinicians to highlight to research producers the real-world ramifications of the things they do and their inability to reach consensus (an angle that many researchers clearly don’t consider with enough care).
On one side in this thread, we have an expert who produces observational research for clinical consumption (and presumably teaches observational research methods). On the other side, we have experts whose practice is not particularly invested in perpetuation of the status quo with regard to how observational research is presented. The former is pushing back (as expected) against the need for, and/or urgency of, significant change. The latter feel that change is long overdue.
I might be a bit naive/new compared to others, but I’ve done a lot of reading on statistical inference, design of experiments, and design of observational studies, from both the philosophical and the mathematical points of view. While maybe I can’t sit at the adult table yet, I think I can still talk to the adult table from the kid table. Whether one of the adults chooses to engage is up to them.
Even without a Masters in Mathematics, if an expert says A is B but my reading of textbooks says otherwise, I think I could make some noise. The question for the non-expert is whether they really can audit the claim of the expert.
I think engaging here is good for my personal growth. If the experts have the time to correct me, I’m happy to have that experience. I just need to be convinced.
Agree completely. We’re in the same boat. I’m glad you’re here. And, even though Suhail and I aren’t seeing eye to eye on this issue, I’m also very grateful for his engagement.
I am going to say this one last time with emphasis and hope that this time my meaning will be clear. There is a need for urgent change in the interpretation of the UI in OS, but not for the reasons you have given: lack of random sampling and/or randomization. Rather, it is for the reasons the experts have given. Let us be very clear: the interval in an analytical study (observational or experimental) can be interpreted without either of these two “requirements” you have insisted upon, provided the method of estimation (derived from the statistical model and its assumptions) has been judged to have a reasonable potential for comparability of the groups except for the comparison of interest. There is no controversy on this, never has been, and no vested interests are at play here.
Finally, I am not talking about descriptive studies here, where external validity fails without random sampling; in medicine we usually deal with outcomes of illness and the reasons for them.
Chapter 10 of Modern Epidemiology (3rd ed., Wolters Kluwer, 2008) by Rothman, Greenland, and Lash states (p. 149) that
As a result of these extra sources of variation, and because of the weak theoretical underpinnings for conceptualizing study subjects as a sample of a broader experience, the usual statistical tools that we use to measure random variation at best provide minimum estimates of the actual uncertainty we should have about the object of estimation (Greenland, 1990, 2005b). One way to improve the quantification of our uncertainty is through bias analysis, which we discuss in Chapter 19.
The first seems to be a paper entirely dedicated to the topic of this thread. The second seems to be related to the approach described by @f2harrell and related methods.
I skimmed the first paper from 1990. Greenland seems to make only one exception for a probabilistic interpretation of inferential statistics in observational studies: for “natural experiments” which occur when “nature or circumstance resulted in what is essentially random allocation and we knew this was so,” though such situations “are rare”.
Excerpts from the Conclusion of the paper:
Inferential statistics, such as P values, confidence intervals, and likelihood ratios, have very limited meaning in causal analysis when the mechanism of exposure assignment is largely unknown or is known to be nonrandom.
In causal analysis of observational data, valid use of inferential statistics as measures of compatibility, conflict, or support depends crucially on randomization assumptions about which we are at best agnostic and more usually doubtful. Among the possible remedies are: (a) Restrain our interpretation of classical statistics by explicating and criticizing any randomization assumptions that are necessary for probabilistic interpretations; (b) train our students and retrain ourselves to focus on nonprobabilistic interpretations of inferential statistics; (c) deemphasize inferential statistics in favor of pure data description, such as graphs and tables; (d) expand our analytic repertoire to include more elaborate techniques that depend on assumptions in the “agnostic” rather than the “doubtful” realm, and subject the results of these techniques to influence and sensitivity analysis.
In light of Greenland’s 1990 paper just alluded to, strictly speaking the use of a random process in the study design is not mandatory, as he made an exception for “natural experiments”. (One of the data sets used by John Snow in his cholera investigation was famously a “natural experiment”…note that he did not use any p-values or confidence intervals anyway.) Also in my reply to Xie I gave another example of using a random model without a priori justification, but it was validated empirically post hoc (it was an example from physics, not statistics, so arguably not pertinent to this thread).
Other than such exceptions, I would agree with @ESMD that at least for a frequentist, random sampling or random assignment are mandatory for UIs to have the interpretation intended by frequentist statistical theory. Lacking such an interpretation, the charitable approach suggested by Rothman et al is that the UIs are just lower bounds on uncertainty, useful only as a crude guide rather than a rigorous attempt to evaluate uncertainty.
As this is mostly off topic I will keep my comments limited.
The cases you bring up are of untrained individuals.
These are the patients who see a press release and consider changing their health behavior; but the press releases are happening because investigators don’t sufficiently emphasize the caveats and limitations of their work. This is the short circuiting. Greenland said it was happening back in 1990 (see first paragraph of his Conclusion) and it’s still happening.
If you notice many people using a hammer wrong, do we blame the hammer?
In post 48, @s_doi made a similar argument; you can see my response in Post 56.
Please note I never said the statistics course should be eliminated. I said that a study of the history of science would be more effective in teaching how to reason with data (post 158). Armed with this knowledge, students could then approach their stats curriculum with a more skeptical and cautious mindset.
Thank you for going beyond your formal education and reading on your own. Greenland and Freedman are excellent authors to read. Greenland has some good lectures on YouTube as well.
There are two principal requirements of observational research that are needed in order to establish an association: random sampling and unbiasedness. These are roughly equally important. Without random sampling (except in special cases such as certain natural experiments) there is no basis for computing or interpreting a frequentist confidence interval or p-value, and without unbiasedness no trust should be placed in frequentist inferential quantities, with or without random sampling. Bias is caused by unmeasured confounders, biased measurements, etc.
Without random sampling, a non-random-allocation study is essentially a one-off unrepeatable experiment that serves as a case study, i.e., as an example, but not as a conclusion-drawing technique. You might have as a valid conclusion “for the persons in this cohort, those who magically got treatment B were observed to live longer than those who magically didn’t get treatment B”.
We do compute CIs all the time but need to develop an honest interpretation, along these lines: “Were the inception cohort to be exactly recreatable and assuming the effect estimates are unbiased, the interval just quoted could have been interpreted as a confidence interval. As it stands, though, it should be interpreted as an ad hoc uncertainty interval with no known interpretation other than providing the interval that is compatible with effects that would not be rejected by an equally ad hoc statistical test. It is best for readers to assume that confidence intervals presented here underestimate the amount of uncertainty actually present because they assume effect estimates are unbiased and that the data model is correctly specified.”
In an RCT, random allocation makes inference about differences possible, but does not make confidence intervals for one treatment group valid.
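The sampling point above can be made concrete with a toy simulation (my own illustration, not from any post in this thread; all numbers are invented). It contrasts the actual coverage of a nominal 95% interval for a population mean under genuine random sampling versus a self-selected sample:

```python
# Sketch, under assumed toy numbers: does a nominal 95% CI for the mean
# actually cover the true population mean 95% of the time? We check under
# (a) random sampling and (b) a self-selection mechanism that favors
# high values, mimicking a non-random observational sample.
import random
import statistics

random.seed(1)
POP_MEAN, POP_SD, N, RUNS = 50.0, 10.0, 100, 2000

def ci_covers(sample, true_mean, z=1.96):
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - z * se <= true_mean <= m + z * se

def coverage(draw):
    hits = sum(ci_covers([draw() for _ in range(N)], POP_MEAN)
               for _ in range(RUNS))
    return hits / RUNS

def random_draw():
    return random.gauss(POP_MEAN, POP_SD)

def biased_draw():
    # self-selection: the higher the value, the likelier it enters the sample
    while True:
        x = random.gauss(POP_MEAN, POP_SD)
        accept = min(1.0, max(0.05, (x - POP_MEAN + 20) / 40))
        if random.random() < accept:
            return x

cov_random = coverage(random_draw)
cov_biased = coverage(biased_draw)
print(cov_random)   # close to the nominal 0.95
print(cov_biased)   # far below 0.95
```

Under the random draw the interval earns its advertised 95% coverage; under the biased draw the same arithmetic produces an interval whose coverage statement is simply false. That gap is the sense in which the frequentist interpretation leans on the sampling mechanism rather than on the formula.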
I think I’m a bit confused by what you wrote here. To make it less confusing for me: do you agree that CIs are justified in RCTs because randomization of treatment assignment is deployed?
If random sampling or randomization of treatment is necessary for Frequentist stat theory, could you explain what you think they do that is necessary for Freq stat theory?
Going back to what I have been saying, OS methods (like propensity scores) can, in the absence of randomization/random sampling, be used to support the plausibility of model assumptions, which I think is what Greenland is worried about.
Also, while p-values may have existed when John Snow was around, I really doubt he would have known about them. He also predates CIs.
CIs for differences between allocated groups is supported by randomization. CIs for responses from one of the treatment groups is not supported by randomization.
Just considering sampling, there is a relationship between having a legitimate sampling model and having a target of inference. With many OSs one cannot even define a target population.
This is not the case. It’s just a game that people play, especially non-top-tier scientists. The methods you’ve mentioned don’t make assumptions more plausible.
I am surprised. Sander Greenland seems to have no problem using an interval from an observational study as an example for his paper Semantic and cognitive tools to aid statistical science saying that “We will display some of these problems and recommendations with published results from a record-based cohort study of serotonergic antidepressant prescriptions during pregnancy and subsequent autism spectrum disorder (ASD) of the child”
The interval is nothing more than the range of test hypotheses that would not make the data look too unusual in repeated samples …
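That inversion reading can be shown mechanically: scan candidate null values and keep those that a level-0.05 test would fail to reject. The sketch below uses invented data and a z-test for brevity (a t-test would be the exact choice at this sample size); the grid-inverted interval matches the textbook endpoints:

```python
# Sketch: a 95% CI as "the range of test hypotheses that would not make
# the data look too unusual". Invented data; z-test used for brevity.
import statistics

data = [4.1, 5.0, 3.8, 5.6, 4.4, 4.9, 5.2, 4.0, 4.7, 5.3]
n = len(data)
m = statistics.fmean(data)
se = statistics.stdev(data) / n ** 0.5

# Keep every candidate mean (grid step 0.001) that a two-sided test at
# alpha = 0.05 would not reject.
kept = [mu0 / 1000 for mu0 in range(3000, 7000)
        if abs(m - mu0 / 1000) / se <= 1.96]

print(min(kept), max(kept))          # endpoints of the inverted-test set
print(m - 1.96 * se, m + 1.96 * se)  # textbook formula: same endpoints
```

Nothing in this construction mentions the parameter being "probably" inside the interval; it is purely the set of nulls compatible with the data under the assumed model, which is the reading quoted above.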
“CIs for responses from one of the treatment groups is not supported by randomization.”
Oh, why is that?
“This is not the case. It’s just a game that people play especially the non-top-tier scientists. The methods you’ve mentioned don’t make assumptions more plausible.”
A counterargument to this is PSM for confounding. You make an exception for natural experiments, right? Why is it hard to believe that we could construct an OS in a way that makes those assumptions more plausible (in the absence of randomization / random sampling)?
From Greenland’s 1990 paper cited above, Conclusion:
Randomization provides the key link between inferential statistics and causal parameters. Inferential statistics, such as P values, confidence intervals, and likelihood ratios, have very limited meaning in causal analysis when the mechanism of exposure assignment is largely unknown or is known to be nonrandom. It is my impression that such statistics are often given a weight of authority appropriate only in randomized studies. As an example of this improper weight, my co-authors and I have often presented confidence intervals with a comment that “the data would appear to be compatible with a relative risk ranging from [lower limit] to [upper limit].” Careful consideration of the points raised here might have led us to at least preface such a comment with the phrase “if the exposure had been randomized within levels of the controlled variables…” or else omit commenting on the inferential statistics altogether, since the data were probably compatible with a broader range of values than indicated by the limits.
“CIs for responses from one of the treatment groups is not supported by randomization.”
Oh, why is that?
Random allocation provides internal validity for the comparison between treatment groups. However the enrolled patients are a highly selected convenience sample, thus they cannot be even plausibly modeled as if they were a random sample from a population. There is no external validity whatsoever.
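A toy simulation (my own sketch, with made-up numbers) illustrates this split: patients are a selected slice of the population, but treatment is randomized within the enrolled cohort. The between-group difference is recovered well; the single-group mean is nowhere near representative of the population:

```python
# Sketch, under assumed toy numbers: randomization within a convenience
# sample gives internal validity (the treatment difference) but no
# external validity (the treated-group mean vs. the population).
import random
import statistics

random.seed(2)
TRUE_EFFECT = 2.0
POP_MEAN = 50.0          # population mean outcome without treatment

diffs, group_means = [], []
for _ in range(2000):
    # convenience sample: only patients with high baseline values enroll
    draws = [random.gauss(POP_MEAN, 10) for _ in range(400)]
    cohort = [x for x in draws if x > POP_MEAN][:100]
    random.shuffle(cohort)                       # randomized allocation
    treated = [x + TRUE_EFFECT for x in cohort[:50]]
    control = cohort[50:]
    diffs.append(statistics.fmean(treated) - statistics.fmean(control))
    group_means.append(statistics.fmean(treated))

print(statistics.fmean(diffs))        # near TRUE_EFFECT: internally valid
print(statistics.fmean(group_means))  # well above POP_MEAN + TRUE_EFFECT
```

The difference estimator is centered on the true effect because randomization balances the selected cohort across arms, while the one-group mean inherits the full selection bias, matching the point about CIs for a single treatment group.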
Also, while p-values may have existed when John Snow was around, I really doubt he would have known about them. He also predates CIs.
Correct. And yet his methods were persuasive, without P-values and UIs. He did what modern authors of observational studies (like the one I cited above in PubPeer) failed to do: he put in the work to get more and better data of various kinds. I think what several of us are advocating here (e.g., post 153) is that this hard work needs to be done for all attempts to derive causal inferences from observational studies. As Freedman said (post 187), today many authors substitute “intellectual capital” (regression modeling) for “shoe leather” because it’s cheaper than the hard work, and the journals and grant-making institutions reward it.
If random sampling or randomization of treatment is necessary for Frequentist stat theory, could you explain what you think they do that is necessary for Freq stat theory?
Frank answered this for the sampling situation in post 215. For RCTs, randomization addresses two matters: (1) it manages (but does not guarantee elimination of) confounding, making it less likely that systematic differences between treatment groups other than the intervention are present, and (2) for a frequentist, it makes the application of the mathematics of games of chance (probability modeling) plausible. Without the latter it’s a completely arbitrary procedure with no justification. Please see the 1990 paper by Greenland linked in post 209.
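Point (2) is easiest to see in a randomization (permutation) test, where the null distribution is generated by the physical act of randomization itself rather than by an assumed sampling model. A toy sketch with invented outcome numbers:

```python
# Sketch: with randomized allocation, "the mathematics of games of chance"
# is earned: under the null, every relabeling of patients was equally
# likely, so we can re-run the coin flips to get the null distribution.
# Outcome numbers below are invented for illustration.
import random
import statistics

random.seed(0)
treated = [6.2, 5.9, 7.1, 6.8, 6.4, 7.0]
control = [5.1, 5.8, 5.5, 6.0, 5.3, 5.6]
observed = statistics.fmean(treated) - statistics.fmean(control)

pooled = treated + control
RESHUFFLES = 10000
extreme = 0
for _ in range(RESHUFFLES):
    random.shuffle(pooled)                 # replay the random allocation
    diff = statistics.fmean(pooled[:6]) - statistics.fmean(pooled[6:])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / RESHUFFLES
print(observed, p_value)   # observed difference is rare under re-randomization
```

Without randomized allocation there is no warrant for treating the relabelings as equally likely, and the same computation becomes the "completely arbitrary procedure" described above.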