How to interpret “confidence intervals” in observational studies

ChristopherTong · September 9, 2025, 5:29pm

Just to clarify, the lab settings where assay validation studies and similar regulated activities occur are supposed to be under audited quality assurance programs. Many research labs do not have (often because it is premature to define) adequate QC procedures, and their own estimates of uncertainty can be far too small. Examples include

https://doi.org/10.1080/00401706.1972.10488878

https://doi.org/10.1038/548387a

Here it is up to the skill and imagination of the experimenter to ensure that all sources of variability (systematic and random) are accounted for. In the above cited cases, it is lab-to-lab variation that revealed there was more uncertainty than an individual lab thought.

ChristopherTong · September 9, 2025, 5:34pm

He was wrong but I’m not sure “sensitivity analysis” gets the credit. This is a peripheral point which doesn’t undermine the basic point that for comparative studies (where he preferred random assignment), Fisher did not make the compromise he was willing to make for random sampling, regarding the use of a random procedure to make all selections/assignments equally likely.

DylanArmbruster · September 9, 2025, 5:43pm

It should get the credit IMO:
https://academic.oup.com/ije/article/38/5/1199/667288

So he got it wrong. Following his heuristic then led to a wrong conclusion. So we can’t be dogmatic here.

f2harrell · September 9, 2025, 8:00pm

Sensitivity analysis involves biased researchers choosing parameter settings to include in the simulation, and biased researchers and reviewers determining which parameter settings to emphasize in interpreting the results. On the other hand with Bayes you pre-specify a prior for a bias term or effect of an extra variable and its association with the exposure, and live with the consequences. Too often researchers rationalize results of sensitivity analysis. But even worse, all too often epidemiologists do not even do sensitivity analyses, especially in nutritional epi.

DylanArmbruster · September 9, 2025, 8:04pm

That is extremely uncharitable and not inherently true.

f2harrell · September 9, 2025, 11:36pm

Don’t leave it at that. Say why. The track record surrounding this in OS is not very impressive.

ChristopherTong · September 10, 2025, 4:53am

I see. Apparently the phrase “sensitivity analysis” is used differently in this literature. When I use this phrase, I mean fitting multiple models to the same data set, each with varying assumptions. The intent is to see how sensitive the model outputs are to changes in these assumptions, and reporting the findings. Cornfield et al (1959) did not do any model fitting of any data in their paper, did they? I’m not sure what “sensitivity analysis” means for you, but I see from the papers in the same special issue where this Greenhouse commentary appears, that it must have a different meaning that is compatible with your interpretation.

Regarding Fisher’s wrongness, it was not due to his heuristic. Cornfield did not deny the possibility of a hypothetical confounding factor being present due to the lack of randomization. Rather he showed that its RR must be at least as high as the one under consideration (lung cancer vs smoking), and none had been found or suggested. It’s not an airtight argument by itself, but in combination with the vast evidence reviewed in the paper, the collective evidence was deemed persuasive. As David Cox’s commentary stated, “the strength of their paper comes in substantial part from the wide variety of the evidence that they discuss, population rates, prospective and retrospective studies and laboratory work.” As Greenhouse noted, they ruled out, one after another, all the proposed competing explanations. This is what David Freedman called “shoe leather”.
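For readers who want the algebra, here is a sketch of the usual modern restatement of that condition, not a transcription of the 1959 appendix. It assumes a single binary confounder, no effect of the exposure itself, and a confounder-outcome risk ratio that is the same in exposed and unexposed; the symbols $\gamma$, $p_1$, $p_0$ are introduced here purely for illustration.

```latex
% gamma: risk ratio of the outcome for confounder present vs absent
% p_1, p_0: prevalence of the confounder among exposed and unexposed
\mathrm{RR}_{\mathrm{obs}}
  = \frac{1 + (\gamma - 1)\,p_1}{1 + (\gamma - 1)\,p_0},
  \qquad \gamma > 1,\quad 0 < p_0 < p_1 \le 1 .

% The numerator is at most gamma and the denominator is at least 1:
\mathrm{RR}_{\mathrm{obs}} \le \gamma .

% Cross-multiplying shows the inequality below is equivalent to p_0 \le p_1:
\mathrm{RR}_{\mathrm{obs}} \le \frac{p_1}{p_0} .
```

So, under these assumptions, a confounder can fully account for an observed risk ratio only if both its association with the outcome and its relative prevalence among the exposed are at least that large.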

I am glad you linked this paper, because Greenhouse cites 2 papers by Cornfield that provide a defense (of sorts) of drawing causal inferences from observational studies. He claims that the difference between RCTs and observational studies is not a difference “in kind” but only in degree. The two arguments he makes seem to be:
(1) Observational studies include natural experiments (he cites John Snow). We already made this point in this thread above when we found Dr. Greenland’s 1990 paper.
(2) Randomization does not guarantee the absence of confounders in RCTs; you need additional procedures such as blinding to ensure that both groups are handled in the same manner, except for the intervention. In the 65 years since 1959, this point has become much better understood in the clinical trials community.

The key passage in his 1954 American Statistician piece is:

It is a good deal more difficult to control variables in observational than in experimental material, so that the experimental method has unravelled and will continue to unravel mysteries before which uncontrolled observation would be powerless. But there is no difference in principle. There are no such categories as first-class evidence and second-class evidence.
There are merely associations, whether observational or experimental that, in a given state of knowledge, can be accounted for in only one way or in several different ways. If the latter, it is our obligation to state what the alternative explanations or variables might be and to see how their effects can be eliminated, while if the former it is equally our obligation to state so.
To distinguish between statistical association on the one hand and relationships that are established by experimentation on the other, without any reference to alternative variables that are present in one case but not the other, seems to us to be neither good statistics, good science, nor good philosophy-though it may be good red herring.

I think this case is overstated. I would try to phrase his idea differently: there are design features that strengthen the internal validity of a study. These include: prospectiveness, concurrent control, random assignment of subjects to treatment, blinding of patients, blinding of evaluators, and so on. The more of these a study has, the stronger the internal validity. So maybe we don’t divide evidence into “first and second class” but simply tabulate how many good design features are there, and how many are missing.

In his 1959 paper he wrote:

the validation of experimental findings often requires their repetition under a variety of different circumstances.

And I think he implied that this too reduced the differences between RCTs and observational studies, since basically, everything needs triangulation, randomized or not. I can agree with that. In fact that was one of the points of my 2019 paper. And finally I agree with Cornfield on the value of observational studies. That was never in dispute in this thread. Rather, is there a valid interpretation of UIs in a nonrandomized study? I still don’t see a frequentist interpretation.

ChristopherTong · September 10, 2025, 5:13am

I am also tickled that, as George Davey Smith’s introduction to the special issue notes, all of Cornfield’s academic degrees were in history. Davey Smith wrote: “Perhaps one of the advantages that Cornfield had was his lack of any sustained formal training in either epidemiology or biostatistics.” I think I can cite this in support of my post 158!

f2harrell · September 10, 2025, 11:48am

Regarding what is meant by sensitivity analysis in this context, it is equivalent to what Cornfield did: simulate an unmeasured confounder under constraints and see how strong and how related to measured confounders the unmeasured one has to be to explain away the association or change its direction. The R rms package sensuc function does this for binary logistic and ordinary regression models. There’s no excuse for epidemiologists not to use this simple function, but they still don’t.
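To make the idea concrete, here is a minimal Python sketch of that kind of grid. It is not the algorithm inside `sensuc`; it uses the simple external-adjustment formula for a single binary unmeasured confounder and a risk ratio, and every number in it (the observed risk ratio, the grid values, the prevalence among the unexposed) is a hypothetical placeholder.

```python
from itertools import product


def bias_factor(gamma, p1, p0):
    """Apparent risk ratio produced by a binary confounder alone.

    gamma: risk ratio linking the confounder to the outcome
    p1, p0: prevalence of the confounder among exposed / unexposed
    Assumes gamma is the same in exposed and unexposed subjects.
    """
    return (1 + (gamma - 1) * p1) / (1 + (gamma - 1) * p0)


def adjusted_rr(rr_observed, gamma, p1, p0):
    """Observed risk ratio divided by the confounding bias factor."""
    return rr_observed / bias_factor(gamma, p1, p0)


RR_OBSERVED = 2.0               # hypothetical observed association
GAMMAS = [1.5, 2.0, 4.0, 8.0]   # hypothetical confounder-outcome risk ratios
P1S = [0.3, 0.6, 0.9]           # hypothetical prevalences among the exposed
P0 = 0.1                        # hypothetical prevalence among the unexposed

rows = []
for gamma, p1 in product(GAMMAS, P1S):
    rr_adj = adjusted_rr(RR_OBSERVED, gamma, p1, P0)
    rows.append((gamma, p1, rr_adj))
    flag = "explained away" if rr_adj <= 1 else ""
    print(f"gamma={gamma:4.1f}  p1={p1:.1f}  adjusted RR={rr_adj:5.2f}  {flag}")
```

Reading down the table shows how strong, and how unevenly distributed, the hypothetical confounder has to be before the adjusted risk ratio reaches 1 or reverses.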

ESMD · September 13, 2025, 9:04pm

If there were ever a shift to subjective Bayesian methods for observational research, what types of priors would be most defensible? Specifically, do most scientists consider the “null bias” to be justified, or not? Do research consumers (i.e., other scientists, clinicians), by default, tend to discount the possibility that certain cause/effect relationships might exist, simply because accepting the possibility that they do exist would be inconvenient? And is this approach “unscientific”?

In previous posts, Chris asserted that observational researchers who habitually overinterpret their findings are effectively short-circuiting the process of scientific discovery. They are essentially trying to do an end-run around the effortful evidence triangulation that’s needed to inform scientific decision-making. Conversely, those who decry “nullism” seem to be accusing research consumers with clinical/scientific backgrounds of short-circuiting scientific discovery in the opposite direction (by completely disregarding the potential value of any effects with confidence intervals that cross the null).

This 2013 publication by Greenland and Poole discusses the issue:

This 2019 post from Andrew Gelman’s blog is also relevant, as are the comments that follow it:

https://statmodeling.stat.columbia.edu/2019/08/12/here-are-some-examples-of-real-world-statistical-analyses-that-dont-use-p-values-and-significance-testing/

These passages from the 2013 Greenland/Poole article stood out (bolding is mine):

“Our stand against spikes directly contradicts a good portion of the Bayesian literature, where null spikes are used too freely to represent the belief that a parameter “differs negligibly” from the null. In many settings we see, even a tightly concentrated probability near the null has no basis in genuine evidence. Many scientists and statisticians exhibit quite a bit of irrational prejudice in favor of the null based on faith in oversimplified physical models; Shermer…is a vivid example involving cell phones and cancer (see the Greenland…chapter for a discussion). This null prejudice also arises more subtly from confusion of decision rules with inference rules, and from adoption of simplicity or parsimony as a metaphysical principle rather than as an effective heuristic…

Cultural norms vary among research areas on this question. In psychological research, for instance, many hold that the null hypothesis is almost never true…We may be highly certain that any effect present is small enough so that it would make sense to behave as if the null were true until presented with sufficient evidence otherwise (a practice both Fisher and Neyman recommended); this is a heuristic use of parsimony. But a prior that a hypothesis is (for now) a useful approximation to the truth can lead to results quite different from using a spiked prior (which presumes there is evidence that the tested hypothesis is exactly true)…When there is no such evidence, a spike represents an unscientific faith in, or commitment to, the null, with no empirical foundation in most health and social-science applications…”

First consider this phrase:

“This null prejudice also arises more subtly from confusion of decision rules with inference rules.”

We should be clear about who’s responsible for this confusion. As long as observational research is trumpeted in clinical journals, any inappropriate conflation of “inference rules” with “decision rules” is not going to be the fault of clinicians. Researchers who publish observational research in a clinical journal know that their audience will be clinicians and that clinicians are in the business of making clinical decisions. Therefore, they are sending a very strong signal that they expect clinicians to use their findings for this purpose.

Next, let’s consider the following claim:

“…a prior that a hypothesis is (for now) a useful approximation to the truth can lead to results quite different from using a spiked prior (which presumes there is evidence that the tested hypothesis is exactly true)…When there is no such evidence, a spike represents an unscientific faith in, or commitment to, the null, with no empirical foundation in most health and social-science applications.”

I don’t understand all the possible ways to define priors using subjective Bayesian methods, in order to reflect various degrees of belief that an important effect is present. But, after reading many observational studies over many years, I certainly don’t believe that a researcher’s hypothesis should be considered credible by default, simply because it exists. Most observational researchers cite “prior evidence” to justify their studies. They are usually trying to build a case that a certain exposure “causes” an outcome (though they rarely admit their causal aims). However, the prior evidence in question very often seems, scientifically speaking, either very flimsy or utterly uncompelling. For example, given the abysmal approval rate for drugs that show promise in a test tube, there’s usually little reason to believe that a repurposed drug will, strictly by virtue of its mechanism of action, prove efficacious for a new indication that’s completely different from the one for which it’s approved. An investigator’s prior might be optimistic, while other scientists’ priors might be skeptical.

The surreptitious yet widely acknowledged practice known as HARK’ing is another reason why researchers’ hypotheses should not necessarily be afforded much credibility. HARK’ing reflects hopelessly perverted academic incentives and disdain for the labour required for important scientific discovery. For readers with any relevant subject matter knowledge, it’s dead easy to detect and really aggravating when it’s detected. Several years ago, I encountered a publication in which every article in every issue involved painfully obvious data dredging followed by HARK’ing- clearly the editors were fine with this approach!

For the reasons noted above, I don’t agree that the mere existence of a researcher’s hypothesis renders a null-based prior unscientific. Similarly, I wouldn’t agree that an unusually fortunate occurrence represents a “miracle” just because a religious person tells me that it does. Going forward, observational researchers will need to come to grips with the fact that scientifically compelling evidence triangulation takes a LOT more effort than many (?most) believe.

In short, clinicians do indeed harbour a “null prejudice.” And this prejudice does represent a “heuristic use of parsimony.” But, rather than viewing this particular heuristic in a derogatory light (i.e., conflating it with intellectual laziness), we should, arguably, recognize it as both justifiable and necessary in the current observational research ecosystem. We have been inundated by such a huge volume of egregiously overinterpreted, (largely) poor quality research for so many years that we would be completely irresponsible NOT to adopt a highly conservative approach to dealing with it. The daily tsunami of groundbreaking observational research “discoveries” would keep us running in a thousand different directions if we were to take every weak observational effect seriously. Clinicians simply wouldn’t be able to function without the “null prejudice”! Unwillingness to grapple for three hours with the substantive shortcomings of every published observational study isn’t a sign of intellectual laziness, but rather a justifiable and absolutely essential survival mechanism. Finally, and most importantly, accusations of unscientific prejudice also distract from the much bigger actual problem - the poor quality of much modern observational research. What’s needed now, to restore lost credibility, is adoption of a much more cautious and conservative approach to the presentation of observational research findings.

If anyone knows of any published observational studies in medicine that used subjective Bayesian designs and generated a spectrum of posterior distributions corresponding to a spectrum of priors, I’d be keen to see them. It would be interesting to see how much a “null prejudice” could actually affect a study’s results.

f2harrell · September 14, 2025, 11:55am

Very enlightening and thought-provoking messages. In opposition to your next-to-last paragraph, nullism is not reasonable because all null hypotheses are formally false (except for homeopathic remedies) and because a spike represents a discontinuity of knowledge. But more importantly, you are indirectly thinking in a mode that implies that an unbiased estimate of an exposure effect is to be expected. Instead we need to account for bias because of the study design, especially non-randomness of the exposure. And in a Bayesian mode the reasonable and traditional subjective prior to use as a starting point is one that places very low probabilities on very large effects of exposures, e.g. a normal prior on a log odds ratio with a somewhat small variance. The variance can be solved for by setting e.g. P(OR > 2) = P(OR < 1/2) = 0.05.
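As a small worked illustration of that last step, the sketch below solves for the prior standard deviation using only the Python standard library. It assumes a normal prior on the log odds ratio centred at zero; the function name is made up for this example.

```python
from math import log
from statistics import NormalDist


def prior_sd_for_log_or(or_bound=2.0, tail_prob=0.05):
    """SD of a zero-mean normal prior on log(OR) with P(OR > or_bound) = tail_prob."""
    z = NormalDist().inv_cdf(1 - tail_prob)  # standard normal quantile
    return log(or_bound) / z


sd = prior_sd_for_log_or()
prior = NormalDist(mu=0.0, sigma=sd)

# Check both tails; symmetry on the log scale makes them equal.
upper_tail = 1 - prior.cdf(log(2.0))
lower_tail = prior.cdf(log(0.5))

print(f"prior SD on log OR: {sd:.4f}")
print(f"P(OR > 2) = {upper_tail:.4f}, P(OR < 1/2) = {lower_tail:.4f}")
```

The resulting SD is log(2) divided by the 0.95 standard normal quantile, roughly 0.42 on the log odds ratio scale.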

There are two ways that subjective Bayes would directly impact the analysis:

(1) The prior on the exposure effect
(2) The even more important prior on a bias parameter

The width of the posterior distribution for the exposure effect will be driven by the latter as exemplified here.

It is bias to worry about, not fine-tuning an exposure effect prior that acts as if the exposure effect could be estimated without bias. The problem to solve is akin to measurement error models.

ESMD · September 14, 2025, 1:52pm

Thanks for the feedback! I hope you don’t mind a few follow-up questions/clarifying restatements of what you’re saying.

I probably misconstrued much of the Greenland article because 1) I don’t really understand Bayesian methods as they are applied to RCTs; and 2) they seem even more complicated in the observational setting. I have only a hazy understanding of the idea of a “spiked” prior. I infer that a “spiked” prior is intended to reflect a piece of solid scientific knowledge (rather than a “metaphysical belief” as Dr. Greenland notes). He seems to be objecting to the idea of putting a spiked prior on the null in observational settings simply out of nihilistic habit, rather than because any substantive scientific reason exists for us to believe, a priori, that the effect will be null (i.e., “you can’t prove a null”).

So Dr. Greenland was primarily advising against using priors that reflect a statement like “we think, a priori, that it’s virtually impossible that there is any true relationship between the variables we’re studying” (?) You seem to agree with this position given the statement “all null hypotheses are formally false.” However, you also seem to be saying that use or non-use of a “spiked” prior is not going to be the most important issue when designing observational studies using a subjective Bayesian approach, since the prior on the “exposure effect” (spiked or not) would ultimately NOT be the main determinant of the width of the posterior distribution in this setting (?) Rather, in the observational setting, the width of the posterior distribution would end up primarily being a function of the prior on the bias parameter (?)

I’m mostly interested in understanding exactly how all the uncertainty inherent in observational research designs would get distilled into a particular subjective Bayesian prior. If I’m understanding correctly, you seem to be saying that the best way to achieve this is not to state that we are expecting no effect to be present, but rather that we’re not expecting a big effect to be present (?) So, would I be correct to infer that this approach would, in most cases (?), result in quite wide credible intervals and that the width of the intervals would, generally, prompt a more conservative interpretation of a study’s results than we have historically seen with a frequentist approach (i.e., neutralizing any temptation by researchers to claim “discovery” prematurely) (?) But, in the event that a strong underlying effect is actually present (e.g., smoking and lung cancer), this presentation would still prompt us to sit up and take note (?)

P.S.- re: “…all null hypotheses are formally false.” Scientifically speaking, I never really understood why this would be true, though I’ve read it in multiple places. I understand, intuitively, why it’s impossible to prove a null, but not why the null couldn’t be true… If I randomize 1000 patients to a vitamin B12 supplement versus placebo and monitor them for 10 years for heart attack, I would have no reason, scientifically speaking, to expect one arm to accrue more MIs than the other. In this case, I would expect the null to be true (??) I’m sure there’s some deep math/philosophy involved here that I’m in no position to understand, so I’ll just leave it at that.

f2harrell · September 14, 2025, 3:36pm

For a homeopathic remedy even the most modern instruments cannot detect even a single molecule of the “active” compound in the blood stream. So my prior is a spike at zero with probability 1.0. Otherwise I’m certain that any treatment that adds or changes molecules or leads the patient to alter their behavior in a health-related way will have an effect before you get to the 10th decimal place. A point null hypothesis has to involve an effect that is exactly zero, and with a huge enough sample size you could theoretically detect a tiny nonzero effect.

Also, a spike and slab prior envisions knowledge as a discontinuous process. It puts a huge probability (typically 0.5) on an exactly zero effect, and only a tiny probability on the effect being between 0 and 10^{-10}. This is illogical.
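A rough numerical illustration of that discontinuity, under assumptions that are purely for show: the spike gets probability 0.5 and the slab is taken to be a standard normal on the effect scale (the slab's shape and scale are hypothetical choices, not anything prescribed above).

```python
from statistics import NormalDist

SPIKE_WEIGHT = 0.5            # prior probability that the effect is exactly zero
SLAB = NormalDist(0.0, 1.0)   # hypothetical slab: standard normal on the effect scale
EPS = 1e-10                   # width of the interval just above zero

# Mass the slab assigns to (0, EPS). For such a tiny interval,
# density at zero times width is an accurate approximation.
mass_near_zero = (1 - SPIKE_WEIGHT) * SLAB.pdf(0.0) * EPS

print(f"P(effect == 0)        = {SPIKE_WEIGHT}")
print(f"P(0 < effect < 1e-10) = {mass_near_zero:.3e}")
print(f"ratio                 = {SPIKE_WEIGHT / mass_near_zero:.3e}")
```

With these choices the prior treats an effect of exactly zero as many orders of magnitude more plausible than an effect anywhere in the sliver just above zero, which is the jump being objected to.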

To understand the other pieces see if the following helps. No matter how skeptical the prior on the effect, even using a spike and slab prior as long as the probability of the spike at zero is < 1.0, the effect of the prior will wear off with large enough N. But design problems never wear off. The bias in the exposure parameter estimate can be informed by data, and thus lessened, if a small portion of the study involved randomization or some unassailable natural experiment. Without those there are no data to inform the bias, and the bias, when honestly acknowledged (which is not commonly done in frequentism), will dominate the width of the exposure’s posterior distribution as N gets very large. The amount of increased posterior width due to uncertainty about the amount of bias is constant with respect to N.
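One simple way to see the qualitative point is a normal-normal toy model, sketched below. Everything in it is an assumption made for illustration: the estimate is taken to be normal around the true effect plus an additive bias, the effect and the bias get independent zero-mean normal priors, and the particular values of `sigma`, `tau`, and `bias_sd` are arbitrary.

```python
from math import sqrt


def posterior_sd(n, sigma=1.0, tau=0.42, bias_sd=0.2):
    """Posterior SD of the true effect theta in a normal-normal sketch.

    Model (all hypothetical): estimate ~ N(theta + b, sigma^2 / n),
    theta ~ N(0, tau^2), b ~ N(0, bias_sd^2), with b independent of theta.
    Integrating out b, the estimate varies around theta with variance
    sigma^2 / n + bias_sd^2, and only the first term shrinks with n.
    """
    data_var = sigma**2 / n + bias_sd**2
    return sqrt(1.0 / (1.0 / tau**2 + 1.0 / data_var))


for n in (100, 10_000, 1_000_000):
    print(f"N={n:>9,}  no bias: {posterior_sd(n, bias_sd=0.0):.4f}  "
          f"bias SD 0.2: {posterior_sd(n):.4f}")

# Limit as N grows when the bias prior SD is 0.2: the sampling term vanishes.
floor = sqrt(1.0 / (1.0 / 0.42**2 + 1.0 / 0.2**2))
print(f"limit as N grows, bias SD 0.2: {floor:.4f}")
```

Without a bias term the posterior SD keeps shrinking as N grows; with one, it levels off at a floor set by the effect prior and the bias prior, no matter how much data arrive.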

An honest (about bias) Bayesian analysis is like a pre-specified sensitivity analysis where the pre-specification includes weights for all the possible values of the parameter (e.g., effect of an unmeasured confounder) you are varying in the sensitivity analysis. Bayes gives you a single answer (the posterior distribution) instead of debating various results of sensitivity analyses after-the-fact in an even more subjective way.

ESMD · September 14, 2025, 5:12pm

Thanks again- these concepts are pretty challenging to understand, so I’ll have to think about them more. I’m going to push back on just one point, because it often seems to be presented as a philosophically-justified statement of fact by statisticians, even though many biologic scientists might not actually agree with the assertion:

“Otherwise I’m certain that any treatment that adds or changes molecules or leads the patient to alter their behavior in a health-related way will have an effect before you get to the 10th decimal place.”

Dr. Greenland noted that many scientists seem to favour priors focused on the null and he took issue with this habit. Maybe I’m just not able to think abstractly enough, but I understand where they’re coming from…Lots of things are biologically impossible. Scientifically speaking, it’s not reasonable to assert that any substance we put in the body is “theoretically” capable of causing any physiologic outcome we can imagine (though patients make this assumption all the time). For example, if we give a drug that doesn’t cross the blood brain barrier to a person who is on no other medications, it’s not going to cause sedation. Similarly, there’s absolutely NO physiologic reason for us to believe that a B12 supplement might reduce the risk of future MI, no matter HOW many people we study…Asserting that we can’t exclude the possibility that B12 might reduce MI risk, just because we can’t conceive of the possibility or haven’t studied enough people, would be like asserting that we can’t be sure the tooth fairy doesn’t exist…While someone might argue that “we just haven’t seen her yet,” I’d bet my life savings that nobody’s ever going to see her. Would my null-focused prior here be unreasonable or unscientific?

Maybe widespread consensus around the plausibility of a prior spike on a null effect isn’t an overly important debate to have (?) I’m not sure (?) But I can see why the question could be contentious. Every time I hear a statistician say that a null effect is “impossible,” I feel a bit agitated, because the statement doesn’t resonate with my understanding of physiology and biology. And I’m sure I’m not the only one who feels this way…Hopefully a switch to subjective Bayesian presentations of observational results wouldn’t hinge on widespread agreement on this point (?)

f2harrell · September 14, 2025, 10:06pm

I just don’t believe in discontinuity of knowledge. But your blood-brain example is convincing. The B12 one less so. However the blood-brain example makes a case for not collecting data (spike at zero with probability 1), not a case for a spike-and-slab prior.

I see this as an interesting discussion but it’s not central to the best overall strategy for handling uncertainty that the topic started with. And I’ll be content with using continuous priors in the vast majority of cases.

MichiganWater · September 16, 2025, 10:29pm

I know Frank said that he finds your blood-brain barrier example convincing, but my first instinct is to disagree, and would like to explore this a bit, at the very least to help me understand where the disconnect exists.

Do you claim that if we gave a ‘non crossing’ drug to 1,000,000,000,000,000,000,000,000 subjects, that not a single molecule of the drug would make its way into the CNS? That is, you assign exactly 0.00000000000000000000% probability of even one molecule crossing for even one subject?

ChristopherTong · September 17, 2025, 4:09am

I recognize this argument; John Tukey wrote: “All we know about the world teaches us that the effects of A and B are always different–in some decimal place–for any A and B.” However, there are counterexamples. For example whether or not neutrinos have mass, however negligible, is of central importance in particle physics. Earlier the search for the Higgs boson involved a 3-valued hypothesis: a null with no Higgs particle, an alternate hypothesis with a Higgs, and a second alternate stating the data were not consistent with either hypothesis, as discussed elsewhere.

However in the vaccine world, the null hypothesis is irrelevant. For example in 2020 the FDA stated that for a COVID vaccine to be authorized for emergency use, it must have a vaccine efficacy of at least 0.50 with a lower 95% Confidence Limit of 0.30. The null hypothesis was not even mentioned in the FDA guidance.

Unrelated, I still struggle to see how a Bayesian analysis with a single posterior (with an assumed bias parameter whose prior is put in by hand by the author) avoids the need for sensitivity analysis…but if you did do a sensitivity analysis, couldn’t you get any answer you want by changing the assumed bias prior (the “weights” right?). As noted above, the bias hypothetically could be in either direction. What makes the author’s choice of the weights privileged when there are no data with which to even guess them? I must be making an elementary mistake in interpreting all this.

f2harrell · September 17, 2025, 10:52am

Good points. Bayes is not equivalent to sensitivity analysis if you don’t “play” with the prior for the bias parameter, for example if the prior was pre-specified. Ideally it would be specified by an expert who completely understands the study design or lack thereof but who has no access to the data. On the other hand, finding the variance of the prior that “breaks” the analysis would probably result in a better sensitivity analysis than we currently do with a subjective grid of values.
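A hedged sketch of what "finding the variance that breaks the analysis" could look like in the simplest possible setting. It assumes a normal estimate with additive bias, independent zero-mean normal priors on the effect and on the bias, and defines "breaks" as the 95% credible interval for the effect reaching zero; the function names and the input numbers are hypothetical.

```python
from math import sqrt


def z_score(estimate, se, tau, bias_sd):
    """Posterior mean divided by posterior SD for theta in the normal sketch."""
    v = se**2 + bias_sd**2                 # variance of the estimate around theta
    post_var = 1.0 / (1.0 / tau**2 + 1.0 / v)
    post_mean = post_var * estimate / v
    return post_mean / sqrt(post_var)


def breaking_bias_sd(estimate, se, tau, z_crit=1.96, hi=10.0, tol=1e-8):
    """Smallest bias-prior SD at which the 95% credible interval reaches zero.

    Returns 0.0 if the interval already includes zero with no bias allowance,
    and None if even bias_sd = hi does not break the result.
    Bisection works because |z| decreases monotonically as bias_sd grows.
    """
    if abs(z_score(estimate, se, tau, 0.0)) <= z_crit:
        return 0.0
    if abs(z_score(estimate, se, tau, hi)) > z_crit:
        return None
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if abs(z_score(estimate, se, tau, mid)) > z_crit:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical inputs: log odds ratio 0.5, standard error 0.1, effect prior SD 0.42.
s_break = breaking_bias_sd(estimate=0.5, se=0.1, tau=0.42)
print(f"bias prior SD that breaks the result: {s_break:.3f}")
```

Reporting that single tipping value lets readers judge for themselves whether bias of that size is plausible for the design, rather than arguing over a hand-picked grid.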

ESMD · September 17, 2025, 11:46am

No, that’s not what I’m claiming and it’s not the relevant question. Rather, I’m pointing out that if an epidemiologist were to design an observational study, without input from experts in other fields, to examine whether a drug has the potential to “cause” a certain CNS effect, but that drug is known to either not penetrate the CNS or only to penetrate it to an unquantifiable degree in rare cases, it would be very reasonable for clinicians to expect that epidemiologist to select a null-based prior for the effect that he’s seeking.

I can accept that an effect of exactly zero might not be defensible simply due to imperfection of the tool used to measure the effect. But I wouldn’t accept an ideologically-based criticism of null-based priors because of allegiance to the notion that “everything the human body is exposed to can, theoretically, “cause” any outcome we can imagine.”