Change the range, not the language, on confidence intervals

Until I started visiting this forum, my intuition regarding these elementary frequentist concepts was less than zero. But going through your posts and references, along with the writings of @Sander, has helped greatly. I don’t think I’m alone in this.

Looking back, it would have been easier for me to understand the obscure frequentist notion of “confidence” that is distinct from probability if it were linked to the likelihood concept. The likelihood function draws frequentist and Bayesian inference much closer.

Sander pointed out in this old thread:

2 Likes

The simplest answer to your question is that it doesn’t extend well to the multi-parameter or multiple-outcome cases. For example, you’d have trouble quantifying evidence for a treatment reducing mortality by any amount or reducing blood pressure by > 10 mmHg (Bayes has no trouble with this).

2 Likes

Now you are talking my original language - I was once a physicist, and my first-ever undergraduate physics lecture was all about uncertainty (though perhaps not meant in quite the same way). Though I expect it would take a generation to change the language in the medical literature unless we can get at the editors.
P.S. Given that the hypothesis we care about is often only one-sided in medicine, could we not say, e.g., that if a 97.5% interval were 2 to infinity, then any test hypothesis between 2 and infinity would return a fail-to-reject P-value > (1 − 0.975)?
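
A quick numerical check of that relation, as a rough Python sketch under a normal approximation (the point estimate and standard error below are made up purely for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical summary data (illustrative only): point estimate and standard error
est, se = 4.0, 1.02
alpha = 0.025  # 1 - 0.975

# One-sided 97.5% lower confidence bound: the interval is (lower, infinity)
lower = est - stats.norm.ppf(1 - alpha) * se
print(f"97.5% one-sided interval: ({lower:.2f}, inf)")

# Any test hypothesis mu0 inside that interval returns a one-sided P-value
# above 1 - 0.975 = 0.025, i.e., it fails to be rejected at that level.
for mu0 in np.linspace(lower + 0.01, lower + 6, 5):
    p_one_sided = stats.norm.sf((est - mu0) / se)
    print(f"mu0 = {mu0:5.2f}   one-sided P = {p_one_sided:.3f}   > 0.025: {p_one_sided > alpha}")
```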

This is getting away from what researchers need. In the classical statistics world, one specifies a “coverage” probability and the procedure returns two limits. In practice it is much more common to desire the probability that the true unknown effect is within a researcher-specified interval that defines a clinically meaningful range.

The resource you provided is very good for the layperson interested in learning the basic concepts in statistics - thank you.

For the broader public, I feel it can be even more concise, such as: the wide range of the CI indicates that additional study is needed to gain confidence that the results of this study predict the possibility of benefit / risk of harm in other patients. (If I have it right.)

Not exactly, because we have to distinguish two things: the long-run interpretation and the “realized interval” interpretation. When Neyman proposed this in 1937 or thereabouts, he focused on the interpretation of the 95% interval as a coverage probability, i.e., when the same type of study is replicated with sampling from the same population in exactly the same way, 95% of such future intervals will include the population parameter. This is the long-run interpretation, and it only holds before your study is done. Once the study is conducted and your interval created, it is now a single “realized interval” and that definition no longer holds. Until recently, no one could interpret the realized interval except to say that the probability it includes the population parameter is either 0% (no) or 100% (yes); the 95% no longer applies.

We proposed simply considering the 95% interval to be the range of test hypotheses (remembering that each test hypothesis specifies a sampling probability model) under which your study data, or data more extreme, have a probability of at least 1 − 0.95. The interpretation is that the interval is a range of sampling-model means (keeping in mind that the mean of the sampling distribution is the population parameter) that are supported by your data at the threshold you choose (e.g., 1 − 0.95). The 95% is therefore the central percentage of the distribution you choose as the definition of “not unusual” for your data.
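
To make the “range of test hypotheses” reading concrete, here is a minimal Python sketch under a simple normal sampling model (the estimate and standard error are invented for illustration): the limits of the realized 95% interval are exactly the hypothesis values at which the two-sided P-value for the observed data falls to 1 − 0.95.

```python
from scipy import stats
from scipy.optimize import brentq

# Hypothetical study summary (illustrative only)
est, se = 1.8, 0.6
level = 0.95

def p_value(mu0):
    """Two-sided P-value for the test hypothesis mu0 under a normal sampling model."""
    z = (est - mu0) / se
    return 2 * stats.norm.sf(abs(z))

# The realized 95% interval is the set of mu0 with p_value(mu0) >= 1 - level;
# its limits are where the P-value function crosses 0.05, found by root-finding.
lower = brentq(lambda m: p_value(m) - (1 - level), est - 10 * se, est)
upper = brentq(lambda m: p_value(m) - (1 - level), est, est + 10 * se)
print(f"95% interval by test inversion: ({lower:.3f}, {upper:.3f})")

# Same limits as the familiar closed form est +/- 1.96 * se
z_crit = stats.norm.ppf(1 - (1 - level) / 2)
print(f"Closed-form interval:           ({est - z_crit * se:.3f}, {est + z_crit * se:.3f})")
```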

Now once the above is understood, all is clear. This is why Fisher and Neyman-Pearson (NP) had a great disagreement: Fisher realized that the interval was an inversion of his “test of significance” and had nothing to do with the NP “test of hypothesis”. So they could not simultaneously create an interval of this type yet claim that the “test of hypothesis” was an improvement over Fisher’s test of significance. After reflecting on their disagreement, I think Fisher failed to focus on the “realized” interval and claim it for himself because he was too busy rejecting the concept of the test of hypothesis.

This is best called a compatibility interval à la @Sander. But is it clear? Hidden in the definition is a very particular statistical test to be used in gauging incompatibility.

1 Like

Agreed; however, we need to stop identifying P-values with “tests”. P-values are just measures of fit; their habitual identification with “statistical significance” and “hypothesis tests”, and of intervals constructed from them with “confidence intervals”, is what I see as the chief culprit in statistical overinterpretation and misinterpretation.

A P-value is but one measure of fit or compatibility (Karl Pearson) or consonance (Oscar Kempthorne) or consistency (DR Cox) [although it may be the oldest such measure, as it predates the notion of “significance test” by a good century or so; even the term “value of P” (Pearson 1900) predates Fisher’s testing interpretation by a few decades]. By looking at P-values across a parameter range we can construct a compatibility interval showing all target-parameter values that have p > 0.05 when all background assumptions are held fixed.

Any interpretation beyond that (e.g., “Type-I error”, “power”, “confidence”) requires much added baggage that is in no way inherent in the P-value concept, baggage such as the demanding requirements of Neyman’s repeated-sampling set-up. In sum, that P-values can be used to construct statistical tests does not mean P-values should be viewed as tests. And certainly any in-depth analysis demands more measures of fit than just P-values, such as those described in your book!
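
One way to see this construction is to compute the P-value over a grid of candidate parameter values, i.e., a P-value (compatibility) function; here is a rough Python sketch under a normal model, with purely illustrative numbers:

```python
import numpy as np
from scipy import stats

# Illustrative summary statistics (not from any real study)
est, se = 0.4, 0.25

# Two-sided P-value for each candidate value of the target parameter
grid = np.linspace(est - 4 * se, est + 4 * se, 801)
p = 2 * stats.norm.sf(np.abs((est - grid) / se))

# A 95% compatibility interval is the set of values with p > 0.05;
# other thresholds simply trace out other intervals from the same function.
for thresh in (0.05, 0.10, 0.32):
    inside = grid[p > thresh]
    print(f"values with p > {thresh:.2f}: {inside.min():.3f} to {inside.max():.3f}")
```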

7 Likes

In 2019 Sander and Andrew Gelman debated the naming of the interval in the British Medical Journal, arguing for “compatibility interval” versus “uncertainty interval”. After reflecting on the argument between Fisher and Neyman-Pearson, I believe that rather than compatibility, the purpose of the “realized interval” is uncertainty - i.e., uncertainty about the test hypotheses supported by the data. Could we say compatibility of the test hypotheses with the study data? We could, perhaps, but is that really the core purpose of the interval? Sander stated that “Nonetheless, all values in a conventional 95% interval can be described as highly compatible with data under the background statistical assumptions, in the very narrow sense of having P>0.05 under those assumptions.“ These “values” are means of sampling probability models, and therefore I could restate this as “Nonetheless, all values in a conventional 95% interval can be described as the range of test hypotheses supported by the data under the background statistical assumptions, in the very narrow sense of being less divergent from the data under those assumptions.“ That implies uncertainty (my view) rather than compatibility, though I agree there may be a case for Sander’s compatibility. Terminology matters, and as Sander has stated, we should not co-opt words from the English language whose everyday meanings are far from the statistical meaning - and compatibility’s everyday meaning may be further from its statistical use than uncertainty’s.

As I explained in the 2019 exchange with Andrew Gelman in the BMJ, and in many places elsewhere, “uncertainty measures” is not justified and is even misleading as a description of P-values and interval estimates whenever there are uncontrolled or mismodeled sources of uncertainty, such as the nonrandom variation inherent in observational studies of treatments. And in ordinary English usage, ‘compatibility’ is a much weaker descriptor than ‘certainty’ or ‘support’: observing p=1 for a model or hypothesis only says that the divergence measure used to compute the P-value did not deviate from what the model or hypothesis predicted. It does not say the data leave no uncertainty about the correct model, or that the model is supported by the data. That can be seen from the fact that any saturated model (e.g., a regression model filled with every possible product term of every order) will have p=1, even when the model is contextually absurd.
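
A small made-up example of the saturated-model point, sketched in Python with statsmodels (the cell counts are invented): adding the product term for two binary covariates gives the model as many parameters as cells, so the fitted proportions reproduce the data exactly, the residual deviance is zero, and a divergence-based goodness-of-fit P-value can only be reported as 1, however contextually absurd the model may be.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Invented grouped binary data: events / trials in the four cells of x1 x x2
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
events = np.array([10, 30, 25, 15])
trials = np.array([100, 100, 100, 100])
endog = np.column_stack([events, trials - events])

def fit(exog):
    return sm.GLM(endog, exog, family=sm.families.Binomial()).fit()

# Main-effects logistic model: 3 parameters for 4 cells, 1 residual df
main = fit(sm.add_constant(np.column_stack([x1, x2])))
p_fit = stats.chi2.sf(main.deviance, df=1)
print(f"main effects: deviance = {main.deviance:.2f}, goodness-of-fit p = {p_fit:.4f}")

# Saturated model: add the product term, giving 4 parameters for 4 cells.
# The fitted proportions equal the observed proportions, so the deviance is 0
# and any divergence-based P-value is 1, whether or not the model makes sense.
sat = fit(sm.add_constant(np.column_stack([x1, x2, x1 * x2])))
print(f"saturated:    deviance = {sat.deviance:.6f}  (perfect fit, p = 1)")
```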

I view the continued push to interpret statistics like P-values as intrinsically measuring uncertainty or support as a blend of wishful thinking and overselling of what formal statistics can provide. Every situation in which one can justify inferential interpretations of P-values and interval estimates (such as “uncertainty”, “confidence”, and “credibility” applied to intervals) arises from successful design actions such as treatment randomization and masking. Statements about uncertainty, confidence, and posterior probability should be derived from such contextual features, including the documented mechanics of data generation. To use such evocative terms to describe what are only statistical relations of data to model predictions (such as P-values or interval estimates) is to slip, into statistical descriptions, assumptions that often have no justification in reality. In my view this evocation is a bane of accurate research reporting, one that sadly continues to be encouraged by some statisticians and researchers (who often have a clear investment in overinterpreting their data).

5 Likes

Thank you @Sander for clarifying your position. I agree with your central caution: P-values and interval estimates are purely model-conditional quantities. They quantify the relationship between observed data and model predictions, not the truth of a model, nor the total uncertainty inherent in a scientific question. When uncontrolled biases, misspecification, or nonrandom treatment assignment are present – as in most observational research – any inferential interpretation must be grounded in the design and data-generation context rather than attributed to the statistical procedure alone.

Where I differ is not on that caution, but on what the realized interval represents within its model-conditional framework.

You argue that describing intervals as “uncertainty measures” is misleading because they do not account for all sources of uncertainty, especially those arising from design limitations. I agree that they do not capture structural, causal, or model uncertainty. However, conditional on the specified model and assumptions, a 95% interval is precisely the set of parameter values whose associated test hypotheses are not rejected at the 0.05 threshold (or whatever threshold we choose). In that sense, the interval is the inversion of a family of significance (or divergence) tests and defines a range of hypotheses that are less discordant with the observed data under the assumed sampling model. That is a statement about uncertainty – specifically, sampling uncertainty – albeit conditional and limited.

The distinction, then, may lie in the referent of “uncertainty.” I am not claiming that the interval measures total epistemic uncertainty about the world. Rather, it quantifies uncertainty about which parameter values remain viable given the data and the assumed model. That seems different from claiming “support” in a strong evidential sense. Indeed, I share your concern that language such as “supported by the data” can easily drift into evidential overstatement.

Regarding “compatibility,” I appreciate your point that it is weaker than “certainty” or “support,” and therefore less prone to over-interpretation. But I question whether it avoids misinterpretation in practice. In ordinary language, compatibility suggests coherence or agreement in a broad sense. Statistically, however, compatibility is defined narrowly: the divergence measure underlying the test does not exceed the chosen threshold. The everyday meaning may therefore also risk inflation beyond its formal definition. In that respect, I am not convinced that “compatibility” is inherently safer than “uncertainty.”

Your saturated-model example is important. A model with p=1 demonstrates that a P-value reflects only concordance between data and model predictions under a chosen discrepancy measure. It does not imply plausibility of the model in any contextual or scientific sense. I agree entirely. But this reinforces, rather than undermines, the view that intervals quantify uncertainty about parameters within a model, not uncertainty about whether the model is correct. A saturated model forces the residual divergence to zero; it does not eliminate uncertainty about whether the model is meaningful.

Perhaps the core issue is this: should terms like “uncertainty,” “confidence,” or “credibility” be reserved only for settings with defensible design features such as randomization? I agree that design justifies inferential interpretation. But even in well-designed randomized trials, interval estimates remain model-based constructions derived from hypothetical repetition. They quantify sampling variability under those assumptions. In observational settings, the same is true, though the interpretation must be more guarded. The limitation arises from the assumptions, not from the word “uncertainty” itself.

If anything, I worry that abandoning the language of uncertainty may obscure what intervals actually do: they delineate the imprecision inherent in estimation under a specified model. That imprecision exists whether or not the model is fully adequate. The remedy for overselling is not necessarily terminological substitution, but explicit articulation of assumptions and design features alongside the statistical summaries.

So perhaps a reconciliation is possible:

a) “Compatibility” accurately describes the narrow statistical relationship between data and model predictions.

b) “Uncertainty” accurately describes the variability of parameter estimates under the assumed sampling process.

c) Neither term should be interpreted as measuring total scientific uncertainty or evidential support.

d) Both require explicit acknowledgment of design quality and model assumptions.

Terminology does matter, but so does precision about what level of uncertainty we are discussing: sampling, model, causal, or epistemic. My concern is not to inflate what intervals provide, but to preserve clarity that their primary function is to express the limits of precision in estimation under stated assumptions.

1 Like

A problem here is that we are making proposals with no experimental evidence to back up our reasoning about what would be best to do within research realities. Hence we can only appeal to reasoning and experiential anecdotes, a combination which (as anyone involved in real-world applications can attest) is at best used with caution until more controlled study is available.

Here’s my anecdote for today: what you wrote is the same argument I got from professors when I was a student a half-century ago and complained that the statistics were failing to capture uncertainty. Their answer was always along the lines of “we all know that, but they do capture uncertainty within [or given] the model”; some would even mention this in lectures. The half-century since showed that this rationalization for applying “uncertainty” to such vastly incomplete measures of uncertainty did nothing to stop the treatment of conventional statistics as if they were capturing all uncertainty.

Not that all researchers made that mistake, but vast amounts of the literature proceeded as if p<0.05 or a 95% CI excluding the null was grounds for high certainty about an effect (which editors used as a criterion for publication), and also proceeded as if p>0.05 or a 95% CI including the null was grounds for high certainty that there was no effect. So by the beginning of the present century I could only conclude that the kind of proposal you presented (which has a long history) had failed very badly in practice.

This conclusion led me first into promotion of multiple-bias analysis, which tries to enter into the analysis model all uncertainty sources seen as potentially important, as discussed in Greenland, S. (2005). Multiple-bias modeling for analysis of observational data, with discussion. Journal of the Royal Statistical Society, Series A, 168, 267-308. As some discussants noted, however, and as it soon became clear, that approach required far too much effort and sophistication to ensure proper use by most researchers.

I thus turned to how to improve interpretation of conventional statistics, which was plagued by misinterpretations such as those catalogued in
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., Altman, D.G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. The American Statistician, 70, supplement 1, 1-12 (available on JSTOR and at
https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf).
The authors of that paper diverged about how best to proceed, ranging from some arguing conservatively (along your lines) for retaining traditional treatments stated with more care, to others arguing for a complete change to Bayesian methods.

One reform item they all could accept was resurrecting the original, more modest compatibility interpretations of conventional statistics, as seen long before Fisher, Neyman, and others fused statistical reporting with the modern statistical theory of survey and experimental design. The premodern exemplars seemed content to use P-values and interval estimates in a role describing the relation of data to models and to hypotheses within models, as with Pearson’s use of compatibility when presenting goodness-of-fit P-values (a descriptor that Fisher also used at times)*.

In that premodern era, only a rarefied elite participated in research and reporting, and they were under none of the intense time and funding pressures that became the norm in the second half of the 20th century. Yet even in that era we can find complaints about misinterpretation of large P-values (“nonsignificance”) as if they supported there being no difference*. Thus we should not be surprised that misinterpretations became the norm when, after WWII, science opened up to a flood of newly minted researchers subject to the pressures of modern academia, including pressures to downplay the uncertainties about their own results.

In sum, writers have been lamenting these misinterpretations for well over a century, and the old advice to retain usage of “uncertainty given the model” appears to have done nothing to stop over-interpretations of conventional statistics. Hence my push to take the not-so-radical step of retreating to the more modest treatments of statistics found before the rigidly formalized approaches of Neyman and others took control of theory, and well before the caricatures of those approaches called “NHST” and “confidence interval estimation” took control of practice. How much this reform effort will improve reporting will only be seen if it is adopted in place of traditional treatments that insist on treating conventional statistics as uncertainty measures.

*For citations central to this history, see sec. 3 of
Greenland, S. (2023). Divergence vs. decision P-values: A distinction worth making in theory and keeping in practice (with discussion). Scandinavian Journal of Statistics, 50(1), 1-35, https://arxiv.org/ftp/arxiv/papers/2301/2301.02478.pdf, https://onlinelibrary.wiley.com/doi/10.1111/sjos.12625, discussion 50(3), 899-933, corrigendum 51(1), 425.

5 Likes

Yes, I can relate to this, and it makes a lot of sense. I will keep this in mind when this discussion comes up again in clinical settings or academia (and it does come up quite frequently!).

2 Likes

I would very much like to see a resurgence of multiple-bias modeling (see this for a simple Bayesian attempt). The fact that it is very difficult to do should not prevent us from setting the stage for better, more careful analysis. It will get researchers, for once, to ask whether a treatment effect from observational data is even trying to estimate the same parameter as the treatment effect from a randomized trial. By adding a second parameter (the bias of the treatment effect from observational data), one can learn just how severe a restriction must be put on the prior for that bias for the observational data to have an effective sample size even one-fourth as large as the randomized sample size. Uncertainty intervals for the treatment effect will be far more honest. They wouldn’t account for all forms of uncertainty, but they would account for some of the most important ones that are largely ignored today.
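
This is not the specific approach linked above, but a minimal conjugate-normal sketch in Python of the general idea (the variance and the Normal bias prior are assumptions chosen for illustration): treating the observational estimate as targeting the trial parameter plus a bias term shows how quickly a wider bias prior shrinks the observational study's effective contribution relative to an unbiased study of the same precision.

```python
import numpy as np

# Illustrative sampling variance of the observational effect estimate (log scale)
v_obs = 0.04

# Assumed model: y_obs ~ Normal(theta + bias, v_obs), with bias ~ Normal(0, tau^2).
# Marginally y_obs ~ Normal(theta, v_obs + tau^2), so the information it carries
# about theta is 1/(v_obs + tau^2) rather than 1/v_obs.
for tau in (0.0, 0.1, 0.2, np.sqrt(3 * v_obs), 0.5):
    eff_fraction = v_obs / (v_obs + tau**2)  # effective-sample-size fraction
    print(f"bias prior SD = {tau:.3f}  ->  effective n is {eff_fraction:.0%} of an unbiased study")

# With these numbers, a bias-prior SD of sqrt(3 * v_obs) (about 0.35) already
# cuts the observational study's effective sample size to one quarter.
```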

6 Likes