The most compatible value with the data within a confidence interval

arthur_albuquerque · April 13, 2021, 5:23pm

I am reading the fantastic book “Statistical Rethinking” by Richard McElreath, in which he says, “The posterior mean line is just the posterior mean, the most plausible line in the infinite universe of lines the posterior distribution has considered” (2nd edition, page 101).

This quote made me think about a piece written by Geoff Cumming in which he talks about frequentist confidence intervals. I don’t fully recall either the reference or the exact quote, but he says something like this: “There is nothing special about the values at the limits of a confidence interval, they are as plausible as the other values within this interval.”

I read this quote by Cumming around 2019, and since then, I have always thought that all the values within a confidence interval are equally compatible with the data. Now, reading McElreath’s quote, I believe I misinterpreted Cumming.

If someone could help me, I would like to clarify a few concepts:

1. Is the mean value from a confidence interval always the most compatible one with the data?
   - If so, does such compatibility decrease when you compare values closer to the mean to values closer to the limits of a confidence interval? For example, imagine a mean value of 5 and 95% confidence interval [3 - 7]. Can I say that the value 4.5 is more compatible with the underlying data than the value 3.2?
2. Do these questions (and answers) apply to both confidence and credible intervals?

Thank you!

Pavlos_Msaouel · April 13, 2021, 7:31pm

Indeed, frequentist confidence intervals show the compatibility of the various values within and outside the intervals with the data at hand under the various background statistical assumptions. The point estimate is the one most compatible with the data. The compatibility of the values decreases the further away they are from the point estimate. Here is a good visualization.

Thus, the answer to your question is yes in that 4.5 is more compatible with the data than 3.2 in your hypothetical example.
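To make the comparison concrete, here is a minimal sketch of a P-value (compatibility) function for the hypothetical example. It assumes, as nothing in the thread specifies, a symmetric normal-approximation (Wald) interval, so the standard error is backed out from the interval width; the function name `two_sided_p` is just for illustration.

```python
import math

def two_sided_p(tested_value, estimate, se):
    """Two-sided P-value for a tested parameter value under a normal approximation."""
    z = (tested_value - estimate) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical example from the question: estimate 5, 95% CI [3, 7].
estimate = 5.0
ci_lower, ci_upper = 3.0, 7.0
# Assumption: the interval is estimate +/- 1.96 * SE, so SE is the half-width / 1.96.
se = (ci_upper - ci_lower) / (2 * 1.959964)

for value in (5.0, 4.5, 3.2, 3.0):
    print(f"tested value {value}: p = {two_sided_p(value, estimate, se):.3f}")
```

Under these assumptions the P-value is 1 at the point estimate, roughly 0.62 at 4.5, roughly 0.08 at 3.2, and 0.05 at the interval limit, so compatibility falls off smoothly with distance from the point estimate.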

Bayesian credible intervals can similarly be thought of as the effect sizes most compatible with the data under the various background statistical assumptions which in this case include the model and the prior.

arthur_albuquerque · April 13, 2021, 8:59pm

Thank you, Pavlos. So I completely misinterpreted Cumming back in 2019. The values at the limits of a confidence interval are the least compatible with the data.

Regarding Bayesian credible intervals, I believe I can state the same as above, correct?

Pavlos_Msaouel · April 13, 2021, 9:10pm

Regarding Bayesian credible intervals, I believe I can state the same as above, correct?

Yes, in many cases although it will depend on the shape of the posterior probability density plot. See here some examples.
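As an illustration of why the shape matters, the sketch below uses a grid approximation for a made-up, right-skewed Beta(2, 10) posterior (the distribution and all variable names are assumptions for illustration, not taken from the linked examples). With a skewed posterior the mean and the mode differ, and the two limits of an equal-tailed 95% interval do not have the same posterior density.

```python
# Hypothetical skewed posterior: Beta(a, b) with a = 2, b = 10 (illustrative choice).
a, b = 2, 10
n_grid = 10_001
grid = [i / (n_grid - 1) for i in range(n_grid)]

def density_at(x):
    """Unnormalized Beta(a, b) density."""
    return x ** (a - 1) * (1 - x) ** (b - 1)

dens = [density_at(x) for x in grid]
total = sum(dens)
weights = [d / total for d in dens]  # normalized grid probabilities

post_mean = sum(x * w for x, w in zip(grid, weights))
post_mode = grid[max(range(n_grid), key=dens.__getitem__)]

def quantile(q):
    """Smallest grid value whose cumulative probability reaches q."""
    cum = 0.0
    for x, w in zip(grid, weights):
        cum += w
        if cum >= q:
            return x
    return grid[-1]

# Equal-tailed 95% credible interval.
cri_low, cri_high = quantile(0.025), quantile(0.975)

print(f"mode = {post_mode:.3f}, mean = {post_mean:.3f}")
print(f"95% equal-tailed interval = [{cri_low:.3f}, {cri_high:.3f}]")
print(f"density ratio (lower limit / upper limit) = "
      f"{density_at(cri_low) / density_at(cri_high):.1f}")
```

In this example the most plausible value (the mode) sits below the posterior mean, and the lower limit has a noticeably higher density than the upper limit. A highest-density interval would instead have equal density at both limits.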

R_cubed · April 14, 2021, 12:51pm

I think it would be helpful to emphasize a difference between compatibility (aka. “confidence”) intervals, and posterior credible intervals.

Statistician Richard Royall proposed there are 3 related but distinct questions in applied statistics:

1. What should I believe?
2. What should I do?
3. What do the data say?

Philosophers can (and do!) debate how well this classification works, but I’ve found it helpful.

We can answer question 3 indirectly by tentatively proposing hypothetical models (point estimates) and then calculating how well a particular data set fits by calculating a surprise statistic (p-value).

Essentially, I can assert a model without placing any probability on it, preventing the use of Bayes’ Theorem.

While it is true that values at the ends of a CI are more surprising than the MLE, these differences are not used in the textbook decision procedures because a frequentist doesn’t place any particular prior probability on a region of values. I can simultaneously hold that:

the ends of any particular CI are “more surprising” than the MLE. (retrospective view after data are seen)
I don’t place much credibility (without other information–ie. a prior) that this will hold up in repeated samples. (prospective view of possible future data).

So I think both quotes are correct, and do not contradict one another.

Most recent paper by @zad and @Sander on the proper use of p/S values.

Worth reading. @zad and @Sander mentioned this in the references to the above article.

Here is a good discussion by Larry Wasserman on the difference between Bayes and Frequentist procedures, illustrated by comparing credible intervals with “confidence” (compatibility) intervals.

arthur_albuquerque · April 14, 2021, 9:12pm

These are great insights. Thank you for the references.

I was reading an article that is heavily influenced by @zad and @Sander 's view, as I quote from its Methods section:

"We decided a priori to avoid dichotomizing results according to thresholds for statistical significance. Instead, 95% CIs are interpreted as the range of values most compatible with the data, models and underlying assumptions, and evidence against the null hypotheses of no interactions (no difference between associations in the two treatment groups) are presented as both P- and S-values, as recently suggested. 26-28" https://onlinelibrary.wiley.com/doi/10.1111/aas.13805

I have tried to stick with the “compatibility” concept while explaining my results whenever I wear my frequentist hat.

arthur_albuquerque · April 14, 2021, 9:15pm

Of note, Richard McElreath uses the “compatibility” term for “credible” intervals. It can get confusing sometimes while everyone is trying to run away from “confidence” and “credible”.

Sander · April 15, 2021, 4:01pm

I have a bit of discomfort regarding the above usage of surprise that parallels warnings against inversion of data and parameter probabilities:
The surprise measured by -log(p) is surprise at the observed data if the model from which the P-value p is computed were correct. To say “the ends of the CI [that is, the CL] are more surprising than the MLE” could be taken as inverting the -log(p) interpretation into a statement about parameter values being surprising given the data. That would be posterior surprise (the surprise you would have at finding out that the true parameter value was in the posterior tail(s) cut off the given parameter value), which would be defensible if the CI was a credible posterior interval. But that description is incorrect if it was only a frequentist CI.
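For readers who have not met the transformation, here is a minimal sketch of the surprisal calculation. It assumes base-2 logarithms, so that the result is in bits (the post writes -log(p) without naming a base), and the function name `s_value` is only illustrative.

```python
import math

def s_value(p):
    """Surprisal S = -log2(p) in bits for a P-value p in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # log2(1 / p) is the same quantity as -log2(p).
    return math.log2(1 / p)

for p in (1.0, 0.5, 0.05, 0.005):
    print(f"p = {p}: S = {s_value(p):.2f} bits")
```

A P-value of 0.05 gives about 4.3 bits, roughly the surprise of seeing four heads in a row from a coin assumed to be fair. In keeping with the caution above, the number describes surprise at the data if the tested model were correct, and says nothing directly about how surprising a parameter value is given the data.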

Also, I think Poole’s three 1987 pieces on the P-value function are not dated in the least and are far better for applied interpretations than is Fraser’s 2019 article:
Poole C. Beyond the confidence interval. Am J Public Health 1987a;77:195–199.
Poole C. Confidence intervals exclude nothing. Am J Public Health 1987b;77:492–493.
Poole C. Response to Thompson (letter). Am J Public Health 1987c;77:880.
In fact I think Fraser’s article is downright wrong when it says:
“…if the p-value function examined at some parameter value is high or very high as on the left side of the graph, then the indicated true value is large or much larger than that examined; and if the p-value function examined at some value is low or very low as on the right side of the graph, then the indicated true value is small or much smaller than that examined. The full p-value function arguably records the full measurement information that a user should be entitled to know!”
-As AG might say: No, no, no! Any conclusion about the true value is a posterior inference which is not at all formalized by a P-value except in instances where that happens to numerically coincide with a posterior probability. And for forming inferences the user should be entitled to know far more than the P-value function, including every detail of the selection and measurement protocols of the study, and critical discussion of every assumption that went into computing the P-values (in my field, few assumptions if any will be exactly correct). In that regard, the sentence “The full p-value function arguably records the full measurement information that a user should be entitled to know!” is the epitome of academic nonsense, even if it is modified with the false conditional “given the model is correct”.

R_cubed · April 15, 2021, 6:01pm

I can’t disagree that surprise metrics are also liable to misinterpretation. But I’d ask this: If one wishes to condition on the data (ie treat data as given), why wouldn’t he/she simply use relative measures of evidence (ie. likelihoods)?

RE: Fraser article. That is a lot of food for thought for my weekend. Thanks. I had some concerns with that specific quote, but still found value in the article.

Last question: If you were going to give advice to journal editors on what statistics to accept, would you recommend they require:

likelihood functions,
p-value functions, or
both?

Sander · April 15, 2021, 8:41pm

R-cubed: “If one wishes to condition on the data (ie treat data as given), why wouldn’t he/she simply use relative measures of evidence (ie. likelihoods)?”
?: At the end did you mean “(e.g. likelihood ratios [LRs])”? There are other relative measures including posterior odds (POs) which are LRs only under near-flat priors.
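The relation between the two relative measures can be sketched in a couple of lines, assuming a simple comparison of two point values b1 and b2 (the function name and the numbers are made up for illustration): the posterior odds are the likelihood ratio multiplied by the prior odds, so they coincide only when the prior odds equal 1.

```python
def posterior_odds(likelihood_ratio, prior_odds):
    """Posterior odds of b1 versus b2 = likelihood ratio * prior odds."""
    return likelihood_ratio * prior_odds

lr = 3.0  # hypothetical likelihood ratio favoring b1 over b2

print(posterior_odds(lr, prior_odds=1.0))   # flat prior over b1 and b2: PO equals LR
print(posterior_odds(lr, prior_odds=0.25))  # prior favoring b2: PO now favors b2
```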

That aside, I’m fine with relative measures as part of a toolkit that also includes absolute measures. As an editor I’d have no rigid requirements other than using reasonable tools reasonably for the task set out by the researchers (in practice in my fields that usually means a few P-values or S-values and CIs could suffice, but well used likelihood or Bayesian alternatives could do fine instead). The full graphs behind such numbers are very important for teaching and classroom data analysis, however, because we need to instill in everyone the idea that the numbers are merely road signs indicating where things are located on a much more detailed map of results.

Everyone understands that saying b1 is half as far away as b2 isn’t enough to tell you whether you can walk to b1 or have enough gas to reach b2. Relative stat measures seem to be tricky though because many writers treat them as if more straightforward and informative than P-values, an appearance which I think is an illusion produced by forgetting about the goals and background of the application and all the assumptions needed to compute what are often misleading statistical summaries.

A problem with all these statistical measures is that their size is extremely sensitive to the embedding model (the set of background constraints or auxiliary assumptions), call it A, which is never known to be correct (and never is “correct” in my work, it’s only a “working model”). With two parameter values b1 and b2 we can’t be sure how much a higher likelihood of b1 over b2 is an artefact of A. All we get to say is b1&A diverges from the data less than b2&A according to a certain divergence measure (the likelihood or some transform like the deviance). But it tells us nothing at all about how that depends on A.

The P-values for b1 and b2 give something closely related but different: They are the percentiles where b1&A and b2&A fall along the spectrum of divergences from models of the form b&A. These divergences may come from LR tests comparing b1&A and b2&A to bmle&A (the latter model being at the 100th percentile = maximum). We then see immediately not only which model diverges less from the data (in the sense of having a higher likelihood) but also where the models fall relative to the “best” on this spectrum. Then the S-values convert these P-values into measures of the information against b1&A and b2&A supplied by the respective test statistics.
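A small numerical sketch may help separate the two kinds of summary. It assumes, purely for illustration, that the working model A is a normal model with a known standard error, so the deviance of b&A from the best-fitting model is a squared z-score and the LR-test P-value comes from a chi-squared distribution with 1 degree of freedom. All numbers and function names are hypothetical.

```python
import math

def deviance(b, estimate, se):
    """Deviance of the model b&A from the MLE model: 2 * (max loglik - loglik at b)."""
    return ((estimate - b) / se) ** 2

def lr_test_p(b, estimate, se):
    """LR-test P-value for b, using the chi-squared(1) tail: P(X > d) = erfc(sqrt(d / 2))."""
    return math.erfc(math.sqrt(deviance(b, estimate, se) / 2))

def likelihood_ratio(b1, b2, estimate, se):
    """Likelihood of b1 relative to b2 under the same model A."""
    return math.exp((deviance(b2, estimate, se) - deviance(b1, estimate, se)) / 2)

estimate, se = 5.0, 1.0  # hypothetical MLE and standard error

# Two pairs of parameter values with the same likelihood ratio.
near_pair = (4.5, 3.5)                   # deviances 0.25 and 2.25
far_pair = (2.0, 5.0 - math.sqrt(11.0))  # deviances 9 and 11

for name, (b1, b2) in (("near", near_pair), ("far", far_pair)):
    print(f"{name}: LR = {likelihood_ratio(b1, b2, estimate, se):.2f}, "
          f"p(b1) = {lr_test_p(b1, estimate, se):.4f}, "
          f"p(b2) = {lr_test_p(b2, estimate, se):.4f}")
```

Both pairs have the same likelihood ratio (about 2.7), yet in the first pair both values sit well inside the range of models that fit reasonably, while in the second both sit far out in the tail. The P-values show that location and the likelihood ratio alone does not. Neither summary says anything about how the results depend on A.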

But again, all these statistics tell us nothing about how the measures depend on A, a caveat that burdens all of statistics (even in particle-physics experiments). Some dependencies we can address using traditional modeling tools (like residual plots, model updating, etc.). Other dependencies are beyond our data’s ability to deal with (like selection bias); tools for that have only come to wider attention in the present century (in the form of sensitivity and bias analysis, nonidentified models, nonignorability etc.).

As footnote of sorts, here’s a belabored analogy:
LRs are like describing a pair of final exam results b1 and b2 in a math class (call that A) by saying b1 scored 25% higher than b2. Fine - but does that mean b1 scored at 5/40 and b2 scored 4/40, or that b1 scored 35/40 and b2 scored 28/40? Even if we knew the max score bmax was (say) 36, what would the scores relative to that tell us of their performance? We might feel less clueless if we saw where the scores fell in percentiles (distributions) for the whole class, which are analogous to P-values.

But then what if we wondered about what class performance meant for their real-world skills applying math as an RA on our project? What if the class A is totally abstract measure theory involving no computation or scientific word problems? Or what if instead it is numerical analysis? Or …? Without some context about the connection of the content of the class to the real-world work assignment we’d remain in the dark about the implication of class-performance statistics for answering application questions (not unlike some math statisticians have been about the real-world meanings of P-values, LRs, posterior probabilities etc., even though they know the underlying math in exquisite detail, can provide perfect math definitions, and can prove theorems in abundance).

Sander · April 15, 2021, 9:26pm

Made a correction to my original post (which botched up the description of posterior surprise) - now says it’s “the surprise you would have at finding out that the true parameter value was in the posterior tail(s) cut off the given parameter value, which would be defensible if the CI was a credible posterior interval”. Further corrections welcome.

R_cubed · April 15, 2021, 9:41pm

Blockquote
At the end did you mean “(e.g. likelihood ratios [LRs])”? There are other relative measures including posterior odds (POs) which are LRs only under near-flat priors.

That was precisely what I meant. I own a copy of Richard Royall’s monograph Statistical Evidence: A Likelihood Paradigm and he had convinced me that a distinction between questions of belief revision, choosing a course of action, and summarizing information from experiments (broadly defined to include observational studies) is useful.

I wanted your opinion on data reporting, as I agree with Royall that questions of “what the data say” should not, in principle, depend on the prior of the experimenter. I’d prefer reports that permit me to substitute my own prior or α.

I don’t want to burden you any further as you have been very generous in giving your time and expertise in this thread until I have some time to think more about what you have posted.

For context, I’ve been thinking about this comment by Herman Chernoff on Bradley Efron’s paper “Why Isn’t Everyone Bayesian” I posted in another thread:

Blockquote
With the help of theory, I have developed insights and intuitions that prevent me from giving weight to data dredging and other forms of statistical heresy. This feeling of freedom and ease does not exist until I have a decision theoretic, Bayesian view of the problem … I am a Bayesian decision theorist in spite of my use of Fisherian tools.

Sander · April 15, 2021, 10:43pm

R-cubed: Regarding “questions of ‘what the data say’ should not, in principle, depend on the prior of the experimenter”…
Contrast that to my oft-stated position: Data say nothing at all; they are just bits in a medium or numbers in tables that sit there. If you hear them say something, seek professional help. In other words, for me the claim that data say something without a prior is akin to saying a stack of lumber is a house.

More precisely, I regard the quoted statement as based on a failure to take into account Box’s point (echoed by many since, including Gelman) that a data-probability model (which all approaches must use in some form) is itself a construct from prior distributions, albeit distributions for data given parameters. As such it can and often does vary drastically across analysts. Just consider the controversy over the covid surveys and IFR estimates of Bhattacharya, Ioannidis and colleagues. There were very different priors in play about how selective the analyzed data were, and many other conflicts over assumptions.

Anything the data and their derived statistics purportedly say about actual inference targets is an interpretation by observers, filtered through their assumptions about the data generator (crudely caricatured or hidden by an estimating or likelihood function), plus perhaps some information external to that function (crudely caricatured or hidden by a prior distribution). Sometimes, as in very simple tightly controlled experiments, everyone is confident enough in the data-generator assumptions to switch all attention to the parameter prior. Unfortunately that special case seems to have led to neglect of uncertainties about the data generator, turning most of statistical theory into a misleading model of rational inference in other settings (including all those I see in health and medical research).

-That said, I agree with
“I’d prefer reports that permit me to substitute my own prior or α”
which I would express by saying I want to use my own assumptions, external information and utilities in drawing any inference or decision from the data. Often though I also would wish I could redo the entire analysis using a different data model (even some of my own); it’s the lack of the data and time that makes that impractical.

Sander · April 15, 2021, 11:47pm

P.S. I think Royall’s book is a classic worth reading by any serious student of statistical foundations, as a serious attempt to place the likelihoodist viewpoint above all. Nonetheless I regard all viewpoints as seriously incomplete for research in general. Consequently, assigning primacy to any one formal viewpoint risks a dangerous mind trap, inducing cognitive biases that stem from missing what that viewpoint is blind to and overweighting what it emphasizes. That caution applies to frequentist, Bayesian, likelihood and any other statistical system if treated as a philosophy instead of as a toolkit.

P.P.S. Chernoff’s quote appears much closer to my position than Royall’s. But in light of my perspectivist/toolkit view I would not label myself as anything so specific as a Bayesian decision theorist - although I think the theory is another one every serious student should study (and was a bold label to take in his generation - he was born 1923 and still alive!). Of course I also think serious study of both Fisherian and Neyman-Pearson-Wald frequentist theory is essential, especially for deconfounding the two and relating them to their Bayesian analogs (reference Bayes for Fisherian, Bayesian decision theory for NPW).

R_cubed · April 16, 2021, 2:02am

Blockquote
But in light of my perspectivist/toolkit view I would not label myself as anything so specific as a Bayesian decision theorist

In other writings, he explicitly stated he didn’t consider himself a “Bayesian” as say Savage or Lindley might.

In this interview with John Bather, “A Conversation with Herman Chernoff”, he said the following:

Blockquote
Although I regard myself as non-Bayesian, in sequential problems I feel it is rather dangerous to play around with non-Bayesian procedures … in the hands of less skilled people, non-Bayesian procedures are very tricky in sequential work

This quote is especially important in the context of your discussion with @f2harrell on foundational issues (Bayes-Frequentist equivalence) and the challenge of sequential design.

I actually own Chernoff’s monograph on optimal sequential design, and aside from a few sarcastic remarks on “religious Bayesians” – much of the theory of optimal design he worked out involves sequential probability ratio tests and Bayesian justifications of the procedures. Not much on α adjustments.

I guess the problem I’ve posed for myself – given the Bayes-Frequentist relationship (ie. any coherent frequentist procedure is in the set of Bayes procedures), in what circumstance can a Frequentist (who prefers the α adjustment approach) and a Bayesian agree to terminate a sequentially designed study?

Maybe the solution is as simple as second-generation p-values using pre-specified frequentist coverage (aka “confidence”) intervals, and avoiding α adjustment entirely.

I still think it would be good to have a bit more theory to guide p-value adjustments. I’d say the problem is that any traditional α adjustment is very sensitive to the design. The only guidance theory (ie. Bayes Theorem) gives is that p-values must be adjusted for multiple looks, and nothing more.
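To illustrate why unadjusted repeated looks are a problem, and why any adjustment depends on the design, here is a small simulation sketch. Everything in it is an assumption for illustration: a one-sample z-test with known unit variance, equally spaced looks, and an unadjusted two-sided 5% threshold applied at every look.

```python
import math
import random

def type_i_error_rate(n_looks, n_per_look=20, n_trials=4000, z_crit=1.96, seed=1):
    """Share of null trials that cross |z| > z_crit at any of n_looks interim looks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # The sum of n_per_look independent N(0, 1) observations is N(0, n_per_look).
            total += rng.gauss(0.0, math.sqrt(n_per_look))
            n += n_per_look
            # Test the cumulative data with the unadjusted threshold and stop if "significant".
            if abs(total / math.sqrt(n)) > z_crit:
                rejections += 1
                break
    return rejections / n_trials

rates = {k: type_i_error_rate(k) for k in (1, 2, 5, 10)}
for k, rate in rates.items():
    print(f"{k} look(s): simulated type I error rate = {rate:.3f}")
```

With a single look the simulated rate stays near the nominal 5%, and it climbs well above that as looks are added. How far it climbs depends on the number and spacing of looks, which is the design sensitivity mentioned above.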

AndrewPGrieve · April 16, 2021, 7:33pm

R_cubed

I agree that we need to give serious thought to how we report the results of our studies. It is more than a little self-centred, particularly for us Bayesians, to assume that everyone will be content with, or accept, “our posterior”. The American econometrician Clifford Hildreth wrote a little-known paper in the early 1960s entitled “Bayesian statisticians and remote clients” (available on JSTOR), in which he considered what potential parcels of information we might consider sending to interested customers/decision makers. This remains an issue.