For the past few months, I have been reading about how confidence/compatibility intervals aren’t helpful for interpreting data from a single study. Thus, I am trying to switch to a Bayesian view, along with credible intervals.
I understand that the range of a 95% posterior probability (PP) credible interval would be equal to the range of a 95% confidence/compatibility interval if one had used a non-informative (flat) prior.
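To make the equivalence I have in mind concrete, here is a minimal sketch under a normal approximation with a known standard error (the numbers are made up for illustration):

```python
# Minimal sketch: for a normal mean with known standard error, the flat-prior
# posterior is Normal(estimate, SE^2), so the 95% credible interval coincides
# with the usual 95% confidence interval. Numbers are made up for illustration.
from scipy import stats

estimate, se = 0.8, 0.35                  # e.g. a log hazard ratio and its SE

# Frequentist 95% confidence interval
z = stats.norm.ppf(0.975)
ci = (estimate - z * se, estimate + z * se)

# Bayesian posterior under a flat prior: Normal(estimate, se^2)
posterior = stats.norm(loc=estimate, scale=se)
credible = posterior.interval(0.95)

print("95% confidence interval:", ci)
print("95% credible interval:  ", credible)
print("P(effect > 0 | data):   ", 1 - posterior.cdf(0.0))
```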
My question is: would it be appropriate to “mentally” transform all the 95% confidence/compatibility intervals I read in papers into posterior probabilities and 95% credible intervals (from a non-informative prior)?
If so, I would avoid trying to interpret confidence/compatibility intervals (which can be very counter-intuitive) and would switch to interpreting the probability of a certain outcome.
I believe this would be especially helpful for interpreting RCTs in which the confidence/compatibility interval for the primary outcome has crossed the value 1 (P > 0.05).
What do you think?
Disclaimer: I am a medical student who is interested in data analysis but I have no background in mathematics/statistics. So I apologize if some of my thoughts don’t make sense.
The use of uninformative priors to interpret frequentist procedures in a particular case seems like it would only encourage mistakes in statistical interpretation.
A much better approach has been developed by Robert Matthews, known as the Bayesian Analysis of Credibility (aka AnCred).
Essentially, given a frequentist CI, one fixes the size of the posterior credible interval one would need and then derives the prior required to produce it.
The following video is worth watching in its entirety, but to see Matthews present the technique, go to about 1:03:00 for his segment.
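To give a rough idea of the reverse-Bayes logic behind the technique, here is a minimal normal-approximation sketch (this is not Matthews’ published formula; the `prior_needed` helper and the numbers are purely illustrative):

```python
# Reverse-Bayes sketch under normal approximations: start from a reported
# 95% CI, recover the likelihood summary, fix the posterior you would need,
# and solve for the normal prior that would deliver it.
import math

def prior_needed(ci_lower, ci_upper, post_mean, post_sd):
    """Return (prior_mean, prior_sd) of the normal prior that combines with
    the likelihood implied by the CI to give the requested posterior."""
    z = 1.959964
    est = (ci_lower + ci_upper) / 2           # point estimate
    se = (ci_upper - ci_lower) / (2 * z)      # standard error from the CI width
    post_prec = 1 / post_sd**2
    lik_prec = 1 / se**2
    prior_prec = post_prec - lik_prec         # must be positive
    if prior_prec <= 0:
        raise ValueError("requested posterior is more diffuse than the likelihood")
    prior_mean = (post_mean * post_prec - est * lik_prec) / prior_prec
    return prior_mean, math.sqrt(1 / prior_prec)

# Example: reported 95% CI of (-0.05, 0.85) for a log odds ratio. What prior
# would be needed for a posterior centred at 0.30 with SD 0.15 (a 95% credible
# interval of roughly 0.01 to 0.59)?
print(prior_needed(-0.05, 0.85, post_mean=0.30, post_sd=0.15))
```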
Could you please elaborate on the following sentence? What would be the shortcomings of using an uninformative prior?
To take the Bayesian POV seriously implies that your probability statements are the odds at which you would be willing to wager. Since an “uninformative” prior does not apply to any particular case, it is not an adequate representation of the available information about the case at hand for a Bayesian.
It doesn’t even represent the beliefs of frequentists who use the procedure. The classical compatibility interval is an algorithm to select a decision rule, not a rule to update your beliefs.
Addendum: I forgot about this Andrew Gelman blog post that discusses setting priors for clinical trials, with valuable input from Sander Greenland (who has commented below). He explains the problems with non-informative priors here.
Specifying a prior distribution for a clinical trial: What would Sander Greenland do? (link)
I don’t think this is quite as fruitful as you do, because this transformation only works in the special case where there is only one look at the data, at the end of the study. Once you have more than one data look, the equivalence goes awry.
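Here is a quick simulation of that point (the settings are arbitrary: a true null effect, normal data with known SD, and five equally spaced looks). The chance that at least one nominal 95% interval excludes the true value climbs well above 5%, which is why the single-look equivalence breaks down:

```python
# Simulate repeated looks at accumulating normal data under a true effect of
# zero, and count how often *any* nominal 95% interval excludes the truth.
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_per_look, n_looks, sigma = 20_000, 20, 5, 1.0
z = 1.959964

excluded_at_any_look = 0
for _ in range(n_sims):
    data = rng.normal(0.0, sigma, size=n_per_look * n_looks)
    for k in range(1, n_looks + 1):
        x = data[: k * n_per_look]
        se = sigma / np.sqrt(len(x))
        if abs(x.mean()) > z * se:       # 95% interval excludes the true value 0
            excluded_at_any_look += 1
            break

print("P(some 95% interval excludes the truth):",
      excluded_at_any_look / n_sims)     # roughly 0.14 rather than 0.05
```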
You have made this point on a number of occasions for good reason! I’ve had to give it considerable thought over the past few months.
I have a copy of Westfall and Young’s book Resampling-Based Multiple Testing, where they make the following statement:
> Many commonly used frequency based multiple testing protocols are in fact based loosely on Bayesian considerations. (p. 22)
I’ve always wondered how capable advocates of frequentist methods (@sander, @Stephen, or Bradley Efron, for example) would reply to your challenge, as they do not seem quite so bothered by the issue.
I presume they would say something like:
> Yes, it is true that the Frequentist-Bayesian mapping is most easily demonstrated with a single data look, and attempts to extend this to multiple looks require care. But far from refuting frequentist procedures, the Bayesian POV merely requires that the experimenter who adopts frequentist methods adjust his p-values in order to maintain the Frequentist-Bayes interpretation. There exist situations where it is better to pay for the specification of the problem over time via p-value adjustment than to spend a lot of time modelling prior to data collection.
In this post, Stephen Senn goes into the early history of how Fisher advocated for “significance tests” and p-values that could be interpreted in a Bayesian way without being as sensitive to the prior.
> Now the interesting thing about all this is that if you choose between (1) $H_0: \tau = 0 \lor H_1: \tau \neq 0$ on the one hand and (2) $H_0: \tau \leq 0 \lor H_1: \tau \gt 0$ or (3) $H_0: \tau \geq 0 \lor H_1: \tau \lt 0$ on the other, it makes remarkably little difference to the inference you make in a frequentist framework. You can see this as either a strength or a weakness, and it is largely to do with the fact that the P-value is calculated under the null hypothesis and that in (2) and (3) the most extreme value, which is used for the calculation, is the same as that in (1). However, if you try to express the situations covered by (1) on the one hand and (2) and (3) on the other in terms of prior distributions and proceed to a Bayesian analysis, then it can make a radical difference, basically because all the other values in $H_0$ in (2) and (3) have even less support than the value of $H_0$ in (1). This is the origin of the problem: there is a strong difference in results according to the Bayesian formulation. It is rather disingenuous to represent it as a problem with P-values per se.
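Here is a small numerical illustration of Senn’s point under normal approximations (the 50/50 point-null prior and the slab SD below are my own illustrative choices, not from Senn’s post): the one- and two-sided P-values differ only by a factor of two, while the two Bayesian formulations of “no effect” give radically different answers.

```python
# One observed z-statistic; compare (a) one- vs two-sided P-values with
# (b) two Bayesian formulations: a flat prior on tau (directional posterior
# probability) versus a point mass at tau = 0 plus a normal slab.
from scipy import stats

z_obs = 2.0                                     # observed z (unit standard error)

# (a) Frequentist: the two P-values differ only by a factor of 2
p_two_sided = 2 * stats.norm.sf(z_obs)
p_one_sided = stats.norm.sf(z_obs)

# (b1) Flat prior on tau: P(tau <= 0 | data) equals the one-sided P-value
p_tau_le_0 = stats.norm.cdf(-z_obs)

# (b2) Point-null formulation: 50% mass at tau = 0, 50% on a N(0, g^2) slab
g = 2.0                                           # slab SD on the z scale
m0 = stats.norm.pdf(z_obs, 0, 1)                  # marginal likelihood under H0
m1 = stats.norm.pdf(z_obs, 0, (1 + g**2) ** 0.5)  # marginal under the slab
p_H0_given_data = m0 / (m0 + m1)

print(f"two-sided p = {p_two_sided:.3f}, one-sided p = {p_one_sided:.3f}")
print(f"P(tau <= 0 | data), flat prior:      {p_tau_le_0:.3f}")
print(f"P(tau = 0 | data), point-null prior: {p_H0_given_data:.3f}")
```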
Lots of food for thought. On the above quote: if it is referring to multiplicity adjustment, the reverse is true. It is necessary not to adjust for multiplicity to get a Bayesian-frequentist equivalence (and even then only under narrow conditions).
To take all this further, there are a number of questions for which a Bayesian interpretation is possible and a frequentist interpretation isn’t. Example: P(treatment improves at least 3 of 5 outcomes).
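Concretely, here is a minimal sketch, assuming you already have joint posterior draws of the five treatment effects; the correlated normal draws below are only a stand-in for real MCMC output.

```python
# P(treatment improves at least 3 of 5 outcomes) from joint posterior draws.
# In practice the draws come from a fitted Bayesian model; here correlated
# normal draws stand in for MCMC output, with "improvement" meaning effect > 0.
import numpy as np

rng = np.random.default_rng(7)
n_draws, n_outcomes = 10_000, 5

# Stand-in posterior: effect means, SD 0.2, common correlation 0.5
means = np.array([0.30, 0.10, 0.25, -0.05, 0.15])
cov = 0.04 * (0.5 * np.ones((n_outcomes, n_outcomes)) + 0.5 * np.eye(n_outcomes))
draws = rng.multivariate_normal(means, cov, size=n_draws)

improved = (draws > 0).sum(axis=1)   # number of outcomes improved, per draw
print("P(improves >= 3 of 5 outcomes | data):", np.mean(improved >= 3))
```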
Glad I’m seen as a capable advocate of frequentist methods. I hope I am also seen as a capable advocate of Bayesian methods and any method when applied with valid justification in the application context - they are all just toolkits whose differences have been exaggerated in the name of “philosophy” (the bane of applied statistics).
Regarding the frequentist vs. Bayes (mis)framing of the multiple-comparisons/looks issue which seems to be in play here: I believe I am with Senn in viewing it as a screwdrivers-to-hammers comparison in failing to properly map between the two families of tools and the questions being addressed. For the most recent coverage of my views (articles on which go back 30 years) see the open-access articles by Greenland & Hofman (2019), “Multiple comparisons controversies are about context and costs, not frequentism versus Bayesianism” https://link.springer.com/article/10.1007/s10654-019-00552-z
and Greenland (2021), “Analysis goals, error‐cost sensitivity, and analysis hacking: Essential considerations in hypothesis testing and multiple comparisons” https://onlinelibrary.wiley.com/doi/full/10.1111/ppe.12711
Sorry, this may only be a matter of words, but: I would not say “It is necessary to not adjust for multiplicity to get a Bayesian-frequentist equivalence”. The way I’d put it, a Bayesian method does not “adjust for multiplicity” because the account for multiplicity is pre-loaded (as it were) in its framing of the question and model to address what a frequentist calls a “multiplicity problem”. As per their culture, frequentists differ only in allowing a zoo of incompletely specified questions and models, an issue that Leamer dealt with at length in his 1978 book. If we demand the same degree of specification for frequentists as typical Bayesians do as per their culture, and also extend Bayesian methods to their closure (including limits approaching improper priors) then as per Wald 1947-50 we can map all admissible frequentist methods to Bayesian ones and back again. We can also map between them using hierarchical embeddings as per Good (The American Statistician 1987, Vol. 41, No. 1, p. 92).
See my general response a moment ago and the 2019 and 2021 cites therein, and also earlier treatments for Bayes in multiplicity settings such as these two:
This is exceptionally helpful. Perhaps it is fair to say that there exist 1-1 transformations between frequentist and Bayesian quantities, but frequentist quantities are not calibrated in an absolute manner with respect to the decision to be made (I grant you that, because of the choice of prior, nothing is really absolute).
> Glad I’m seen as a capable advocate of frequentist methods. I hope I am also seen as a capable advocate of Bayesian methods.
I had no intention of typecasting you as a member of what you rightfully refer to as “statistical religions.” I hope no offense was taken. It is hard to accurately describe your scholarship in statistics in 1 or 2 words and do it justice.
At least in the context of this discussion forum (and this thread in particular), anyone searching for your posts is most likely to find your constructive dissent from @f2harrell and his very persuasive Bayesian advocacy. It is in that sense that I came up with the description of “advocate” for frequentist procedures.
RE: Multiple comparisons paper (2021) – I had no idea you published a new paper on this. Thanks for posting the link!
Thanks - I think I know what you mean about “absolute calibration” if that corresponds to “under the stated test (target) hypothesis”. My response however is my version of an idea I see in Fisher’s defenses of P-values (his “significance levels”): There is an absolute calibration to a uniform distribution under all assumptions used to derive a valid P-value (the full test model, which includes the test hypothesis and all auxiliary assumptions). This calibration allows the P-value to be treated as a generic type of information measure for the divergence of the data from the test model in a particular direction (the direction specified by the test hypothesis). Following Box, this absolute calibration is for me an essential component of data analysis and complement to Bayesian modeling.
All this means is that the utility or informativeness of a P-value for the test hypothesis is directly proportional to the percentage of auxiliary assumptions which we are either willing to take as certain, or else have verified that P is insensitive (“robust”) to under the targeted test hypothesis. Not simple to be sure, but then again I see P-values as valuable, sophisticated tools that (like antibiotics) got vastly overdistributed and overused, generating strains of researchers and journal editors resistant to logic and common sense.
For details see “Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values”
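As a toy check of that calibration claim, and of the S-value transform from that paper (the simulation settings below are arbitrary): when every assumption used to compute the P-value holds, P is uniform on (0, 1), and s = -log2(p) re-expresses it as bits of information against the test model.

```python
# Under the full test model (normal data, true null, known SD), valid P-values
# are uniform on (0, 1); the S-value s = -log2(p) measures the information
# against the test model in bits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n, sigma = 50_000, 30, 1.0

# Two-sided Z-test P-values with the test hypothesis (mean = 0) true
x = rng.normal(0.0, sigma, size=(n_sims, n))
z = x.mean(axis=1) / (sigma / np.sqrt(n))
p = 2 * stats.norm.sf(np.abs(z))

# Calibration check: the proportion of P-values below each cutoff matches the cutoff
for cut in (0.01, 0.05, 0.25, 0.50):
    print(f"P(p <= {cut}) is about {np.mean(p <= cut):.3f}")

# S-value: p = 0.05 carries about -log2(0.05) = 4.3 bits against the test model
print("S-value at p = 0.05:", -np.log2(0.05))
```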
Great. My point (in phrasing not as accurate as the math, ignoring the inevitable subtleties which nonetheless Wald and successors dealt with) can then be stated as: Optimal Bayesian decisions are in correspondence with optimal frequentist decisions, as seen when we map the Bayesian betting distributions to compound-sampling distributions and use the same utility function for both types of decision. So again, the apparent divergence between them comes down to different cultural attitudes about what are tolerable ways of falling short on the specifications required for such a mapping, and what are tolerable qualitative simplifications.
In the face of complex research realities including limitations of researchers and resources, both sides have to sacrifice quite a lot in practice. Those sacrifices show how short they both fall of their conflicting claims to ideological purity and contextual relevance, and how formal statistical theories function more as sources of social conventions which become dogmas rather than principled recommendations for data analysis. I think DeFinetti recognized that problem in this quote (which Stephen Senn reminded us of and I think we need to repeat often, including to students and ourselves):
“…everything is based on distinctions which are themselves uncertain and vague, and which we conventionally translate into terms of certainty only because of the logical formulation…In the mathematical formulation of any problem it is necessary to base oneself on some appropriate idealizations and simplification. This is, however, a disadvantage; it is a distorting factor which one should always try to keep in check, and to approach circumspectly. It is unfortunate that the reverse often happens. One loses sight of the original nature of the problem, falls in love with the idealization, and then blames reality for not conforming to it.” [DeFinetti, Theory of Probability vol. 2 (1975), p. 279]
I think the idea that a p-value threshold is needed for publication is really the issue, period. If an experiment is worth doing, it’s worth reporting, regardless of the p-value. As an extreme example, the Minnesota Coronary Experiment (1968-73) wasn’t reported because it wasn’t found to be significant. Years later, data recovery led to a reevaluation of the diet-heart hypothesis (https://www.bmj.com/content/353/bmj.i1246). How many negative studies of the diet-heart hypothesis were never reported?
AFAIK, it has been known for decades that using a fixed-level “significance test” (i.e., 0.05, 0.01, 0.005, etc.) without taking the context into account is totally irrational. But that isn’t taught in any stats classes designed for individuals outside of a quantitative discipline.
Unfortunately, most research is done within the frequentist framework without an understanding of what the procedures are trying to accomplish. Understanding Bayes helped me appreciate frequentist contributions more.
In his book Large Scale Inference, Efron points out (p. 48):
> Classical Fisherian significance testing is immensely popular because it requires so little from the scientist: only a choice of test statistic and specification of a probability distribution when the null is true. Neyman-Pearson adds the specification of a non-null distribution, the reward being the calculation of power as well as size.
Since bad stats can make it around the world faster than good stats can pull up their bootstraps, I’ve been focusing on how frequentist reports can be understood in a Bayesian way. The Bayesian Analysis of Credibility can help a research community derive the interval of justified belief in the relevant models of interest, and then use the machinery of Bayesian decision theory to perform experiments that answer the critical questions for stakeholders.
I just came across this letter criticizing exactly what I suggested in the original post. Interesting points and complementary to the discussion above.
Ladislas Nalborczyk, Paul-Christian Bürkner, Donald R. Williams; Pragmatism should Not be a Substitute for Statistical Literacy, a Commentary on Albers, Kiers, and Van Ravenzwaaij (2018). Collabra: Psychology 1 January 2019; 5 (1): 13. doi: https://doi.org/10.1525/collabra.197
That paper is a real eye-opener and I’m thankful to learn of it. The paper actually understates the disagreement between Bayesian and frequentist intervals, because if sequential testing is done one must adjust the frequentist interval for multiplicity and make it more conservative, whereas no adjustment is possible (or needed) for repeated Bayesian intervals.
I fully agree. May I ask in what situations you think a frequentist approach is more suitable than a Bayesian one? I feel that the questions Bayesian analysis answers are simply more interesting and fruitful…
Many of my colleagues feel that using a mix of frequentist and Bayesian procedures is fruitful. The only reason I do that is because of the number of hours in the day. I am waiting to see an example where a frequentist analysis is more appropriate and useful than a Bayesian one. I’ve been waiting a long time.