Language for communicating frequentist results about treatment effects

Sander · November 18, 2018, 1:19am

Yes, until somebody convinces me that term is misleading for nontechnical readers relative to the ordinary English meaning of “compatibility” (compared to the extremely misleading evocations of “confidence” and “uncertainty”).

boback · November 18, 2018, 2:28am

We need some sort of ASA guidelines on the language for reporting/interpreting statistical results.

Sander · November 18, 2018, 2:43am

The “specific situations” cover every single problem I encounter in med and health controversies. Every linear, logistic and Cox regression I see has at least a partial-Bayes analog. Conversely every single Bayesian analysis is directly mappable to a random-parameter frequency model, which can be used to calibrate any proposed posterior computation via simulation from that model.
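A minimal sketch of that calibration idea, under assumptions not in the post: a single normal estimate with known standard error, a conjugate normal prior centered at zero, and made-up values for the standard deviations. Parameters are drawn from the random-parameter model, data are drawn given each parameter, and we count how often the proposed posterior interval contains the parameter that generated the data.

```python
import random
from statistics import NormalDist


def posterior_interval_coverage(true_sd, assumed_prior_sd, se,
                                n_sims=20000, level=0.95, seed=1):
    """Frequency with which a normal-normal posterior interval covers theta.

    theta is drawn from N(0, true_sd^2) (the random-parameter model),
    the estimate b from N(theta, se^2), and the posterior is computed
    with a N(0, assumed_prior_sd^2) prior. All values are illustrative.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Conjugate normal update: posterior mean = shrink * b,
    # posterior variance = shrink * se^2.
    shrink = assumed_prior_sd ** 2 / (assumed_prior_sd ** 2 + se ** 2)
    post_sd = (shrink * se ** 2) ** 0.5
    hits = 0
    for _ in range(n_sims):
        theta = rng.gauss(0.0, true_sd)
        b = rng.gauss(theta, se)
        center = shrink * b
        if center - z * post_sd <= theta <= center + z * post_sd:
            hits += 1
    return hits / n_sims


# Prior used in the analysis matches the generating model: close to nominal.
matched = posterior_interval_coverage(true_sd=1.0, assumed_prior_sd=1.0, se=1.0)
# Analysis prior much tighter than the generating model: coverage drops.
mismatched = posterior_interval_coverage(true_sd=3.0, assumed_prior_sd=1.0, se=1.0)
print(matched, mismatched)
```

When the analysis prior matches the generating model, the coverage should sit near the nominal level; when it is too tight, the simulation exposes the miscalibration.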

My impression is that, for some purely psychosocial reasons, extremists on both sides ignore or forget these mappings, which again apply to all the situations I think anyone here (struggling with mere terminology) encounters in real research (not flashy specialty apps in JASA or Biometrika).

My impression is that this failure on both sides is what split statistics into this crazy frequentist/Bayes divide. And that this split has been beyond destructive to thinking, writing, and teaching clearly (inseparable goals) about applied statistics. If the study is at all important and its goal is clear communication of what was observed and what it might portend, both frequentist and Bayesian interpretations need to be brought up. In doing so my ideal (not always attainable perhaps) is to develop an analysis model that can be defended as a useful summary of what is known, both for information extraction and prediction, and thus be acceptable from broad frequentist and Bayesian views. Meaning I listen closely to extremist criticisms of the other side even though I don’t buy their claims of owning the only correct methodology…

“But Bayes in total makes fewer assumptions than frequentist. That’s because frequentist requires a whole other step to translate evidence about data to optimal decisions about parameters. This requires many assumptions about sample space that are more difficult to consider than changing a Bayesian prior.”
That’s all just flat wrong in my view. Top frequentists I know say just the opposite, e.g., that Bayesian methods add crucial assumptions in the form of their priors and hide all the sampling-model complexities from the consumer in a black box of posterior sampling. Plus to top it off, Bayesian conditioning discards what may be crucial information that the model is wrong or hopelessly oversimplified (see Box onward).

I see all those problems, and I have seen some Bayesians in denial about or blind to them. Of course, I have seen some frequentists in denial about or blind to real problems like the subjective nature of typical sampling models in applications; that makes uncommitted use of frequentist methods as subjunctive as uncommitted use of Bayesian methods (subjunctive, as in hypothetical, conditional, and heavily cautioned).

“I think our fundamental disagreement, which will not be fixable no matter how long the discussion, is that you feel you are in a position to question any choice of prior distribution, and you are indirectly stating that a prior must be ‘right’. I only believe that a prior must be agreed upon, not ‘right’, and I believe that even a prior that is a bit off to one reviewer will most often provide a posterior distribution that is defensible and certainly is more actionable.”
Sigh, you may be right it isn’t ‘fixable’ but I see that as stemming from the fact that I am not saying all that (I’ve heard it as you have - but not from me). I thus don’t see you as understanding what I see myself as trying to communicate (on the topic of this page): That I see the bad language framing and exclusivity of thought that has plagued all of statistics from some of the highest theory (e.g., Neyman, DeFinetti) to the lowliest TA for a “Stat 1 for nonmath people” course, as well as publications.

You noted these issues go deep - well it’s deeper than the frequentist-Bayes split, which I think has become an artificial obsession based on exclusivist thinking in mid-20th century “thought leaders.” One can only wonder what the field would have been like if (for all sides) common practice had stemmed from Good, instead of from Neyman or Fisher for frequentists and from Jeffreys or DeFinetti for Bayesians. Well thankfully one applied area did have Box to talk sense: That both sides talk about the same models of the world, the only split is that one is obsessing on Pr(data|model) and the other on Pr(model|data). But there are plenty of applications (e.g., all of mine) where you need to consider both (albeit usually at different stages of planning and analysis).

Now, every stakeholder has a right to question your prior and your sampling model (your total model) as well as whatever loss function underlying a decision (explicit or implied, there is always one behind every decision) and whatever technique was used to merge those ingredients to produce P-values, posterior probabilities, or whatever. Isn’t our job as statisticians to help make all those ingredients as precise, transparent, and clear as possible, so that the criticisms can be anticipated and addressed? And to work with our collaborators to present a compelling rationale for our choices - ideally we’d try to make it compelling to anyone, frequentist, Bayes, or other interested party. And we’d do so knowing we may have made some choices that may look poor in light of facts we did not realize, leaving our results open to criticism on these grounds. We only err when we don’t take those criticisms seriously, for that is when we fail to walk back our inferences or decisions when this new information would call for that under the methodologic rules we claim to follow (e.g., deductive consistency).

For better or worse, however, to err is human and we also have to mitigate many errors handed down to us from authorities, like the misuse of English (e.g., using “null” for any hypothesis instead of a zero hypothesis), and the belief that the terms “frequentist” and “Bayesian” should refer to anything other than techniques (as opposed to philosophies) when applying statistics (as opposed to arguing about philosophy).

Sander · November 18, 2018, 2:53am

Try and get a campaign going with Ron Wasserstein, ASA Director (whom I’ve found very open to such issues and ideas) at ron@amstat.org. You can tell him I thought that would be a good use of ASA resources if the goal was to improve terminology, not to defend what’s in place.

severetester · November 18, 2018, 3:45am

I agree that P-values should be computed for several hypotheses and, as David Cox agrees, we can compute the probability the P-value would exceed (or be smaller than) observed under varying alternatives. Of course this is the basis for a severity assessment, but it doesn’t matter what it’s called.
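A rough sketch of that computation, under assumptions not in the post: an approximately normal estimate with known standard error and an upper-tailed test of a zero hypothesis. The numbers (estimate 4, standard error 4.5) are borrowed from the blood pressure example later in this thread purely for illustration, and the function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def one_sided_p(b, se, mu0=0.0):
    """Upper-tailed P-value for the hypothesis mu = mu0."""
    return 1.0 - STD.cdf((b - mu0) / se)


def prob_p_exceeds_observed(b_obs, se, mu_alt):
    """Probability that the P-value would be at least as large as the
    observed one if the true value were mu_alt.

    For an upper-tailed test, P >= p_obs exactly when the estimate
    is <= b_obs, so this is a normal CDF evaluated at b_obs.
    """
    return STD.cdf((b_obs - mu_alt) / se)


b_obs, se = 4.0, 4.5
p_obs = one_sided_p(b_obs, se)
for mu_alt in (0.0, 5.0, 10.0, 15.0):
    print(mu_alt, round(prob_p_exceeds_observed(b_obs, se, mu_alt), 3))
```

Under the tested hypothesis itself (mu_alt = 0) the result equals 1 - p_obs, as it must for a P-value that is uniform under that hypothesis; the probability of seeing a P-value this large falls as the alternative moves further from zero.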

severetester · November 18, 2018, 3:53am

I think the ASA should encourage all distinct views to weigh in and not set itself up as giving pronouncements on interpretations, especially when the input comes from a select group with certain aims, and particularly as there’s non-trivial disagreement (as is seen here). I agree with David Cox who maintains that the ASA (and RSS) should be discussion forums, encouraging many different perspectives, rather than dictating a single interpretation or position.

severetester · November 18, 2018, 4:36am

So you will keep P-values but banish attained level of statistical significance? (Erich Lehmann at times calls the P-value the “critical level,” and someplace, I believe, the “significance probability”, which would really confuse.) For Cox, the (observed) level of statistical significance (the P-value) reflects the level of incompatibility or inconsistency with the null hypothesis. Nor does he think it necessary to add “observed”.

Sander · November 18, 2018, 5:32am

You mean banish the term “attained significance level”? The quantity is useful (e.g., for decision rules) but Fisher had a less misleading term for it: P-value. So yes, like “the aether” and “consumption” (for tuberculosis) any statistical term with “significance” needs to be retired to historical notes.

Those that institutionalized this destructive “significance” terminology were brilliant conceptually and mathematically gifted, and their students were at least the latter (it was a tiny elite that made it into their tutelage). Those authorities could be pretty sloppy with words and still not be misled by the methods, since they were good at mapping between math and context. For example, Fisher and Neyman knew that any decision parameter like alpha should depend on context.

But the vast majority of users and readers have only the words to hang on to - the math is and always will be a contentless abstraction for them. Our statistical heroes past, no matter how brilliant, were simply unaware of this problem (and really, how could they be?). That does not mean there is no problem: There obviously is a crisis in understanding the meaning of statistics. In dealing with it we need to retire bad traditions, including bad terminology, and not replace them with bad new ones. That’s part of statistics reform.

As for “observed P-value”, “observed” is needed to distinguish the one-dataset Fisherian P-value p that Cox calls “significance level” (a conditional data probability) from the Neymanian P-value P which is a uniform random variable (e.g., see Kuffner & Walker, TAS 2018). Like “significance level”, another example of different authorities using the same term for different concepts. No wonder researchers began mixing up p and alpha, and some statisticians forgot that a P-value degrades information if not calibrated to (uniform under) the assumptions it is computed from (e.g., posterior predictive P-values).

zad · November 18, 2018, 7:24am

I don’t think we need guidelines as much as a few resources that discuss this more in depth with the problematic usage and history of the commonly used terms/phrases.

I do wonder if Ron would be open to a collaboration (it doesn’t have to be ASA-related) to create a resource where common usage of phrases/terms is discussed.

I appreciate the discussion here and believe much of it has been incredibly insightful, but I believe it’s also incredibly restricted and only visited by those who are already interested in proper statistics. Some of the problems that have been brought up here, such as the damage done by the terms “significance” and “confidence”, have not really been brought up elsewhere.

For example, the ASA statement and its supplements, including @Sander’s, discuss some of the problems with dichotomization and misinterpretations, and Sander’s recent papers on P-value behavior also discuss some of the discrepancies in the definition of “significance” etc, but it doesn’t seem an argument (similar to the one in this forum by @f2harrell or @Sander) on why usage of these terms is detrimental to statistics and science has been fleshed out in any sort of paper that can actually be referenced.

If there is one, I would appreciate a link. I think this is something worth thinking about because if I were to go on a universal resource like Wikipedia and attempt to make some edits to add the phrase “compatibility interval” to the entry on confidence intervals, it is very likely that my suggestions would get rejected without some sort of authoritative reference.

zad · November 18, 2018, 7:30am

I can get behind these suggestions and completely avoid the term “significance” (even in a hypothesis testing framework) by sticking to “xx hypothesis is/is not rejected by our prespecified criterion” and behind “compatibility intervals.” I’ve edited my posts near the top of the forum to reflect these thoughts.

EvZ · November 18, 2018, 11:15pm

I’m very late to this party, but still want to offer my 2 cents.

95% confidence does not seem to be such a useful property because it refers to the procedure rather than the observed data. So it’s difficult to fault researchers for interpreting confidence intervals as if they are credibility (or posterior) intervals. What else are they supposed to do with them?

Better terminology is certainly important, but I think it’s even more important to make better intervals.

f2harrell · November 18, 2018, 11:26pm

Not at all with my collaborations over the years. A fundamental reason why there is no 1-1 equivalence between Bayes and frequentists in many situations is that there is no agreement among frequentists in how to handle multiplicity. Areas in which 1-1 equivalence is impossible or at least very difficult are in sequential testing and other situations with stochastic N, and multiple outcomes. For multiple outcomes, e.g., a Bayesian posterior probability that a treatment makes a major improvement on two specific outcomes, a moderate improvement on one other target, or any improvement in mortality, it’s easy to say that there is an equivalence but no one has shown how to derive it.

That’s not correct in my opinion. The big new assumption of Bayes is the prior, and the prior is in plain sight and criticizable. You are avoiding decision making and not recognizing how concordant Bayesian posterior inference is with decision making. Sample space considerations are very hard to be transparent about, and often very hard to take into account. The idea that frequentists can make coherent decisions with calculations about data is a hard sell. This whole argument is caused by fundamental problems in indirect reasoning.

I like everything else you said.

Let’s stop the Bayesian debate at this point and try to come to the best language for communicating and interpreting frequentist results that might be used in the medical literature. Your feedback on the “more agnostic” edits I made yesterday would help us all move forward in getting more reasonable language.

zad · November 19, 2018, 12:01am

I’d suggest you reread some of the posts in this discussion.

Sander · November 19, 2018, 5:13am

The Bayes-frequentist debate enters the terminology debate especially when people start doing uncritical promotion of one or the other, because as I wrote before the dichotomy is disconnected from the details. The claims you put forth not only ignore abundant frequentist criticisms, but also disregard the vast diversity and divergences of methods and “philosophies” (or narrow dogmatisms) falling under the headings of “frequentist” and “Bayesian”. Fisherian frequentism in its most likelihoodist form is for all practical purposes hardly distinguishable from reference Bayes, and both of those hardly resemble the extremes of Neymanian hypothesis testing or operational subjective Bayesianism. Furthermore, the latter extremes are united by many connections (e.g. admissibility of Bayes estimators).

I’ll continue to state that various claims of this or that being superior are misleading if they are not circumscribed by the purpose and context of the analysis. Claims that Bayesianisms (whatever their form) are uniformly desirable are like claiming sports cars are simply the best cars; fine if you are on the autobahn, but see how far that gets you on a rutted dirt road. Likewise, where a Bayesian method and a frequentist method diverge, it can be the Bayesian method which is the one leading us into disastrous error - that was noted by Good long ago even though he was philosophically Bayesian.

What some of us know now is that Bayes can break down without the prior being wrong in any obvious public sense - see the literature on sparse-data inference; as I’ve cited before, there is a lovely summary on page 635-6 of Ritov Bickel et al. JASA 2014. It’s all just a manifestation of the NFLP (No-Free-Lunch-Principle): Every advantage you list (when real) is bought by some disadvantage elsewhere.

So again, regarding terminology, both “frequentist” and “Bayesian” need to be reoriented to talking about methodologic tools, not philosophies or dogmas or dictates. When we do that we can find that we can use P-values to check Bayesian analyses, Bayesian posteriors to check compatibility intervals, etc. By doing both we can see when conventional P-values aren’t close to posterior probabilities and their corresponding compatibility intervals aren’t close to posterior intervals, because we have them both before us at once to compare. And we can then see why we should reject highly subjective posterior-evoking terms like “confidence interval” or “uncertainty interval” to refer to the range between level points on a P-curve, and why we should reorient frequentist terms and descriptions to emphasize the complementarity of frequentist and Bayesian tools.
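To make “the range between level points on a P-curve” concrete, here is a small sketch under assumptions not in the post: an approximately normal estimate with known standard error and two-sided P-values. The numbers (estimate 4, standard error 4.5) are borrowed from the blood pressure example elsewhere in this thread, and the function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def two_sided_p(b, se, hypothesized):
    """Two-sided P-value for one hypothesized parameter value (one point on the P-curve)."""
    z = abs(b - hypothesized) / se
    return 2.0 * (1.0 - STD.cdf(z))


def compatibility_interval(b, se, level=0.95):
    """Range of hypothesized values whose two-sided P-value is at least 1 - level."""
    z = STD.inv_cdf(0.5 + level / 2)
    return b - z * se, b + z * se


b, se = 4.0, 4.5
lo, hi = compatibility_interval(b, se)
print(round(lo, 1), round(hi, 1))
# The P-curve sits at 1 - level at both limits and peaks at the point estimate.
for value in (lo, 0.0, b, hi):
    print(round(value, 1), round(two_sided_p(b, se, value), 3))
```

Tabulating P-values across many hypothesized values, rather than only at zero, is what puts the whole curve in front of the reader; a posterior distribution computed from the same data could then be laid alongside it for comparison.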

EvZ · November 19, 2018, 9:19am

Zad: Researchers tend to interpret CIs as if they’re credibility intervals. And that’s correct if one assumes the uniform (flat) prior. So for this reason, and because of certain invariance properties, the uniform prior is considered by many to be objective or non-informative. The trouble is that the uniform prior is actually very informative both for the sign and magnitude of the parameter. When estimating a regression coefficient (other than the intercept), that’s a real problem because zero has the special importance of “no effect”.
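A small numerical sketch of both points, using a stand-in that is my assumption rather than anything stated above: a zero-centered normal prior with a very large standard deviation plays the role of the flat prior, and the estimate is treated as normal with known standard error. The numbers are illustrative.

```python
from statistics import NormalDist

STD = NormalDist()


def normal_posterior_interval(b, se, prior_sd, level=0.95):
    """Posterior interval for a normal estimate b (standard error se)
    under a N(0, prior_sd^2) prior."""
    z = STD.inv_cdf(0.5 + level / 2)
    shrink = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    post_sd = se * shrink ** 0.5
    return shrink * b - z * post_sd, shrink * b + z * post_sd


def prior_prob_abs_exceeds(threshold, prior_sd):
    """Prior probability that the parameter is larger than threshold in absolute value."""
    return 2.0 * (1.0 - STD.cdf(threshold / prior_sd))


b, se = 4.0, 4.5
# As the prior widens, the posterior interval approaches b +/- 1.96 * se.
for prior_sd in (5.0, 50.0, 1e6):
    print(prior_sd, [round(x, 2) for x in normal_posterior_interval(b, se, prior_sd)])
# The same wide prior puts nearly all its mass on large magnitudes.
print(round(prior_prob_abs_exceeds(10.0, 1000.0), 3))
```

The nearly flat prior reproduces the usual interval, but it also asserts in advance that the coefficient is almost certainly large in magnitude, which is the sense in which it is informative.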

I’ve written a paper about this issue which was just accepted by Statistical Methods in Medical Research. It’s at https://github.com/ewvanzwet/default-prior

If the parameter you want to estimate is a regression coefficient (other than the intercept) in a bio-medical context, then that’s prior information in itself. I sampled 50 random papers from MEDLINE to estimate a prior that can be used as a default in that situation.

Now suppose you have an (approximately) normal and unbiased estimate B of a regression coefficient with standard error se. The default prior that I got from MEDLINE gives B/2 +/- 1.96*se/sqrt(2) as a suitable 95% default credible interval.

The example of Prof. Harrell has a difference of mean blood pressure, which can be viewed as a regression coefficient in a bio-medical context. The p-value is 0.4 and the 0.95 CI is [-5, 13]. So, the estimate is 4 and its standard error is 4.5. The default credible interval I’m proposing would be [-4.2, 8.2]. The default posterior probability that the true difference is positive would be 1 - pnorm(0,B/2,se/sqrt(2))=0.73.
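For anyone who wants to reproduce the arithmetic, here is a minimal Python sketch of the formula as stated above, with `statistics.NormalDist` standing in for R's `pnorm`. The function names are hypothetical, and the inputs are the rounded values quoted in the post (B = 4, se = 4.5).

```python
from math import sqrt
from statistics import NormalDist


def default_credible_interval(b, se, level=0.95):
    """Interval B/2 +/- z * se / sqrt(2), as given in the post (z = 1.96 for 95%)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * se / sqrt(2)
    return b / 2 - half_width, b / 2 + half_width


def default_prob_positive(b, se):
    """Counterpart of 1 - pnorm(0, B/2, se/sqrt(2))."""
    return 1.0 - NormalDist(mu=b / 2, sigma=se / sqrt(2)).cdf(0.0)


b, se = 4.0, 4.5
lo, hi = default_credible_interval(b, se)
print(round(lo, 1), round(hi, 1))
print(default_prob_positive(b, se))
```

With these rounded inputs the interval comes out at about [-4.2, 8.2] and the probability at roughly 0.73 to 0.74; the second decimal depends on how the standard error is rounded.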

simongates · November 20, 2018, 8:52am

I would say it isn’t correct (see earlier posts). The interpretation is different even if the limits of the interval are the same. Things don’t become the same just by being the same size. My rabbit weighs the same as my cat.

GAR · November 20, 2018, 9:16am

That notion of compatibility seems very appropriate and definitely much more fitting than confidence or uncertainty. Do you discuss it in an article or a blog post?

EvZ · November 20, 2018, 9:53am

“I would say it isn’t correct (see earlier posts). The interpretation is different even if the limits of the interval are the same. Things don’t become the same just by being the same size. My rabbit weighs the same as my cat.”

Simon: In my experience, researchers tend to interpret the “0.95” of a CI as a degree of certainty in much the same way they would interpret the “0.5” of a fair coin toss. Don’t you think so?

simongates · November 20, 2018, 11:20am

I agree that people interpret it that way. I’m just saying that the interpretations of frequentist confidence intervals and Bayesian credible intervals are not the same, even if they are the same size.

EvZ · November 20, 2018, 12:02pm

True. But my point is that when people interpret CIs that way, then they’re (implicitly or explicitly) assuming the flat prior. However, that prior is often completely unrealistic. In particular, this is the case when estimating a regression coefficient (other than the intercept) in the context of the life sciences.

Since people are always going to interpret interval estimates as probability statements, I think that those intervals should be based on reasonable priors.