What do you think about limiting the use of the term “evidence” strictly to discussions of the relative fit of two models to data (as is done via the likelihood ratio), and referring to the information contained in the p-value as “surprisal”, as @sander recommends? I’m beginning to think that would clear up much confusion. If I had to describe results with high surprisal, I’d use the language “sufficiently surprising if the assumed model is true.”

AFAICT, the surprisal is inversely related to the likelihood ratio in the simple case of an asserted test hypothesis H_a and its complement H_{\neg a}, in the sense that a high surprisal (low p) indicates that there (probably) exists a better model (a model with higher likelihood) for the current data.

The term “surprisal” works when p-values are scaled to units of information relative to the assumed model, but I struggle to come up with a term when converting them to units that assume the alternative (i.e. the probit transform \Phi^{-1}(1-p)).
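For concreteness, the two scalings can be sketched in a few lines of Python. This is a minimal illustration (the function names are mine, and the probit form assumes the \Phi^{-1}(1-p) reading of the Kulinskaya et al. transformation):

```python
from math import log2
from statistics import NormalDist

def s_value(p):
    """Surprisal (S-value) in bits: -log2(p), information against the assumed model."""
    return -log2(p)

def probit_evidence(p):
    """Probit-scale transform Phi^{-1}(1 - p): the z-score equivalent of p."""
    return NormalDist().inv_cdf(1 - p)

# p = 0.05 carries about 4.3 bits of information against the model,
# or about 1.64 probits on the evidence scale.
print(round(s_value(0.05), 2))          # 4.32
print(round(probit_evidence(0.05), 2))  # 1.64
```

The bit scale has a convenient reading: s bits of surprisal is as surprising as seeing s heads in a row from a coin assumed fair.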

I would never limit a vague term like “evidence” to model comparisons, especially when the bulk of Fisherian literature (including Fisher) refers to small p as evidence against its generative model - and it is evidence by any ordinary-language meaning of “evidence”, with no single comparator.

I would also modify your phrase to say “surprising if the assumed model had generated the data”, to emphasize the hypothetical nature of the model - in ‘soft-science’ modeling we rarely have “true” models in the ordinary meaning of “true”, and so labeling them such invites confusion; but we can still do thought experiments with models to gauge how far off the data are from them, which is the source of P-values and frequentist arguments.

If p<1 there are always “better models” such as saturated models which fit the data perfectly, having p=1 and likelihood=maximum possible under background constraints. I think this illustrates how P-values and likelihoods are insufficient for model selection (at the very least we need penalization for complexity, as in AIC, and for implausibility, as with informative priors).

Finally, a point needing clarification: Why do you want to convert a P-value in the way you indicated? If you have an alternative comparable to your test model, take the P-value and S-value for the alternative to get the information against that alternative. And if you want to compare models, then use a measure for that purpose (which again is not a P-value; a model comparison needs to account for complexity and plausibility).

> I would never limit a vague term like “evidence” to model comparisons, especially when the bulk of Fisherian literature (including Fisher) refers to small p as evidence against its generative model - and it is evidence by any ordinary-language meaning of “evidence”, with no single comparator.

While I find Richard Royall’s arguments for likelihood pretty strong, the distinction between “surprise” and “evidence” is made by others who work with Fisher’s methods.

The term “evidence” seems overloaded when it refers to p-values and likelihoods. It leads to unproductive discussions on how p-values “overstate” the evidence. If a distinction is made between surprise and direct measures of evidence via likelihood or Bayes’ Factors, we can prevent the p-value from being blamed for a simple mistake in interpretation.

> Finally, a point needing clarification: Why do you want to convert a P-value in the way you indicated?

Kulinskaya, Morgenthaler, and Staudte wrote a text on meta-analysis that emphasizes transforming the p-value or t-statistic to the probit scale via \Phi^{-1}(1-p), where \Phi is the standard normal cumulative distribution function. Also writing in the Fisher tradition, they distinguish between p-values as measures of surprise and “evidence”, which they measure in probits, connecting Fisher and the Neyman-Pearson schools of thought.

On another thread, you posted a link to a video where you, Jim Berger, and Robert Matthews discussed p-values. I found all of the suggestions valuable – the -e \times p \times \ln(p) bound, the Bayesian Rejection Ratio, the S-value, and the Reverse Bayes method.
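The -e \times p \times \ln(p) bound is easy to compute directly. A minimal sketch (following the Sellke, Bayarri & Berger calibration; the function names are mine):

```python
from math import e, log

def bf0_lower_bound(p):
    """Lower bound on the Bayes factor in favor of the null, -e * p * ln(p).
    Valid for p < 1/e; no alternative can make the data disfavor H0 more."""
    assert 0 < p < 1 / e
    return -e * p * log(p)

def posterior_null_lower_bound(p, prior_odds=1.0):
    """Implied lower bound on P(H0 | data), given prior odds on H0."""
    post_odds = prior_odds * bf0_lower_bound(p)
    return post_odds / (1 + post_odds)

# p = 0.05: even under the most favorable alternative, the null retains
# roughly 29% posterior probability at even prior odds.
print(round(posterior_null_lower_bound(0.05), 2))  # 0.29
```

This is the sense in which a bare p = 0.05 “overstates” the evidence when misread as P(H0 | data) = 0.05.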

I’ve given much thought to how to use these techniques when reading not only primary studies, but also meta-analyses where aggregate effect sizes are naively computed. I posted a re-examination of a recent meta-analysis on predicting athletes at higher risk of ACL injury in this thread:

I like everything that’s being said, including Sander’s distinction between a model and the model. I like surprise or relative evidence as terms. As for how p-values overstate (relative) evidence, we need to footnote p-values thus: P-values are probabilities about data extremes under a putative generative model, not about effects.

I don’t think that the author of the link you referenced does a much better job. In his “correct interpretations” he still implies that statistical significance is about an effect. It is not. It is about the data under an assumed model (with a particular restriction on that “effect”, which might be zero or any other value). He could write that the observed data (or a test statistic) are statistically significant under the restricted model. Confusing the observed data or an observed test statistic (be it a difference, an odds ratio, or a slope) with a model parameter is the first step into the trap. It is not, and cannot be, the parameter value that is statistically significant.

This misconception seems to lead the author to the wrong (well, at least problematic) statement that a statistically significant result would indicate that the statistic “calculated from the sample is a good estimate of the true, unknown [effect]” (emphasis mine). It is not. The quality of the estimation is not addressed by a test, or only to the minimal extent that the estimate is discernible from the value at which the model parameter was restricted. This is no indication of a “good” or “bad” estimate. It is simply good enough to decide whether we may claim that the value of the parameter is above or below the restricted value (which may be “good enough” for the purpose, though). The actual point estimate can still be way off the mark (something that is better assessed by a confidence [or compatibility] interval).

Then I see another problem in the part “Detailed Reasoning”. Here the author more correctly states that the p-value is a statistic calculated from the data under the restriction (although he is not very clear about that), but then he writes that the p-value is just an approximation. To what? As the data can be seen as the realization of a random variable, the p-value calculated from such data can be seen as the realization of a random variable; let’s call this variable P. If the test is correctly specified, we know that the distribution of P is uniform under the restricted model, and that it is preferentially right-skewed under the desired violation of that model. This is all we can say. Any observed p-value is just a realization (not an approximation) of P - but we don’t know the distribution of P. A p close to zero is taken as evidence that the distribution of P is right-skewed and that the restricted model is violated / inappropriate to describe the data. In particular, under the correctly specified model, the p-values are expected to jump wildly over the entire domain of P (between 0 and 1), no matter how large the sample size is. There is nothing the p-value “approximates”. Only if the model is violated in the desired way might we see the p-value as an approximation - of the value zero (the limiting value approached as the sample size goes to infinity).
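A quick simulation makes the behavior of P concrete. This is a sketch only; the z-test, sample size, and mean shift are arbitrary choices for illustration:

```python
import random
from statistics import NormalDist, mean

random.seed(1)
norm = NormalDist()

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: population mean == mu0 (sigma known)."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - norm.cdf(abs(z)))

# Under the correctly specified (restricted) model, P is uniform on (0, 1):
null_ps = [z_test_p([random.gauss(0.0, 1.0) for _ in range(30)]) for _ in range(2000)]
# Under a violation (true mean 0.5), realizations of P pile up near zero:
alt_ps = [z_test_p([random.gauss(0.5, 1.0) for _ in range(30)]) for _ in range(2000)]

print(mean(null_ps))  # close to 0.5; individual realizations still span (0, 1)
print(mean(alt_ps))   # far smaller
```

Under the restricted model, increasing n does not make the realizations settle down; they remain uniform, which is exactly why an observed p approximates nothing there.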

I don’t think anyone has mentioned GRADE yet in this topic (I may be wrong - I haven’t read through the whole thing). I’m ambivalent about it. Personally I don’t find language statements particularly useful - I’d rather look at the numbers - but other people don’t share that view. I’m interested in what people think about their recommended statements, which have been largely adopted by Cochrane and so are being seen by a lot of people.

Described in this paper:

I find some of them a bit problematic. For example, I don’t like using the word “probably” when we aren’t talking about probabilities. The biggest issue I have found is more in the way they are being used; I’m seeing a lot of meta-analysis results with wide 95% confidence intervals (basically compatible with both major harm and major benefit) described (using the GRADE system) as “probably makes little or no difference”, which seems to me completely misleading.

I’m still giving this a first read, but it doesn’t seem as if any decision-theoretic considerations influenced the recommendations. This is just more informal tinkering at the edges of common statistical fallacies that neither a frequentist nor a Bayesian would condone.

Example: I’d substitute “precision” for “certainty”, as it seems they are confusing subjective probabilities about propositions with data-based frequencies, without placing them in the proper Bayesian context.

I’m also not a fan of these discrete levels. Different questions require different quantities of information, so there can’t be a single answer to the question.

I found the wording of this BMJ tweet interesting (emphasis mine):

> This randomised controlled trial provides strong evidence for no statistically significant difference between traditional cast immobilisation and removable bracing for ankle fractures in adults

From a quick read of the actual article, this wording is confined to the tweet: the analysis found confidence intervals inside the pre-defined clinically relevant margin in both directions and concluded “Traditional plaster casting was not found to be superior to functional bracing in adults with an ankle fracture.”
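The conclusion rests on both confidence limits lying inside the pre-defined clinically relevant margin, which is a simple mechanical check. A sketch (the ±10-point margin matches the MCID the article quotes; the interval values here are made up):

```python
def within_equivalence_margin(ci_low, ci_high, margin=10.0):
    """True if the whole CI for the between-group difference lies strictly
    inside (-margin, +margin), supporting an equivalence-style conclusion."""
    return -margin < ci_low and ci_high < margin

# Hypothetical 95% CIs for a difference in Olerud Molander score:
print(within_equivalence_margin(-4.5, 3.2))   # True: only sub-MCID differences compatible
print(within_equivalence_margin(-12.0, 8.0))  # False: major harm still compatible
```

Note how different this is from the tweet’s framing: the justified claim is about the interval excluding clinically important differences, not about “evidence for no statistically significant difference.”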

Those interested in the original article can find it here:

> The primary outcome was the Olerud Molander ankle score at 16 weeks; a self-administered questionnaire that consists of nine different items (pain, stiffness, swelling, stair climbing, running, jumping, squatting, supports, and work or activities of daily living) [10]. Scores range from 0 for totally impaired to 100 for completely unimpaired.

From the Statistical Analysis section:

> The target between group difference for the primary Olerud Molander ankle score outcome was 10 points, consistent with other studies of ankle fracture management—this is the accepted minimally clinically important difference [8,14]. The standard deviation of the Olerud Molander ankle score from previous feasibility work was 28 points, so we used a conservative estimate of 30 points for the sample size calculation.

This type of outcome metric will suffer from floor and ceiling effects. I’m not sure what can be learned from the study given this.

@f2harrell I am planning to write a letter to the editor regarding an RCT with the wrong interpretation of their data. Do you believe it would be appropriate to reference this discussion?

You might find this thread informative regarding letters to the editor. Some have found “Twitter shaming” to be more productive, but a letter couldn’t hurt.

I struggle to understand the following and perhaps someone can explain:

If frequentist confidence intervals cannot be interpreted the same as Bayesian credible intervals, why are they identical in the setting of a flat prior?

Does this not suggest that values at the extremes of a frequentist confidence interval are indeed far less likely to be “true” than values closer to the point estimate? Is the point estimate therefore (assuming we truly have no prior information) not the estimate most likely to be true?

> If frequentist confidence intervals cannot be interpreted the same as Bayesian credible intervals, why are they identical in the setting of a flat prior?

A flat prior is an attempt to express ignorance. A genuinely informative prior is an attempt to express constraints on the values a parameter might take. But one Bayesian’s “informative prior” is another Bayesian’s “irrational prejudice.” There is no guarantee that Bayesians given the same information will converge if their priors are wildly different.

Those are the philosophical arguments. In practice, Bayesian methods can work very well, and can help in evaluating frequentist procedures.
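The numerical coincidence behind the question is easy to exhibit in the simplest case, a normal mean with known variance. A sketch (the data values are arbitrary):

```python
from statistics import NormalDist

def normal_mean_intervals(xbar, sigma, n, level=0.95):
    """For a normal mean with known sigma, return the frequentist confidence
    interval and the flat-prior Bayesian credible interval; both work out to
    xbar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    half = z * sigma / n ** 0.5
    ci = (xbar - half, xbar + half)
    # Flat prior => posterior is Normal(xbar, sigma^2 / n)
    posterior = NormalDist(xbar, sigma / n ** 0.5)
    cred = (posterior.inv_cdf((1 - level) / 2), posterior.inv_cdf((1 + level) / 2))
    return ci, cred

ci, cred = normal_mean_intervals(xbar=1.2, sigma=2.0, n=50)
print(all(abs(a - b) < 1e-9 for a, b in zip(ci, cred)))  # True
```

The intervals coincide numerically, yet only the second one licenses a probability statement about the parameter, and only because a (questionable) flat prior was asserted.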

The whole point of a frequentist procedure is to avoid reliance on any prior, so the uniform prior is not practically useful. Consider Sander’s old post on credible priors:

> Sander: Then too a lot of Bayesians (e.g., Gelman) object to the uniform prior because it assigns higher prior probability to β falling outside any finite interval (−b,b) than to falling inside, no matter how large b; e.g., it appears to say that we think it more probable that OR = exp(β) > 100 or OR < 0.01 than 100 > OR > 0.01, which is absurd in almost every real application I’ve seen.

> Does this not suggest that values at the extremes of a frequentist confidence interval are indeed far less likely to be “true” than values closer to the point estimate? Is the point estimate therefore (assuming we truly have no prior information) not the estimate most likely to be true?

It seems almost impossible for people to resist coming to that conclusion, but it is wrong. You are not entitled to make posterior claims about the parameter given the data, P(\theta \mid x), without asserting a prior. This becomes less of a problem when large amounts of high-quality data are available; they would drag any reasonable skeptical (finite) prior to the correct parameter.

If we don’t have a good idea for a prior, I prefer the approach of Robert Matthews and I.J. Good, which derives skeptical and advocate priors from a reported frequentist interval.

Matthews’ papers recommend deriving a bound based on the result of the test; I prefer to calculate both the skeptic and advocate priors regardless of the reported test result.

Note that the numerical Bayesian-frequentist equivalence (which, as explained above, should not even be brought up) holds only in a very special case where the sample size is fixed and there is only one look at the data.