I don’t think anyone has mentioned GRADE yet in this topic (may be wrong - I haven’t read through the whole thing). I’m ambivalent about it. Personally I don’t find language statements particularly useful - I’d rather look at the numbers - but other people don’t share that view. I’m interested in what people think about their recommended statements, which have been largely adopted by Cochrane and so are being seen by a lot of people.
Described in this paper:
I find some of them a bit problematic. For example, I don’t like using the word “probably” when we aren’t talking about probabilities. The biggest issue I have found is more in the way they are getting used: I’m seeing a lot of meta-analysis results with wide 95% confidence intervals (basically compatible with both major harm and major benefit) described (using the GRADE system) as “probably makes little or no difference”, which seems to me completely misleading.
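To make the complaint concrete, here is a rough sketch (not part of GRADE, and the numbers and the minimal clinically important difference are hypothetical) of the check I would want before anyone writes “little or no difference”: the whole interval, not just the point estimate, has to sit inside the range of effects considered too small to matter.

```python
# Sketch only: the MCID value and example intervals below are hypothetical,
# chosen to illustrate the point, not taken from any particular review.

def little_or_no_difference(ci_lower, ci_upper, mcid):
    """Return True only if the entire interval lies within +/- MCID."""
    return -mcid < ci_lower and ci_upper < mcid

# A wide interval compatible with both major harm and major benefit:
print(little_or_no_difference(-0.15, 0.20, mcid=0.05))   # False: the claim is not supported

# A narrow interval entirely inside the MCID range:
print(little_or_no_difference(-0.02, 0.03, mcid=0.05))   # True
```

With the wide interval, “probably makes little or no difference” is exactly the kind of statement the data cannot support.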
I’m still giving this a first read, but it doesn’t seem as if any decision theoretic considerations influenced the recommendations. This is just more informal tinkering at the edges of common statistical fallacies that neither a frequentist nor a Bayesian would condone.
Example: I’d substitute “precision” for “certainty”, as it seems they are confusing subjective probabilities of propositions with data-based frequencies, without placing either in a proper Bayesian context.
I’m also not a fan of these discrete levels. Different questions require different quantities of information, so there can’t be a single answer to the question.
I found the wording of this BMJ tweet interesting (emphasis mine):
This randomised controlled trial provides strong evidence for no statistically significant difference between traditional cast immobilisation and removable bracing for ankle fractures in adults
From a quick read of the actual article, this wording is confined to the tweet: the analysis found confidence intervals that lay inside the pre-defined clinically relevant margin in both directions and concluded “Traditional plaster casting was not found to be superior to functional bracing in adults with an ankle fracture.”
Those interested in the original article can find it here:
From the article:
The primary outcome was the Olerud Molander ankle score at 16 weeks; a self-administered questionnaire that consists of nine different items (pain, stiffness, swelling, stair climbing, running, jumping, squatting, supports, and work or activities of daily living). Scores range from 0 for totally impaired to 100 for completely unimpaired.
From the Statistical Analysis section:
The target between group difference for the primary Olerud Molander ankle score outcome was 10 points, consistent with other studies of ankle fracture management—this is the accepted minimally clinically important difference. The standard deviation of the Olerud Molander ankle score from previous feasibility work was 28 points, so we used a conservative estimate of 30 points for the sample size calculation.
This type of outcome metric will suffer from floor and ceiling effects. I’m not sure what can be learned from the study given this.
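For anyone wanting to see what those quoted design assumptions imply, here is a sketch of the standard two-sample normal-approximation sample size calculation with the quoted MCID of 10 points and SD of 30 points. The alpha and power values are my assumptions for illustration; the trial’s own calculation may have used different inputs or methods.

```python
# Sketch: standard two-sample sample-size formula with the quoted assumptions
# (MCID = 10 points, SD = 30 points). Alpha and power are assumed here for
# illustration; the trial's actual calculation may differ.
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.90):
    """Per-group sample size for a two-sided, two-sample comparison of means."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) ** 2) * sd ** 2 / delta ** 2

print(round(n_per_group(delta=10, sd=30)))  # about 189 per group under these assumptions
```

None of this addresses the floor and ceiling problem, of course; it only shows how the 10-point / 30-point assumptions drive the trial size.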
@f2harrell I am planning to write a letter to the editor regarding an RCT whose authors misinterpreted their data. Do you believe it would be appropriate to reference this discussion?
You might find this thread informative regarding letters to the editor. Some have found “twitter shaming” to be more productive, but a letter could not hurt.
I struggle to understand the following and perhaps someone can explain:
If frequentist confidence intervals cannot be interpreted the same as Bayesian credible intervals, why are they identical in the setting of a flat prior?
Does this not suggest that values at the extremes of a frequentist confidence interval are indeed far less likely to be “true” than values closer to the point estimate? Is the point estimate therefore (assuming we truly have no prior information) not the estimate most likely to be true?
“If frequentist confidence intervals cannot be interpreted the same as Bayesian credible intervals, why are they identical in the setting of a flat prior?”
A flat prior is an attempt to express ignorance. A true informative prior is an attempt to express constraints on the values a parameter might take. But one Bayesian’s “informative prior” is another Bayesian’s “irrational prejudice.” There is no guarantee that Bayesians given the same information will converge if their priors are wildly different.
Those are the philosophical arguments. In practice, Bayesian methods can work very well, and can help in evaluating frequentist procedures.
The whole point of a frequentist procedure is to avoid reliance on any prior, so the uniform prior is not practically useful. Think about Sander’s old post on credible priors:
Sander: Then too, a lot of Bayesians (e.g., Gelman) object to the uniform prior because it assigns higher prior probability to β falling outside any finite interval (−b, b) than to falling inside, no matter how large b; e.g., it appears to say that we think it more probable that OR = exp(β) > 100 or OR < 0.01 than 100 > OR > 0.01, which is absurd in almost every real application I’ve seen.
“Does this not suggest that values at the extremes of a frequentist confidence interval are indeed far less likely to be ‘true’ than values closer to the point estimate? Is the point estimate therefore (assuming we truly have no prior information) not the estimate most likely to be true?”
It seems almost impossible for people to resist coming to that conclusion, but it is wrong. You are not entitled to make posterior claims about the parameter given the data, P(θ | x), without asserting a prior. This becomes less of a problem when large amounts of high-quality data are available; they would drag any reasonable skeptical (finite) prior to the correct parameter.
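A minimal normal-normal conjugate sketch makes both points: which values are “more probable” depends on the prior you assert, and with enough data the prior stops mattering. All the numbers below are hypothetical.

```python
# Sketch: normal-normal conjugate posterior. Illustrates that "values near the
# point estimate are more probable" is a posterior statement requiring a prior,
# and that abundant data overwhelm a skeptical prior. Numbers are hypothetical.
import math

def posterior(prior_mean, prior_sd, est, se):
    """Posterior mean and SD for a normal estimate combined with a normal prior."""
    w_prior = 1 / prior_sd ** 2           # prior precision
    w_data = 1 / se ** 2                  # data precision
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * est)
    return post_mean, math.sqrt(post_var)

est, se = 0.40, 0.20                       # e.g. a log odds ratio and its SE

# Near-flat prior: posterior numerically matches the frequentist interval.
print(posterior(0.0, 100.0, est, se))      # ~ (0.40, 0.20)

# Skeptical prior centred on no effect: posterior pulled toward 0.
print(posterior(0.0, 0.20, est, se))       # ~ (0.20, 0.14)

# Much more data (smaller SE): the same skeptical prior barely matters.
print(posterior(0.0, 0.20, est, 0.02))     # ~ (0.396, 0.02)
```

The numerical coincidence in the first case is exactly the flat-prior equivalence asked about above; the second case shows why reading the frequentist interval as a statement about which parameter values are likely smuggles in a prior you never stated.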
If we don’t have a good idea for a prior, I prefer the approach of Robert Matthews and I. J. Good, which derives skeptical and advocate priors from a reported frequentist interval. His papers recommend deriving a bound based on the result of the test; I prefer to calculate both the skeptic and the advocate bounds regardless of the reported test result.
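Here is my reading of the skeptical-limit part of that reverse-Bayes calculation, sketched under a normal approximation on the log odds-ratio scale. The example interval is hypothetical, and this is a sketch of the idea rather than a reproduction of Matthews’ published tables.

```python
# Sketch of the reverse-Bayes "skeptical limit" under a normal approximation
# on the log odds-ratio scale, in the spirit of the Matthews approach.
# My derivation/reading; the example interval is hypothetical.
import math

def skeptical_limit(ci_lower, ci_upper):
    """
    For a 'positive' 95% CI (both limits on the same side of 1), return S0:
    a normal prior centred on no effect whose own 95% interval is narrower
    than (1/S0, S0) pulls the posterior interval back across no effect,
    while any wider (less skeptical) prior leaves the finding intact.
    """
    l, u = math.log(ci_lower), math.log(ci_upper)
    if l * u <= 0:
        raise ValueError("CI includes no effect; skeptical limit undefined")
    return math.exp((u - l) ** 2 / (4 * math.sqrt(l * u)))

# Example: a reported OR with 95% CI 1.10 to 4.00 (hypothetical numbers).
print(skeptical_limit(1.10, 4.00))   # ~ 3.1
```

So for that hypothetical interval, the “significant” result survives only for a skeptic who was already willing to entertain odds ratios beyond about 3; the advocate bound is computed analogously from the other direction.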
Note that the Bayesian-frequentist numerical equivalence (which, as explained above, should not even be brought up) works only in a very special case where the sample size is fixed and there is only one look at the data.
Thought I’d post a recent paper by Sander Greenland and Valentin Amrhein on accurate language for describing frequentist statistical summaries.
It is a reply to this (open access) paper:
Stefanie Muff, Erlend B. Nilsen, Robert B. O’Hara, Chloé R. Nater. Rewriting results sections in the language of evidence. Trends in Ecology & Evolution, Volume 37, Issue 3, 2022, Pages 203-210. ISSN 0169-5347. link
@Sander and @Valentin_Amrhein have published an article on p value curves that is worth reading.
Amrhein, V., & Greenland, S. (2022). Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. Journal of Information Technology. link
@Sander_Greenland, Mansournia, and Joffe have a paper in the most recent issue of Preventive Medicine (open access through October) that applies S-values and the more modest compatibility interpretation of frequentist statistics to some Covid drug studies…
Here is a related paper by Mansournia, Nazemipour, and Etminan:
New paper by @Sander_Greenland on the distinction between p-values as measures of divergence and p-values as inputs to a decision.
Perhaps it would be better to refer to p-values (for an explicit data set) as percentiles rather than probabilities. The “probability” interpretation of a p-value is conditional on a model and on entirely hypothetical replications under that assumed model; the actual p-value is retrospective.
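To illustrate the percentile reading, here is a small simulation sketch: the p-value is just (one minus) the percentile rank of the observed statistic within the sampling distribution generated by an assumed null model. The “data” and null model below are entirely made up.

```python
# Sketch: a two-sided p-value read as a percentile of the observed statistic
# within the simulated sampling distribution of an assumed null model, rather
# than as a probability of anything in this particular study. Numbers made up.
import numpy as np

rng = np.random.default_rng(0)

observed_mean_diff = 0.8                  # hypothetical observed statistic
n_per_arm, sd = 30, 2.0                   # hypothetical null model parameters

# Simulate the statistic's distribution under the assumed null model.
sims = (rng.normal(0, sd, (100_000, n_per_arm)).mean(axis=1)
        - rng.normal(0, sd, (100_000, n_per_arm)).mean(axis=1))

# Fraction of hypothetical replications at least as extreme as what was observed,
# i.e. 1 minus the percentile rank of |observed| in the null distribution.
p = np.mean(np.abs(sims) >= abs(observed_mean_diff))
print(p)
```

Everything “probabilistic” here lives in the hypothetical replications of the assumed model, which is the point: the number attached to the actual, already-observed data is a percentile within that reference distribution.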
I think Crane’s notion of “academic” probabilities (model outputs) versus “real” probabilities (prices at which bets are made) is relevant to the flawed probability interpretation of p-values.
Does this make sense for the methods section of a journal article? If not, how should it be reworded? It is time to move this into the real world, but we need language that end users (non-methodologists) will also be able to work with.
P values indicate the strength of evidence (considerable, some, little) against the assumption that the hypothesized null model generated the data that led to the effect size(s) in this study. A small P value (usually < 0.004) indicates considerable evidence against the hypothesized null model as the data generation mechanism; it is small enough to begin to suggest that the data may be in contradiction with that model and that an alternative to it should perhaps be considered. P values ≥ 0.004 but ≤ 0.2 represent some evidence against the hypothesized null model, and P values > 0.2 indicate little evidence against the hypothesized null model as the data generation mechanism. It should be noted that small P values could also be due to random error, and replicability of such results is therefore not guaranteed. The 95% confidence intervals, on the other hand, indicate a range of effects that are compatible with our data, given our model, after allowing for random error. This interval therefore indicates the precision of the results with respect to random error. These intervals do not account for systematic error, so the precision of the results may be overestimated.
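One way to help non-methodologists with the wording above is to report the S-value alongside the P value, as in the Greenland papers cited earlier in the thread. Here is a sketch of that transform together with the proposed verbal categories; note that the cut-offs (0.004 and 0.2) are the proposal’s, not an established standard.

```python
# Sketch: the S-value (binary surprisal) transform discussed in the Greenland
# papers cited above, plus the verbal categories from the proposed wording.
# The category cut-offs (0.004, 0.2) come from the proposal, not a standard.
import math

def s_value(p):
    """Surprisal in bits: how many consecutive fair coin-toss heads would be
    as surprising as this p-value under the assumed model."""
    return -math.log2(p)

def evidence_label(p):
    if p < 0.004:
        return "considerable evidence against the null model"
    if p <= 0.2:
        return "some evidence against the null model"
    return "little evidence against the null model"

for p in (0.001, 0.05, 0.3):
    print(f"p = {p}: S = {s_value(p):.1f} bits, {evidence_label(p)}")
```

The coin-toss framing (“a p of 0.05 is about as surprising as 4 heads in a row”) may be easier for end users to calibrate than the verbal categories alone.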