GRADE guidelines, effect estimates and CIs

The GRADE approach (Grading of Recommendations Assessment, Development and Evaluation) is a very widely used tool for evaluating the certainty of evidence in healthcare.

The GRADE approach is usually associated with meta-analyses that pool single point estimates.

The grades and their statements are as follows:
|High| = We are very confident that the true effect lies close to that of the estimate of the effect
|Moderate| = We are moderately confident in the effect estimate: The true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different
|Low| = Our confidence in the effect estimate is limited: The true effect may be substantially different from the estimate of the effect
|Very low| = We have very little confidence in the effect estimate: The true effect is likely to be substantially different from the estimate of effect

As an example from 1
There were no differences in CSA of multifidus at the C2‐3 level between the chronic NSNP and control groups (two studies[38], [47]; SMD –0.30 [95% CI –1.12, 0.51], P = 0.47) ([Fig. 4]). The GRADE quality of evidence was very low.

There were no differences in disc degeneration between chronic NSNP and control, when assessed using a four‐point ordinal scale[49] and the five‐point Pfirrman scale[50] (two studies[49], [50]; OR 0.84 [95% CI 0.57, 1.24], P = 0.39) ([Fig. 5]). The GRADE quality of evidence was moderate.

(I have no idea about the scales in the second example, but I guess the ordinal scale was dichotomized to a binary outcome, which is why an OR is used.)

Stating that there is no difference is obviously wrong. In the first example the level of evidence is graded very low, which is reasonable since the data are consistent with a wide range of plausible values.

But then the second one. The authors claim that the quality of evidence is moderate, which by definition means “We are moderately confident in the effect estimate: The true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different”.

Again, the CI for the point estimate is consistent with a wide range of values. It is not right to declare “no difference”; a difference might be there. According to the statement, the authors have confidence in “the effect estimate”. If the point estimate is not exactly 1.000, doesn't that mean they believe that the other group has lower odds of disc degeneration (OR = 0.84), i.e. that they reject the null?
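To make the "wide range of values" point concrete, here is a rough sketch that backs out the standard error, z statistic and two-sided p-value from the two quoted intervals. It assumes a normal approximation (on the log scale for the OR) and a 1.96 multiplier; the function names are made up for illustration and are not taken from the review.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def summarize_ratio_ci(estimate, lower, upper, z_crit=1.96):
    """SE, z and two-sided p for a ratio measure (OR, RR), working on the log scale."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(estimate) / se
    return se, z, 2.0 * (1.0 - normal_cdf(abs(z)))

def summarize_difference_ci(estimate, lower, upper, z_crit=1.96):
    """SE, z and two-sided p for a difference measure (SMD, MD)."""
    se = (upper - lower) / (2.0 * z_crit)
    z = estimate / se
    return se, z, 2.0 * (1.0 - normal_cdf(abs(z)))

# Numbers quoted in the thread
or_se, or_z, or_p = summarize_ratio_ci(0.84, 0.57, 1.24)
smd_se, smd_z, smd_p = summarize_difference_ci(-0.30, -1.12, 0.51)

print(f"OR : SE(log OR) = {or_se:.3f}, z = {or_z:.2f}, p = {or_p:.2f}")
print(f"SMD: SE = {smd_se:.3f}, z = {smd_z:.2f}, p = {smd_p:.2f}")
```

The recovered p-values land close to the reported ones, which suggests the intervals were built this way. Under the same approximation, every OR between 0.57 and 1.24 is about as compatible with the data as "no difference" is.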

On the other hand, if “the effect estimate” is meant to include the interpretation of the CI, I don't understand how you can have confidence in both a point estimate and its CI, since the statement IMO implies “no further studies needed, this is it for the population value”, which would again make the CI redundant.

To conclude, it seems counterintuitive to state that one is “confident in the effect estimate” when that estimate differs from 1.000 (OR, RR) or 0.000 (SMD, MD) while, at the same time, the CI for the point estimate does not exclude 1 (or 0), indicating that the null hypothesis cannot be rejected. Very often, of course, this is read as “no difference” or “equal”, which is debatable.

In other words, is there an inherent flaw in the use of the GRADE system, or am I missing the point here?

I had always interpreted the GRADE criteria as an appraisal of aspects unrelated to confidence intervals, i.e. making a recommendation for an intervention involves more than just the estimated risk reduction for the primary outcome.

The handbook provides specific examples:

Not having read the handbook, I very much resonate with @aleksi. To answer his odds ratio question, a truly ordinal analysis such as the proportional odds model (a generalization of the Wilcoxon two-sample test) results in a single odds ratio. It is the ratio of odds of exceeding a given level of severity, for two treatments. By the proportional odds assumption this ratio is the same no matter which severity threshold you use.
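A small sketch of the "same ratio at every threshold" property, assuming a cumulative logit (proportional odds) model. The cutpoints are hypothetical, and 0.84 is borrowed from the quoted result only as a convenient number; nothing here implies the review actually fitted this model.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def odds(p):
    return p / (1.0 - p)

def exceedance_probs(cutpoints, shift):
    """P(Y > threshold) at each cutpoint under a cumulative logit model.

    `shift` is the group effect on the log-odds scale (0 for the reference group).
    """
    return [expit(shift - c) for c in cutpoints]

cutpoints = [-1.0, 0.5, 2.0]   # hypothetical thresholds of a four-level severity scale
log_or = math.log(0.84)        # illustrative effect, log-odds scale

reference = exceedance_probs(cutpoints, 0.0)
comparison = exceedance_probs(cutpoints, log_or)

# Odds ratio of exceeding each severity threshold
odds_ratios = [odds(c) / odds(r) for c, r in zip(comparison, reference)]

for cut, r, c, ratio in zip(cutpoints, reference, comparison, odds_ratios):
    print(f"threshold {cut:+.1f}: P_ref = {r:.3f}, P_cmp = {c:.3f}, OR = {ratio:.3f}")
```

The exceedance probabilities differ a lot between thresholds, but the odds ratio is identical at each one, so no dichotomization is needed to report a single OR.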

It seems they are actually looking for a posterior distribution, where the width of the x% Highest Density Interval (or a similar measure) indicates how certain you are.

Under a number of assumptions CIs don’t differ too much numerically from HDIs, but you should check whether these assumptions hold.

^^ nice article describing that.

@Raj I understand that they consider all relevant aspects, such as bias, heterogeneity and imprecision, which contribute to the “confidence”. But still, to me it seems that it reduces to the interpretation of a point estimate. And if “confidence” is high, meaning that no further studies are needed, I see a conflict with interpreting findings that fail to reject the null.

@pieter I don't see any mention of HDIs or credible intervals in the manual?

Preface: I’m not familiar enough with the GRADE process to comment with authority, but I will offer my thoughts on how I read studies like this. I also agree that a statement of “no difference” is a flawed interpretation given such a wide confidence interval.

To go into GRADE's “confidence in the point estimate”: I had always interpreted this as part of the process of evaluating risk of bias and confounding, especially in observational studies. I.e. if there was a confounder that could not be adjusted for with the measures available in the data, it would throw the accuracy of the point estimate into doubt.

Specific to your example, the authors discuss in the bias section that there was disagreement on “measurement of different levels of exposures” (i.e., stratification of WAD or NSNP symptoms). If the discrepancy was large, and many patients with milder symptoms were categorized as more severe in the data (or vice versa), I would be hesitant to accept the accuracy of the confidence interval. As it turns out, the authors think the discrepancy was not great, nor did they find other obvious biases. So to me, it seems reasonable to comment on the accuracy of the estimate, i.e. “true effect is likely to be close to the estimate”.

My take away from this nice paper is that correlating MRI findings to neck pain is very difficult, something that is consistent with my real life experiences with patients.

You are right. They did this: “By taking the 2.5% and 97.5% percentile points of this distribution, one achieves the 95% credible interval.”

There are several ways to determine 95% credible intervals. Any interval that contains 95% of the posterior mass suffices. It is described clearly in this paper:

My point was that it is a fallacy to interpret a narrow confidence interval as precision or certainty of the estimate. What the guidelines seem to be looking for is a Bayesian credible interval (of whatever type, HDI or other).
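To illustrate "of whatever type", here is a minimal sketch that computes both a percentile (equal-tailed) interval and an HDI from posterior draws. The draws are simulated from a Gamma(2, 1) distribution purely as a stand-in for a skewed posterior; that choice and the function names are assumptions made for the example.

```python
import random

def equal_tailed_interval(samples, mass=0.95):
    """Percentile interval: cut (1 - mass) / 2 of the draws from each tail."""
    s = sorted(samples)
    n = len(s)
    lo_idx = int(round((1.0 - mass) / 2.0 * (n - 1)))
    hi_idx = int(round((1.0 + mass) / 2.0 * (n - 1)))
    return s[lo_idx], s[hi_idx]

def highest_density_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the draws (sample-based HDI)."""
    s = sorted(samples)
    n = len(s)
    k = int(round(mass * n))  # number of draws the interval must contain
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

random.seed(1)
# Stand-in for a skewed posterior (hypothetical, not from any study in the thread)
draws = [random.gammavariate(2.0, 1.0) for _ in range(20000)]

et = equal_tailed_interval(draws)
hdi = highest_density_interval(draws)

print(f"equal-tailed 95%: ({et[0]:.2f}, {et[1]:.2f}), width {et[1] - et[0]:.2f}")
print(f"HDI 95%         : ({hdi[0]:.2f}, {hdi[1]:.2f}), width {hdi[1] - hdi[0]:.2f}")
```

Both intervals hold 95% of the posterior mass. For a symmetric posterior they coincide; for a skewed one the HDI is shorter and shifted toward the mode.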

See here for more on the confidence interval fallacies:

The authors identify three fallacies regarding confidence intervals:
Fallacy 1 (The Fundamental Confidence Fallacy)
“If the probability that a random interval contains the true value is X%, then the plausibility or probability that a particular observed interval contains the true value is also X%; or, alternatively, we can have X% confidence that the observed interval contains the true value.” (wrong)

Fallacy 2 (The Precision fallacy)
“The width of a confidence interval indicates the precision of our knowledge about the parameter. Narrow confidence intervals correspond to precise knowledge, while wide confidence errors correspond to imprecise knowledge.” (wrong)

Fallacy 3 (The Likelihood fallacy)
“A confidence interval contains the likely values for the parameter. Values inside the confidence interval are more likely than those outside. This fallacy exists in several varieties, sometimes involving plausibility, credibility, or reasonableness of beliefs about the parameter.” (wrong)


Genuine question here: if I follow a strict interpretation of confidence intervals, how can I make any meaningful interpretations of study results? Is it really unreasonable to assume a flat prior and interpret confidence intervals like credible intervals?
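For what it is worth, the flat-prior question can be checked numerically in the simplest setting. The sketch below assumes a normal likelihood for the log OR with known standard error and a normal prior (conjugate update); the "skeptical" prior SD of 0.35 is a made-up value for illustration only.

```python
import math

def posterior_normal(estimate, se, prior_mean, prior_sd):
    """Conjugate normal update: normal likelihood for the estimate, normal prior."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, math.sqrt(post_var)

def interval(mean, sd, z=1.96):
    return mean - z * sd, mean + z * sd

# Log OR and its SE, backed out of the quoted OR 0.84 [0.57, 1.24]
log_or = math.log(0.84)
se = (math.log(1.24) - math.log(0.57)) / (2.0 * 1.96)

ci = interval(log_or, se)
nearly_flat = interval(*posterior_normal(log_or, se, 0.0, 1000.0))  # very vague prior
skeptical = interval(*posterior_normal(log_or, se, 0.0, 0.35))      # hypothetical prior

for name, (lo, hi) in [("95% CI", ci), ("nearly flat prior", nearly_flat),
                       ("skeptical prior", skeptical)]:
    print(f"{name:18s}: OR {math.exp(lo):.2f} to {math.exp(hi):.2f}")
```

In this normal setting the nearly flat prior reproduces the confidence interval numerically, so the shortcut is an implicit choice of prior. Any informative prior gives a different, here narrower and more null-centred, interval.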

The GRADE system (like any method) is not perfect, but if the average reader uses the folk interpretation of confidence intervals (correct or incorrect), it seems reasonable to add some comments about the perceived accuracy of the point estimate.

My personal opinion is that if you are using frequentist methods you should not mix in Bayes. If you want to be Bayesian be Bayesian. And don’t use flat priors. Confidence intervals deal with long-run properties of the method.

I agree with not mixing methods when interpreting results within a study.

But when applying evidence (in my case toward patient care), I also need to assess external validity. And not just “do I think hypothesis is true or false”, but expected relative and absolute effect size for intervention. It seems reasonable to take a Bayesian approach for this process of externalizing validity, even if results are frequentist.

But if there is another/better approach, I would love to learn!

Very good questions; as a matter of fact, I don’t have an answer!

I suppose this is what makes us all Bayesians in the end! :)

The interpretation of confidence intervals (better called compatibility intervals) is akin to the interpretation of cross-validation or bootstrap validation of a predictive model. Resampling methods validate the process used to fit the model, not the final fitted model per se. But the likely future performance of the process is still a good estimate of the likely future performance of “the” model.


Yes, the value of frequentist methods is in being reasonably correct in the long run. This doesn’t help with the interpretation of single studies (or single confidence intervals) though… @Raj

In the words of Neyman: “no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint… we may search for rules to govern our behaviour… we insure that, in the long run of experience, we shall not often be wrong.”

If I paraphrase that: “a single confidence interval does not tell you what the population mean and dispersion are. But…we insure that, in the long run of experience, we shall not often be wrong.”

So the price for long run error control (and for not having to pick a prior distribution) is that inference from individual experiments is not possible.
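A quick simulation of what "long run" means here, as a sketch: many hypothetical experiments are drawn from a normal population with a known SD, and a 95% z-interval is computed for each. The true mean, SD, sample size and number of repetitions are arbitrary choices for the example.

```python
import math
import random

def ci_covers(true_mean, sigma, n, rng, z=1.96):
    """Simulate one experiment and report whether its 95% CI contains the true mean."""
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    mean = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)  # known-sigma z-interval
    return mean - half_width <= true_mean <= mean + half_width

rng = random.Random(2024)
n_experiments = 5000
hits = sum(ci_covers(true_mean=0.3, sigma=1.0, n=20, rng=rng)
           for _ in range(n_experiments))
coverage = hits / n_experiments

print(f"{hits} of {n_experiments} intervals covered the true mean ({coverage:.1%})")
```

About 95% of the intervals cover the true value, which is the guarantee on the procedure. Each individual interval either covers it or does not, and nothing in the output says which kind a given interval is.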

Correct me if I’m wrong.


Am I the only one who finds it odd that the guidelines for “evidence based medicine” are based upon a philosophy of decision theory that selects for rule sets that are indifferent to the state of nature?

This discussion of frequentist vs Bayesian analysis was very helpful to me.
Beyond Bayesians and Frequentists

The essential difference between the two perspectives is the loss function. Bayesians deal with approximating the average case; Frequentists deal with approximating the worst case.

I don’t dispute that the frequentist attitude has value in many contexts. But that attitude also makes it difficult to accumulate and synthesize knowledge. In the context of research synthesis, I am interested in the average case, or at least creating a defensible, informative prior from previous research reports to decide on whether new research should be done, or if the findings are reasonably consistent (even after adjustment for issues like various types of bias), that an action/policy can be decided upon now.

At least from a research synthesis POV, more work needs to be done from a likelihood perspective. That separates the questions of:

  1. What should I do?
  2. What should I believe?
  3. What do the data say?

It is question 3 that is of interest to the research synthesizer. Ultimately, researchers need a better grounding in decision theory to make sense of these issues, than a mastery of “off the shelf” cookbook techniques, whether Frequentist or Bayesian.

I don’t see a reason to say that Bayes is interested in the average case and frequentist is interested in the worst case.

I disagree with this:

2.3. When To Use Each Method
The essential difference between Bayesian and frequentist decision theory is that Bayes makes the additional assumption of a prior over \theta, and optimizes for average-case performance rather than worst-case performance. It follows, then, that Bayes is the superior method whenever we can obtain a good prior and when good average-case performance is sufficient. However, if we have no way of obtaining a good prior, or when we need guaranteed performance, frequentist methods are the way to go. For instance, if we are trying to build a software package that should be widely deployable, we might want to use a frequentist method because users can be sure that the software will work as long as some number of easily-checkable assumptions are met.

^This misses the point (and/or is simply wrong, IMHO). Frequentism and Bayesianism answer different questions. Are you ultimately interested in Pr(data | \theta) or in Pr(\theta | data)? Are you interested in long-term error control or in learning from the specific data at hand?


I agree there are some points to take issue with. I absolutely disagree with his claim that Bayesian techniques are weak in adversarial contexts. That is a very narrow look at Bayesian theory. But the broad point about attitude toward the loss function is a viable one for distinguishing between the Frequentist philosophy and the Bayesian one. He is not the only one to advocate for it.

As Prof. Harrell disputes this point as well, I am gathering some references in defense of it.

But suffice it to say, statistical philosophy is a continuum, not discrete. Analysis methods can go from extreme emphasis on mini-max decision rules, to a full subjective Bayesian approach that optimizes the subjective expected utility of one decision maker. There are also many points in between, including Empirical Bayes which is a reasonable compromise.
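As one concrete reading of the average-case versus worst-case distinction (a textbook illustration, not something that settles the disagreement above), the sketch below compares two estimators of a binomial proportion under squared error loss: the MLE x/n and the classical minimax estimator (x + √n/2)/(n + √n). The sample size and the evenly spaced grid of p values used for averaging, a stand-in for a uniform prior, are assumptions of the example.

```python
import math

def risk(estimator, n, p):
    """Exact mean squared error of an estimator of p when X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, x) * p ** x * (1.0 - p) ** (n - x) * (estimator(x, n) - p) ** 2
        for x in range(n + 1)
    )

def mle(x, n):
    return x / n

def minimax(x, n):
    """Classical minimax estimator for squared error loss (constant risk in p)."""
    root = math.sqrt(n)
    return (x + root / 2.0) / (n + root)

n = 25
grid = [i / 200 for i in range(201)]  # evenly spaced p values, stand-in for a uniform prior

mle_risks = [risk(mle, n, p) for p in grid]
minimax_risks = [risk(minimax, n, p) for p in grid]

worst_mle, worst_minimax = max(mle_risks), max(minimax_risks)
avg_mle = sum(mle_risks) / len(grid)
avg_minimax = sum(minimax_risks) / len(grid)

print(f"worst-case risk: MLE {worst_mle:.5f}, minimax {worst_minimax:.5f}")
print(f"average risk   : MLE {avg_mle:.5f}, minimax {avg_minimax:.5f}")
```

The minimax rule wins on worst-case risk while the MLE wins on risk averaged over the grid, so which one is "better" depends entirely on which summary of the risk function you optimize.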

Michael I. Jordan: Are you A Frequentist or Bayesian
First 15-20 minutes is most relevant.

For those pressed for time, the first 13 slides are most relevant.

Old blog post by Larry Wasserman comparing a Frequentist “confidence” (aka coverage or compatibility) interval with a Bayesian credible interval. Suffice it to say, as long as the Frequentist sticks to offering no more than 19 to 1 against the next observation falling outside his interval, there is no way for the Bayesian to take advantage of his superior knowledge of the parameter value (assuming his prior is correct).

Followers of this topic should find this empirical evaluation of Cochrane reviews using GRADE useful:

Paraphrasing: “low quality” evidence had less variability than “high quality” evidence, which is just the opposite of what we would expect to see if the rating process were valid.

Larry’s blog article is excellent, and so are some of the severe criticisms of it in the discussions that follow.
