I had originally contacted Frank Harrell with this issue and he suggested I post here for some discussion. While reviewing a JAMA article (doi:10.1001/jama.2018.14276) attempting to understand the application of a Bayesian analysis to existing RCT data, I happened to run across this NEJM article (doi:10.1056/NEJMoa1900906).
Part of my interest in this area is to address the woefully inefficient clinical trial process we have evolved toward, in which nearly every clinical question requires a large RCT. A “failed” trial means either “rejecting” the hypothesis or repeating a larger version. Most worrisome to me is that in many situations, such as off-patent drugs, there is insufficient funding or incentive for such a trial. Furthermore, with nutrition-related questions such as caloric intake, food composition, or vitamins/minerals/supplements, the interventions are treated as drugs, with the placebo arm treated as no intake, which does not make sense. Finally, for non-pharmacologic interventions such as exercise, traditional RCTs of sufficient magnitude will never take place. Thus alternative approaches are desperately needed.
Below is my original email to Frank:
The NEJM article basically concludes that vitamin D does not lower the risk of diabetes. They powered their trial for a 25% reduction in risk. When I look at Figure 3, it appears that there was a reduction, but not to the degree they hypothesized. Furthermore, most of the subgroups demonstrate trends in the directions I would predict as consistent with their hypothesis. For example, participants with lower serum levels of vitamin D (25-hydroxyvitamin D), who should respond better to normalization of serum levels, display a slightly lower risk. Blacks, who would be expected to have lower serum levels, also demonstrate a greater risk reduction. Obese individuals, who need higher intakes of vitamin D (vitamin D partitions to fat tissue), don’t respond as well as non-obese people; there’s a similar effect with waist circumference. Recommendations for vitamin D intake are higher in the elderly, and a larger response is seen there. Finally, there’s a bigger effect in individuals from higher latitudes, who would also be expected to start with lower vitamin D status.
In summary, all the directions of the various subgroups are consistent with the overall hypothesis, but because of the expectation of a specific effect size, they conclude no effect. An accompanying commentary does remark that there may be a smaller effect size, but this would require another larger trial.
My question relates to how Bayesian analysis can extract some useful information from this data set, as well as what would need to be set up at the outset to allow a Bayesian analysis, so that we’re not always in the position of looking at a “failed” trial and either carving out specific subgroups for follow-up trials or simply lathering, rinsing, and repeating with a larger trial. Getting away from yes/no trials to an approach that can offer a spectrum of results would be truly innovative, and would accelerate our ability to translate clinical concepts to medical practice.
You make excellent points about other interventions not having enough funding for an RCT. I’ve asked myself similar questions.
In terms of the primary study – if you assume an effect distribution around zero before seeing the data, then use their data to update your prior, you can provide evidence for a range of effects that were not initially planned for in the study.
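For concreteness, here is a minimal sketch in R of that kind of update, using a normal approximation to the likelihood for the log hazard ratio and a prior centered at no effect. The hazard ratio and confidence interval below are illustrative placeholders, not the trial’s published values; substitute the published estimates.

```r
# Sketch: update a neutral (skeptical) prior on the log hazard ratio with a
# normal approximation to the trial likelihood. hr_hat and ci are placeholders.
hr_hat <- 0.88
ci     <- c(0.75, 1.04)

theta_hat <- log(hr_hat)                          # observed log HR
se_hat    <- diff(log(ci)) / (2 * qnorm(0.975))   # SE back-calculated from the CI

# Prior centered at "no effect", giving HR outside (0.5, 2) only ~5% prior probability
prior_mean <- 0
prior_sd   <- log(2) / qnorm(0.975)

# Conjugate normal-normal update
post_var  <- 1 / (1 / prior_sd^2 + 1 / se_hat^2)
post_mean <- post_var * (prior_mean / prior_sd^2 + theta_hat / se_hat^2)
post_sd   <- sqrt(post_var)

# Evidence for a spectrum of effect sizes, not just the 25% reduction the trial targeted
pnorm(log(1.00), post_mean, post_sd)   # P(any risk reduction | data)
pnorm(log(0.90), post_mean, post_sd)   # P(at least a 10% reduction | data)
pnorm(log(0.75), post_mean, post_sd)   # P(at least a 25% reduction | data)
```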
I would also think a Bayesian analysis of the subgroups could be persuasive evidence in terms of the effect.
Edit: Links to related and relevant posts on this topic:
From a broader point of view, a better analysis of RCT data can, when examined from a Bayesian decision-theory standpoint, lead to the design of an experiment that will decide the clinically relevant question.
Here is what I think after having given this issue a lot of thought. It seems reasonable to me, but I would value some additional input from scholars in this area.
A quick way to describe my thoughts would be a Bayesian parametric meta-analysis of non-parametric primary effect sizes.
My main emphasis would be on the effect estimate (regardless of significance) and on the design (to see whether precision needs to be adjusted downward because of errors in the analysis such as dichotomization, improper change-score analysis, etc.; Dr. Harrell lists a number of these in his free booklet Biostatistics for Biomedical Research, AKA BBR).
My preferred estimate of effect would be some sort of odds ratio related to the logistic model. I think parametric effect sizes based on standardized means are more fragile than is generally appreciated.
Standardized mean differences are easily translated into log odds ratios.
See the following for an informal proof of translating means into odds:
The actual ratio to multiply the standardized mean effect by is $\frac{\pi}{\sqrt{3}} \approx 1.81$.
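A quick numeric illustration of that conversion (an approximation based on the logistic distribution’s standard deviation of π/√3, not an exact identity):

```r
# Approximate conversion of a standardized mean difference (Cohen's d) to a
# log odds ratio, using the logistic distribution's SD of pi/sqrt(3).
smd_to_logor <- function(d) d * pi / sqrt(3)

smd_to_logor(0.3)        # log odds ratio ~ 0.54
exp(smd_to_logor(0.3))   # odds ratio ~ 1.7
```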
I guess you can say I share Dr. Harrell’s preference for the Wilcoxon-Mann-Whitney as my default 2-sample test, and for the proportional odds logistic model from which it can be derived.
You could do a meta-analysis of the relevant trials, adjust for publication bias, then do a bootstrap on the corrected effect size estimates.
Points inside the bootstrap CI could be defensible point estimates to base a Bayesian prior distribution on. If the 25th percentile of the bootstrap distribution is assumed to be the mean of a normal distribution – is it far enough from 0 that another study would be hard to justify?
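A rough sketch of that workflow, assuming the metafor package; the log odds ratios and variances below are simulated stand-ins for the real trial-level data, and the specific percentile chosen is just one option:

```r
# Sketch: random-effects meta-analysis, trim-and-fill adjustment for
# publication bias, then a bootstrap over studies to get a distribution of
# bias-corrected pooled effects. yi (log ORs) and vi (variances) are simulated.
library(metafor)

set.seed(1)
k  <- 12
yi <- rnorm(k, mean = -0.20, sd = 0.15)   # simulated study log odds ratios
vi <- runif(k, 0.01, 0.05)                # simulated sampling variances

boot_est <- replicate(1000, {
  idx <- sample(k, replace = TRUE)                   # resample studies
  fit <- rma(yi[idx], vi[idx], method = "REML")      # random-effects model
  unname(coef(trimfill(fit)))                        # bias-corrected pooled effect
})

quantile(boot_est, c(0.25, 0.5, 0.75))   # candidate prior means

# e.g., use the 25th percentile as the mean of a normal prior on the log OR,
# with the bootstrap SD (or something deliberately wider) as its SD
prior_mean <- quantile(boot_est, 0.25)
prior_sd   <- sd(boot_est)
```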
More complicated models would require the use of meta-regression. The logistic model would be a natural fit here.
Empirical Bayes techniques have been described in this area that might help you persuade the dogmatic frequentists. Using an Empirical Bayes approach gives you a posterior distribution that can be interpreted as a predictive model for future studies.
I’ve already collected a number of papers related to the issue of meta-analysis in this thread.
Michael, your questions are excellent. There are a lot of basic issues that pertain that I’d like to start with. First, in the Pittas vitamin D paper, the NEJM did what they so often do: make the “absence of evidence is not evidence of absence” error by completely misinterpreting the p-value of 0.12. Bayesian posterior inference has many advantages, and here is one of them: the posterior distribution tells you to what extent you can conclude that two treatments are similar. For example, you can compute P(hazard ratio between 0.9 and 1/0.9 | data) if your “similarity zone” runs from a 10% reduction in hazard up to the corresponding increase. As for the efficacy of vitamin D, the assessment of any efficacy is P(HR < 1 | data) given your prior. Any authors who want to conclude that a treatment does not work should go through this exercise.
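With a normal approximation to the posterior for the log hazard ratio, these are one-line calculations; the posterior mean and SD below are placeholders, not results from the trial:

```r
# Placeholders: posterior mean and SD of the log hazard ratio
# (e.g., from a conjugate update or from MCMC draws).
post_mean <- -0.10
post_sd   <-  0.07

# P(0.9 < HR < 1/0.9 | data): probability the treatments are "similar"
pnorm(log(1 / 0.9), post_mean, post_sd) - pnorm(log(0.9), post_mean, post_sd)

# P(HR < 1 | data): probability of any efficacy
pnorm(0, post_mean, post_sd)
```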
A second way that Bayes helps is that all efficacy assessments in Bayes are directional. Contrast that with a 2-tailed test. A 2-sided p-value is effectively using a multiplicity adjustment for the off chance that you may want to make a claim that a drug increases mortality. By being interested only in amassing evidence for a mortality reduction, Bayes provides a higher posterior probability of efficacy than you would imagine from a 2-sided p-value.
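As a rough illustration of the gap (a heuristic that holds under a flat prior and a symmetric normal approximation to the likelihood, when the observed effect is in the beneficial direction):

```r
# Heuristic: under a flat prior and a normal likelihood, the posterior
# probability of benefit is roughly 1 minus half the two-sided p-value.
p_two_sided <- 0.12
1 - p_two_sided / 2   # ~ 0.94
```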
Finally, I’ll briefly address the central question of how do we quantify evidence and how does this relate to decision making. Since a p-value is the probability that someone else’s data will be more extreme than yours if your null hypothesis is true for them, it provides no direct evidence for an effect whatsoever. Because of that, we don’t have much of a clue for when to act as if a treatment is effective, and we don’t have a clue about the chance that we will be making a mistake in taking a certain action. By contrast, a posterior probability of efficacy of 0.94 immediately tells us that if we act as if the treatment is effective we have a 0.06 chance of being wrong.
An optimum Bayes decision does something like maximizing expected utility. Expected utility is a function of the entire posterior distribution and the utility function. When the utility function puts a high penalty on using a drug when it doesn’t work (regulator’s regret), higher values of the posterior probability of efficacy will be required to make the decision of approving a drug. On the other hand, when patients do not have an alternative treatment available, as in rare diseases or Alzheimer’s, or when a drug is cheap and has no side effects, most people’s utility function will be such that just playing the odds will give a good decision. So in some cases if the probability of efficacy is 0.51 or greater it would be wise to act as if the drug is effective, and use it.
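A toy sketch of that trade-off, with completely made-up utilities, just to show how the penalty for regulator’s regret moves the required probability of efficacy:

```r
# Toy expected-utility decision. Utilities are arbitrary illustrations:
# u_treat_fails penalizes using a drug that doesn't work ("regulator's regret");
# u_no_treat is the utility of not acting.
decide <- function(p_efficacy, u_treat_works = 1, u_treat_fails = -1, u_no_treat = 0) {
  eu_treat <- p_efficacy * u_treat_works + (1 - p_efficacy) * u_treat_fails
  if (eu_treat > u_no_treat) "act as if effective" else "do not"
}

decide(0.51)                        # symmetric utilities: just playing the odds
decide(0.51, u_treat_fails = -9)    # heavy regulator's regret: 0.51 is not enough
decide(0.94, u_treat_fails = -9)    # here P(efficacy) must exceed 0.90
```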
The latter issue underscores the silliness of the NEJM paper’s conclusion.
I’m convinced that’s the way it is. The difficulty is that simple tutorials or even books on Bayesian survival analysis in R are still rare today. This is a problem given that the selection of priors is subtle and nuanced. Do you have any suggestions?
I’ll soon be releasing a semi-comprehensive set of handouts on Bayesian methods in treatment comparisons, and will follow that up with a journal article that will hopefully serve as a tutorial. These should help.
Frank, it’s very stimulating. Your commitment is remarkable, as always. I’ve been reading the BEST package tutorial tonight, and the Kruschke article. Very interesting and useful. The graphics are very nice, and the information it provides is incredible. Note, however, that there are no simple tutorials on Bayesian survival analysis. The explanations of the available R packages on that topic are quite complicated, without many examples. Only real experts in the field can really take advantage of those explanations about Bayesian survival methods in R, so they don’t play any didactic role. Today it is impossible for me to learn Bayesian survival analysis from these texts; I’d have to stop practicing medicine and study for two years. The lack of tutorials is, I think, the main reason why Bayesian statistics are not popular. No one knows how to handle Bayesian survival analysis (at least in R) except a few experts in the world. Just look at the explanation of the spBayesSurv package, possibly the work of a great genius, but an indecipherable hieroglyph for most people. I think a good contribution could be made from your field to ours if any of you could bridge the gap between those apparently complicated algorithms and clinical reality. Perhaps a specific thread, similar to the dictionary of statistical terms, could be opened in Datamethods to gather simple information on Bayesian survival tutorials, when available.
We definitely need more tutorials in Bayesian survival analysis. I’ll keep a lookout for them. In the meantime look at this and ask for guidance here. Also this. Maybe one of us will do a tutorial with spBayesSurv.
That’d be great. In the meantime, in addition to the above contributions, I’ve seen that there are plenty of accessible tutorials on Bayesian statistics, including survival analysis, in Stata 15.