# Frequentist vs Bayesian debate applied to real-world data analysis: any philosophical insights?

I’ve been reading about the benefits of the Bayesian versus the frequentist approach in clinical trials. However, I don’t know whether any specific insights apply to the real-world data scenario, where observational studies carry an increased risk of bias. What can be said about the Bayesian-frequentist debate that is specifically useful for observational studies? Does anyone have any philosophical ideas about it?

The attributes of Bayes that make the approach attractive for randomized experiments are also available for observational research. Here are a few:

• Most users of frequentist methods do not realize that except in rare cases such as when normality and constant variance assumptions hold, p-values and confidence intervals are approximate. The most frequently used approximations, based on Wald statistics, become worse the more non-quadratic the log-likelihood function. The binary logistic model is a good case in point, and so are mixed effects models. Bayesian calculations are exact.
• As detailed in Nate Silver’s The Signal and the Noise, a Bayesian approach provides a way to obtain actionable evidence when randomization is impossible. His beautiful example is the evidence for cigarette smoking causing lung cancer. None other than Ronald Fisher himself argued that one cannot learn from observational data in this context because “lung cancer could cause smoking”, so without randomization no conclusions could be made (he had significant consulting income from the tobacco industry). Silver showed how a Bayesian analysis that is very skeptical about the effect of smoking on lung cancer is still definitive, and would have probably made the US Surgeon General issue his warning 10 years sooner.
• Bayesian methods allow for incorporation of non-data information, which for observational studies is even more important than for experiments. Think about all the studies claiming to show that food x is associated with health outcome y. Bayesian analysis can place extreme skepticism on such associations.
• Bayesian posterior probabilities also have the property of working in a predictive, actionable mode rather than a backwards-information-flow mode.
• The job of a Bayesian analysis is to reveal the hidden truth in the sense of what data generating mechanism generated the observed data. This mechanism may be a one-time event with no repetitions possible. Frequentists must envision an infinite repetition of the same experiment to get long-run operating characteristics. Bayesians are more interested in “what do we have now?”.
• As Andrew Gelman has written, type S errors (getting the wrong sign on an effect) are real dangers, especially in observational research. Some of the risk of type S errors is due to effectively using a flat prior for the parameter of interest, either by being frequentist or by using a Bayesian non-informative prior. Informative priors that disfavor extreme values and completely rule out impossible values (e.g., smoking improving lung function; aging making you less susceptible to arthritis) improve the situation.
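The type S risk in the last bullet can be made concrete with a small design calculation. The sketch below is a normal-approximation version of the kind of calculation described by Gelman and Carlin; the function name `type_s_risk` and the example effect sizes are illustrative assumptions, not numbers from any study discussed in this thread.

```r
# Probability that a "statistically significant" estimate has the wrong sign,
# assuming the estimate is normally distributed around the true effect.
# true_effect and se must be on the same scale (e.g. log odds ratio).
type_s_risk <- function(true_effect, se, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  ratio <- abs(true_effect) / se
  p_sig_right_sign <- pnorm(ratio - z_crit)   # significant, correct sign
  p_sig_wrong_sign <- pnorm(-ratio - z_crit)  # significant, wrong sign
  power <- p_sig_right_sign + p_sig_wrong_sign
  c(power = power, type_s = p_sig_wrong_sign / power)
}

# Hypothetical designs: a small effect measured noisily vs. a large, precise one
small_noisy <- type_s_risk(true_effect = 0.1, se = 1)
large_precise <- type_s_risk(true_effect = 3, se = 1)
print(rbind(small_noisy, large_precise))
```

With a flat prior, nothing in the analysis itself warns that the first design is in the high type S regime. An informative prior that shrinks implausibly large estimates is one way to build that warning in.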
4 Likes

not sure if this fits ‘observational study’, but i read this paper recently: Handling Multiplicity in Neuroimaging through Bayesian Lenses with Hierarchical Modeling

It was attacked on pubpeer by one individual, but i thought the authors handled it well: pubpeer comments

but i’m interested in the Q and wish i knew more about it… I’ll read @f2harrell 's comments…

3 Likes

Wonderful answer indeed, thank you very much. A condensed lecture in a post. From this answer, I understand that one of the specific advantages of Bayesian analysis, applied to real-world data (RWD), is that using priors offsets some of the biases inherent in observational studies. This could be a specific advantage of Bayesian analysis in this type of design. Right?
For example, in situations where there is a lot of evidence from RCTs, one could use the background information from these studies, and try to improve applicability to special subgroups (e.g., the elderly, or patients with special risk factors…) by applying Bayesian regressions to new RWD focusing on specific populations. What do you think?
PS: As an oncologist, I’ve been fascinated by the story of Silver and tobacco. I’m going to buy the book.

2 Likes

I like those thoughts. And I highly recommend Silver’s book. It’s both scholarly and highly readable.

2 Likes

In retrospect, and although this thread is now a bit old, I would like to report on the influence this comment by Professor Harrell had on me. After reading this passage I spent some time reviewing the Doll & Hill article and Fisher’s criticisms of its causal claims, and finally came to Jerome Cornfield’s article, to which Frank Harrell alluded. Right around that time, during lockdown, I read Lippi’s meta-analysis, which claimed that active smoking did not worsen the severity of Covid-19, committing the typical error of treating absence of evidence as evidence of absence (OR, 1.69; 95% CI, 0.41-6.92; p=0.254). The analysis comprised 5 studies and slightly more than 1000 patients:

The parallel with Cornfield’s history was so tempting that I couldn’t help but redo Lippi’s meta-analysis with a Bayesian approach, finding a posterior probability of harm of 95%:

This short abstract summarized it all.
Finally, almost one year later, an updated meta-analysis was published one month ago that resolved the issue from the frequentist approach:

The authors confirmed that active smoking worsened the severity of Covid-19, as expected (OR=1.58; 95% CI: 1.16-2.15). Interestingly, this required >300,000 patients from 40 studies.
This experience suggests that Bayesian statistics can allow pragmatic and reasonable conclusions to be drawn many months earlier than frequentist statistics, which makes it well suited to situations of high noise or uncertainty where urgent answers are required.
I found it useful to share this experience.

5 Likes

I think we need to keep in mind the important distinction between correct frequentist statistics, and simply bad statistics.

@Sander posted a link to a youtube video (I’ve also posted it here in a few different threads) on p-values, confidence intervals, and their common, but incorrect interpretations.

It was mentioned in passing in the video that Jeffreys and Fisher would pose statistical puzzles to one another. Jeffreys commented that they rarely disagreed in their interpretation using the methods each of them had derived.

(If anyone has a source for this aside from the video, I’d appreciate it).

Just looking at your first link made me very curious as to how they came to this conclusion. I’m no physician, but considering what is known about smoking, I find the idea of “no association” with a serious respiratory disease implausible, to put it charitably.

They did an effect size meta-analysis of 5 studies (!), where 1 study had nearly 90% of the total weight. Their figure 1 lists the odds ratio estimated by this study as 1.51, with a CI of 0.97 - 2.36.

To call this a “meta-analysis” or “evidence synthesis” when most of the weight is driven by 1 study is misleading at best.

Even worse, they misinterpreted their own data. It is erroneous to use the inclusion of the null in an aggregate CI as a shortcut for a hypothesis test, especially if the data are of questionable homogeneity. You can aggregate the actual signal from the individual studies by extracting the one-sided p-value from each and combining them using one of the many available methods.
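As a rough illustration of that last step, here is a minimal sketch using Fisher's combination method, which is only one of the available options. It assumes the reported interval is a 95% CI, that the log odds ratio is approximately normal, and that the studies are independent. Only the dominant study's numbers (1.51, 0.97 - 2.36) come from the report; the other four p-values are hypothetical placeholders, and the function names are my own.

```r
# One-sided p-value for "OR > 1", recovered from a reported odds ratio and CI,
# assuming the log odds ratio is approximately normal.
one_sided_p <- function(or, lower, upper, level = 0.95) {
  z_crit <- qnorm(1 - (1 - level) / 2)
  se <- (log(upper) - log(lower)) / (2 * z_crit)
  pnorm(log(or) / se, lower.tail = FALSE)
}

# Fisher's method: -2 * sum(log(p)) follows a chi-square distribution with
# 2k degrees of freedom when all k null hypotheses are true.
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

p_dominant <- one_sided_p(or = 1.51, lower = 0.97, upper = 2.36)

# HYPOTHETICAL one-sided p-values standing in for the other four studies
p_other <- c(0.20, 0.35, 0.10, 0.50)

p_combined <- fisher_combine(c(p_dominant, p_other))
print(c(dominant = p_dominant, combined = p_combined))
```

Fisher's method treats every study equally, so a real analysis might prefer a weighted alternative such as Stouffer's method.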

As someone who is sympathetic to the Bayesian philosophy, I still find the correct use of p-values helpful. I combined Bayesian reasoning with a frequentist p-value combination procedure in another thread:

There is a Bayesian way to examine this report from the estimation perspective. Robert Matthews has adapted an important technique initially developed by I. J. Good that flips the problem around and derives the prior under which a finding is credible. It puts frequentist reports in a Bayesian context that is so often missing, leading to errors of interpretation. One of his more recent articles was published by the Royal Society (not long after the ASA’s p-value position paper).

Matthews, Robert A. J. (2018). Beyond ‘significance’: principles and practice of the Analysis of Credibility. R. Soc. open sci. 5: 171047.

Taking their aggregate estimate at face value, they report an OR of 1.69 (0.41 - 6.92). Is it credible, from a Bayesian perspective, that there is “no association”?

I set up a spreadsheet that uses the formulas Matthews derived in this paper. Using his notion of an “honest advocate”, I derived that such an advocate could hold a 95% prior credible interval for the odds ratio extending as high as 11.2 in favor of smoking being a factor in worse outcomes for Covid patients.

These data are entirely consistent with smoking being a negative factor for Covid. Anyone who is skeptical should justify this belief.

As far as philosophy: I recommend an old paper by Bradley Efron (with commentaries), titled:

I have come to find the comment by Herman Chernoff on this paper especially wise:

> With the help of theory, I have developed insights and intuitions that prevent me from giving weight to data dredging and other forms of statistical heresy. This feeling of freedom and ease does not exist until I have a decision theoretic, Bayesian view of the problem … I am a Bayesian decision theorist in spite of my use of Fisherian tools.

Addendum: @Stephen wrote another paper from a philosophical POV that is worth study.

Senn, Stephen (2011). You May Believe You Are a Bayesian But You Are Probably Wrong. Rationality, Markets and Morals, 2, issue 42. https://econpapers.repec.org/article/rmmjournl/v_3a2_3ay_3a2011_3ai_3a42.htm

3 Likes

Your comments are much appreciated. Thank you for your time and your very interesting remarks.
Could you please expand on how you estimated this credible interval, or share the spreadsheet?
I would like to have a look at that issue…

1 Like

Robert Matthews did all of the creative work in deriving the algebraic expressions. He has described them in this paper, as well as the one I linked to above. Both are worth reading. Further work in this area has been done by Leonhard Held.

Conventional reports dichotomize results into “significant” and “not significant”. Given only the CI, plug its limits into this formula to derive the “advocacy bound” when given a “not significant” report.

For ratios, the advocacy limit (AL) is:
$$AL = \exp\left(-\frac{\ln(UL)\,\ln^2\left(\frac{U}{L}\right)}{2\,\ln(U)\,\ln(L)}\right)$$

The Skeptical Limit (SL) for ratios (when given a “significant” report) is:
$$SL = \exp\left(\frac{\ln^2\left(\frac{U}{L}\right)}{4\sqrt{\ln(U)\,\ln(L)}}\right)$$

In both formulas, U = upper limit and L = Lower Limit

My spreadsheet should really be an R script, but I haven’t spent the time to write one up with validated test cases yet. The spreadsheet is close (up to 3 significant figures) to reproducing the examples given in his articles.
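For anyone who wants a starting point, here is a minimal, unvalidated R sketch of the two formulas above. The function names and the input checks are my own assumptions. The skeptical-limit function only handles intervals lying entirely above 1; a protective effect would need the ratio inverted first. The second example uses the updated meta-analysis interval (1.16 - 2.15) quoted earlier in the thread.

```r
# Advocacy limit (AL) for a "not significant" ratio CI, i.e. L < 1 < U
advocacy_limit <- function(L, U) {
  stopifnot(L > 0, L < 1, U > 1)
  exp(-(log(U * L) * log(U / L)^2) / (2 * log(U) * log(L)))
}

# Skeptical limit (SL) for a "significant" ratio CI lying entirely above 1
skeptical_limit <- function(L, U) {
  stopifnot(L > 1, U > L)
  exp(log(U / L)^2 / (4 * sqrt(log(U) * log(L))))
}

# Original meta-analysis: OR 1.69 (0.41 - 6.92), "not significant"
al_original <- advocacy_limit(L = 0.41, U = 6.92)

# Updated meta-analysis: OR 1.58 (1.16 - 2.15), "significant"
sl_updated <- skeptical_limit(L = 1.16, U = 2.15)

print(c(AL = al_original, SL = sl_updated))
```

The first call reproduces the advocacy limit of about 11.2 that the spreadsheet gave for the original meta-analysis.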

3 Likes

OK, I have now read Matthews’s article and would love to check that my understanding is correct, as I have difficulty understanding credibility in the absence of prior evidence.
The authors of the meta-analysis report an OR of 1.69 (0.41 - 6.92).
Applying the credibility analysis we would have:

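```r
# 95% CI limits reported for the pooled odds ratio
U <- 6.92
L <- 0.41

# Advocacy limit for a ratio measure (Matthews' formula)
AL <- exp(-(log(U * L) * log(U / L)^2) / (2 * log(U) * log(L)))

AL
```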


This gives an advocacy limit (AL) of approximately 11.1.

This means that if any prior evidence in favour of a harmful effect of smoking on Covid-19 severity was available with 95% CI between 1-11.1, the association would be credible because the posterior OR 95% credible interval would be >1.
Since tobacco is bad for all respiratory diseases, it is most likely also bad for Covid-19. However, let us imagine a situation in which there really was no prior information, or the above argument had to be dismissed as subjective. In this case, I find it harder to understand credibility with a non-significant result.
In the absence of prior evidence, say in a new field, one would have to use the central estimate M of the likelihood, i.e. 1.69. That midpoint lies between 1 and 11.1, but the lower bound is less than 1, so it would not carry enough weight for the posterior OR 95% CrI to be >1. In such a case, the result would still not be credible. Is that correct?
Would it not be better in this circumstance to apply the Bayesian analysis with a neutral prior? In this particular example we found a posterior probability of effect size OR >1 equal to 95%.

2 Likes

> This means that if any prior evidence in favour of a harmful effect of smoking on Covid-19 severity was available with 95% CI between 1-11.1, the association would be credible because the posterior OR 95% credible interval would be >1.

That is also how I understand the technique. The interval is very wide, so a “non-significant” result shouldn’t be used as justification to stop further research that might narrow down the precision of the estimate.
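That reading can be checked numerically. The sketch below is my own construction, not code from Matthews: it treats the reported CI as a normal likelihood on the log odds ratio scale, takes a normal prior whose 95% interval runs from OR = 1 to OR = AL, and combines the two with the standard conjugate normal update. Under those assumptions, the lower end of the posterior 95% interval should land on OR = 1.

```r
U <- 6.92
L <- 0.41
z <- qnorm(0.975)

AL <- exp(-(log(U * L) * log(U / L)^2) / (2 * log(U) * log(L)))

# Likelihood summarised on the log-OR scale from the reported CI
lik_mean <- (log(U) + log(L)) / 2
lik_sd <- (log(U) - log(L)) / (2 * z)

# Advocacy prior: 95% interval for the OR running from 1 to AL
prior_mean <- log(AL) / 2
prior_sd <- log(AL) / (2 * z)

# Conjugate normal update (precision-weighted average)
post_prec <- 1 / prior_sd^2 + 1 / lik_sd^2
post_mean <- (prior_mean / prior_sd^2 + lik_mean / lik_sd^2) / post_prec
post_sd <- sqrt(1 / post_prec)

# Posterior 95% credible interval, back on the OR scale
post_ci_or <- exp(post_mean + c(-1, 1) * z * post_sd)
print(post_ci_or)
```

Any advocacy prior confined within the range from 1 to AL would push the lower bound above 1, which is the sense in which such a prior makes the association credible.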

> In this case, I find it harder to understand credibility with a non-significant result.

Context will need to be used to provide a guide. If you had high prior odds that there was some effect, you might do another study and control for factors you hadn’t initially considered. If this was exploratory, and you had low prior odds on the validity of the effect, you might end your research process here.

(I’d caution against taking the point estimate of a retrospective meta-analysis at face value, due to heterogeneity.)

This goes back to the issue of how frequentist surprise measures relate to Bayesian priors.

I think this is how Chernoff interpreted Fisher’s methods in a Bayesian way. Rather than set an entire probability distribution, pick a point of reference as a default and calculate a statistic that isn’t as sensitive to prior assumptions. But design an experiment using $N$ and $\alpha$ so that it gives you the posterior odds necessary for the question.

IIRC Fisher advised using the smallest sample that would answer the question at hand. This implicitly combines NP power with the value of information.

Note the distinction between “plausible” and “dividing” hypotheses. At least in treatment contexts, we are looking at a “dividing” hypothesis. In other areas (e.g., genomics) there is strong prior belief in a very narrow interval null, which is approximated by using a point null as a plausible default probability model.

As for “intrinsically credible” significant results, it seems statisticians are rediscovering ideas that Walter Shewhart, the early pioneer of statistical process control (SPC) in manufacturing, developed. It is unfortunate his ideas were not mentioned in the debates about $\alpha$ levels a few years ago.

> Thus, when the process is being operated predictably, Shewhart’s generic three-sigma limits will result in values of P that are in the conservative zone of better than 97.5-percent coverage in virtually every case. When you find points outside these generic three-sigma limits, you can be sure that either a rare event has occurred or else the process is not being operated predictably. (It is exceedingly important to note that this is more conservative than the traditional 5-percent risk of a false alarm that is used in virtually every other statistical procedure.)

In the article, he compared how a very low p-value threshold would detect differences across large families of probability distributions. The 3-sigma limit was a useful rule, and has been in use for close to 100 years now.

Here is a more formal presentation where a $\pm 3$-sigma limit is used to filter out random noise relative to many families of probability distributions.

I’ve always thought the SPC methods were a very reasonable use of basic statistics. With very few distributional assumptions, limits of mean $\pm$ 3 SD will contain over 97% of the values from a predictable, controlled process, so points falling outside them (very low p-values / high sigmas) reliably flag a departure from that process.
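The coverage claim is easy to check for a few textbook cases. The sketch below computes the exact probability within mean $\pm$ 3 SD for four distributions of my own choosing (not the families examined in the linked article) and compares them with the distribution-free worst case given by Chebyshev's inequality.

```r
k <- 3  # number of standard deviations on each side of the mean

coverage <- c(
  normal = pnorm(k) - pnorm(-k),
  # Uniform(0, 1): mean 1/2, SD 1/sqrt(12)
  uniform = punif(0.5 + k / sqrt(12)) - punif(0.5 - k / sqrt(12)),
  # Exponential(rate = 1): mean 1, SD 1
  exponential = pexp(1 + k) - pexp(1 - k),
  # Chi-square with 4 df: mean 4, SD sqrt(8)
  chisq_4df = pchisq(4 + k * sqrt(8), df = 4) - pchisq(4 - k * sqrt(8), df = 4)
)

# Chebyshev's inequality: lower bound on coverage for ANY distribution
# with finite variance
chebyshev_bound <- 1 - 1 / k^2

print(round(coverage, 4))
print(chebyshev_bound)
```

All four cases clear 97.5%, including the skewed ones. The strict assumption-free guarantee is only the Chebyshev bound of about 89%, so the 97% figure is best read as a statement about distributions met in practice.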

The use of p-values in SPC is in agreement with Sander’s observation that p-values only alert you to a violation of 1 or more assumptions of the model used to calculate it. That is why I find the term “surprisal” so useful.

In SPC, the sigma threshold (calculated from observational historical data) is used as a guide on when to invest resources in research to improve the process. Perhaps we should think of other types of research that way.

1 Like