Time for a change in language: statistical divergence and the 95% non-divergent interval (95%nDI)

Whether these concepts are intrinsically defective seems to be a point of endless debate among experts. But there’s no arguing that any method or concept so convoluted and counterintuitive that it is still misapplied and misinterpreted by a large fraction of researchers and readers, even after decades of use, is defective in practice.

Bayesian framing of study results often feels more straightforward and less vulnerable to misinterpretation by non-expert readers. For this reason, it’s heartening to see more publications using these methods. But I do wonder to what extent Bayesian methods themselves are vulnerable to misapplication by suboptimally trained researchers. Any advantage that Bayesian methods might have with regard to ease of interpretation of results could easily be nullified by inexpert Bayesian study design.

Questions:

  1. To what extent are Bayesian methods currently emphasized within undergraduate and graduate statistics programs?

  2. What proportion of already-credentialed professional applied statisticians would you estimate are competent at designing trials using Bayesian methods?

  3. Assuming that a sufficient proportion of professional statisticians becomes expert in Bayesian methods by a certain future date, do you foresee that this teaching could, in turn, be propagated correctly to those who usually provide instruction to a clinical audience? Often, those who instruct clinicians on critical appraisal have some epidemiology or stats training, but primarily focus on clinical practice.

  4. If the answer to question 3 above is “no,” then do you anticipate that clinicians will simply need to concede ignorance on certain methodologic aspects of critical research appraisal (arguably, many of us already act like we understand frequentist methods when we really don’t…)? Would it be best for clinicians to limit their role in the design and appraisal of studies using Bayesian methods to certain aspects of the process (e.g., prior elicitation?), and to forgo any attempt to understand the nittier/grittier parts?


Frank, what do you think about this comment from Senn:

I am in perfect agreement with Gelman’s strictures against using the data to construct the prior distribution. There is only one word for this and it is ‘cheating’. But I would go further and draw attention to another point that George Barnard stressed and that is that posterior distributions are of little use to anybody else when reporting analyses. Thus if you believe (as I do) that far more use of Bayesian decision analysis should be made in clinical research it does not follow that Bayesian analyses should be used in reporting the results. In fact the gloomy conclusion to which I am drawn on reading (de Finetti 1974) is that ultimately the Bayesian theory is destructive of any form of public statistics. To the degree that we agreed on the model (and the nuisance parameters) we could communicate likelihoods but in practice all that may be left is the communication of data. I am thus tempted to adapt Niels Bohr’s famous quip about quantum theory to say that ‘anyone who is not shocked by the Bayesian theory of inference has not understood it’.

RE: Senn’s reported quote. Can you provide a source for this, or is it a private communication?

FWIW, I’d strongly disagree with it, and I’m surprised an experienced statistician such as Senn would express it.

The problem with this attitude is that the only frequentist procedures we care about are the “admissible” ones, which, by the complete class theorems, are Bayes procedures or limits of Bayes procedures.

Whether they like it or not, Frequentists (to the degree they are rational) are Bayesians. It is also true that Bayesians (when they are planning experiments) are Frequentists.

As for Senn’s claim that the Bayesian theory is “destructive of any form of public statistics”:

I’d argue the existence of public options markets is an obvious and direct refutation of this claim. All sorts of intangible (aka “subjective”) information are incorporated into the prices of assets and the risk-transfer mechanisms that permit the very scientific apparatus statisticians are a part of to be funded.

Seymour Geisser noted long ago in:

Geisser, S. (1975). Predictivism and sample reuse. University of Minnesota. (link)

The fundamental thesis of this paper is that the inferential emphasis of Statistics, theory and concomitant methodology, has been misplaced. By this is meant that the preponderance of statistical analyses deals with problems which involve inferential statements concerning parameters. The view proposed here is that this stress should be diverted to statements about observables. …

This stress on parametric inference made fashionable by mathematical statisticians has been not only a comfortable posture but also a secure buttress for the preservation of the high esteem enjoyed by applied statisticians because exposure by actual observation in parametric estimation is rendered virtually impossible. Of course those who opt for predictive inference i.e. predicting observables or potential observables are at risk in that their predictions can be evaluated to a large extent by either further observation or by a sly client withholding a random portion of the data and privately assessing a statistician’s prediction procedures and perhaps concurrently his reputation. Therefore much may be at stake for those who adopt the predictivistic or observabilistic or aparametric view.

Geisser points out in the introduction to his book Predictive Inference that:

…only the Bayesian mode is always capable of producing a probability distribution for prediction (my emphasis).

This implies to me that Bayesian methods (whether the Subjective Bayes that Senn is implicitly condemning, or the Objective Bayes that is the best a Frequentist can hope for without good prior information) are more open to public challenge and verification than the limited frequentist methods taught in the B.C. era (Before Computing).
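
To make Geisser’s point concrete, here is a minimal sketch (my own illustration with hypothetical counts and a flat Beta(1, 1) prior, not taken from his book) of a Bayesian predictive distribution: the analysis directly yields a probability distribution over a future observable, which anyone can later check against what is actually observed.

```python
# Minimal sketch, hypothetical counts: a posterior predictive distribution for a
# future observable (responders among the next 20 patients).
from scipy import stats

responders, n = 12, 40                        # hypothetical observed trial data
a, b = 1 + responders, 1 + (n - responders)   # Beta(1, 1) prior -> Beta(a, b) posterior
m = 20                                        # size of a future cohort

predictive = stats.betabinom(m, a, b)         # posterior predictive distribution
print("Pr(at least 8 of the next 20 respond) =", round(1 - predictive.cdf(7), 3))
```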

The caricature of frequentist methods taught in medical statistics is, at best, a funhouse image of legitimate non-Bayesian procedures (i.e., those that emphasize the generation of a valid likelihood function, or an isomorphic set of coverage intervals) that focus on protocols for data collection under a priori uncertainty. An important step forward in presenting Frequentist theory (via the construction of “confidence distributions”) is:

I recommend it highly, whether your statistical philosophy is Bayesian or not.
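
Since the reference itself did not carry over into this post, here is a minimal sketch (my own illustration with simulated data, not from that work) of a confidence distribution for a normal mean: a single curve C(mu) whose quantiles reproduce the usual confidence limits at every level, and whose value at a null point is the corresponding one-sided p value.

```python
# Minimal sketch, simulated data: a confidence distribution for a normal mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=25)   # hypothetical sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

def confidence_cdf(mu):
    """C(mu): the confidence distribution evaluated at mu."""
    return stats.t.cdf((mu - xbar) / (s / np.sqrt(n)), df=n - 1)

def confidence_quantile(q):
    """Inverse of C; pairs of quantiles give confidence intervals at any level."""
    return xbar + stats.t.ppf(q, df=n - 1) * s / np.sqrt(n)

# The 2.5th and 97.5th confidence quantiles form the familiar 95% interval,
# and C(0) is the one-sided p value for H0: mu <= 0.
print("95% interval:", confidence_quantile(0.025), confidence_quantile(0.975))
print("C(0) =", confidence_cdf(0.0))
```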

I’d like to know which parts of Senn’s quote you think are erroneous.

It is not clear that Bayesians have to be frequentists at any stage. As shown here, you can be fully Bayesian when designing experiments.

Back to the problem at hand, here are my hypotheses:

  • Assumptions of Bayesian inference are easier to teach than sampling theory
  • Interpretation of Bayesian analysis is far easier to teach than frequentist analysis
  • Bayesian mechanics is harder to teach, but the advances in software over the past 3 years are stunning and help here
  • Cheating with Bayes is easier to detect
  • Senn’s quote about Bayesian inference being truly shocking if you understand it is spot on. It is so shocking that I know of many PhD statisticians with > 10 years of real experience who still confuse Pr(A|B) with Pr(B|A), not even realizing that they are engaging in transposed conditionals in their clinical trial interpretations. One very senior statistician said at a meeting that you have to know \alpha to know the chance that the approved drug doesn’t work. This is exactly what \alpha isn’t.
  • Once you accept Bayes as a standard for quantifying evidence for simple assertions (e.g., the treatment works in the right direction) you immediately gain an easy way to answer what’s really needed, e.g., Pr(treatment benefits the majority of 5 clinical endpoints). That is impossible to do with a frequentist approach. (Both this and the \alpha point above are illustrated in the simulation sketch after this list.)
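
A minimal simulation sketch of the last two bullets, using made-up numbers rather than any real trial: the first part shows why \alpha says nothing about the chance that an approved drug doesn’t work (the transposed conditional), and the second shows how posterior draws make Pr(benefit on a majority of endpoints) a one-line computation.

```python
# Minimal sketch with hypothetical numbers, illustrating two points above.
import numpy as np

rng = np.random.default_rng(1)

# --- (1) alpha vs. Pr(drug doesn't work | "significant" result) --------------
# Suppose (hypothetically) 10% of candidate drugs truly work, alpha = 0.05,
# and power = 0.80 for the drugs that do work.
prior_works, alpha, power = 0.10, 0.05, 0.80
p_sig = prior_works * power + (1 - prior_works) * alpha       # Pr(p < alpha)
p_null_given_sig = (1 - prior_works) * alpha / p_sig          # transposed conditional
print(f"Pr(drug doesn't work | significant) = {p_null_given_sig:.2f} (alpha = {alpha})")
# ~0.36 here, not 0.05: alpha is Pr(significant | no effect), not the reverse.

# --- (2) Pr(treatment benefits the majority of 5 endpoints) ------------------
# Pretend these are posterior draws of 5 treatment effects (benefit = effect > 0)
# from some fitted Bayesian model; here they are simply simulated for illustration.
draws = rng.normal(loc=[0.3, 0.1, 0.05, 0.2, -0.1], scale=0.15, size=(10_000, 5))
pr_majority = np.mean((draws > 0).sum(axis=1) >= 3)
print(f"Pr(benefit on >= 3 of 5 endpoints) = {pr_majority:.2f}")
```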

Quick update:
After the group’s meeting and consideration of the RG and datamethods discussions, the consensus reached was to call the interval a sub-divergent interval (sDI) in lieu of a non-divergent interval. In summary, the linguistic usage is as follows: statistically divergent implies disagreement with the target model as the source of the data at the set divergence threshold, e.g., p = 0.05 (5th percentile location). Statistically sub-divergent implies that the set divergence threshold has not been met. Divergence refers to the degree to which the model diverges from the data and has usually been trichotomized into no, weak, and strong divergence, with cutoffs at p = 0.1 (10th percentile location) and p = 0.01 (1st percentile location). The preprint linked in the first post has been updated.
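
For concreteness, here is a minimal sketch (my own, not taken from the preprint) of how the proposed wording would map onto a reported p value, using the thresholds stated above; it is meant only to illustrate the labels, not to endorse thresholding.

```python
def divergence_label(p: float, threshold: float = 0.05) -> str:
    """Map a p value to the qualitative labels proposed above.

    Uses the divergence threshold (default 0.05) and the 0.10 / 0.01 cutoffs
    for no / weak / strong divergence stated in the post. Illustration only.
    """
    status = "statistically divergent" if p < threshold else "statistically sub-divergent"
    if p >= 0.10:
        degree = "no divergence"
    elif p >= 0.01:
        degree = "weak divergence"
    else:
        degree = "strong divergence"
    return f"{status} ({degree})"

print(divergence_label(0.03))   # statistically divergent (weak divergence)
print(divergence_label(0.20))   # statistically sub-divergent (no divergence)
```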

This is extremely problematic in my opinion. You are getting into dichotomania, adding confusing new terminology (what’s wrong with compatibility?), not solving the root problem, and inventing new words (the proper word is tritomy; think of the Greek root of dichotomy = dicho (two) + tomy (cut)). I personally would think more deeply about how I want to spend my time.


Frank, none of this is of our making - everything already exists, including the arbitrary cutoffs, the 95% interval, and the p value. All we are attempting is to change the language, and the only input we have is the absence of significance and confidence and the introduction of percentile interpretations. Also, this is a consensus view that leaves everything as similar as possible except the language, so as much as I would like to drop p values and bring in real probabilities, the group does not believe that will change anything for doctors. There is nothing wrong with compatibility at all, but the group believes that we need to unify the terminology. I have a different view and so does Paul Glasziou (he thinks ‘uncertainty’ is good, just as Gelman proposes), but I think the residents and students are who we should listen to, because experts have had their say for over a decade and nothing has changed at all. Perhaps when significance has finally been retired we can come back to other aspects, but it won’t be retired by telling physicians that they need to move on, as that has been tried for more than a decade and nothing has changed. I don’t know if this will work, but it is different from what everyone else has tried to do. Also, it’s a very simple proposal, so not much time is required on this except for the discussions, which have been very helpful. I agree that non-divergent was a step too far, even though it was exactly in line with the significant / non-significant divide we see so often today (I am not saying that it is right). Finally, trichotomized is a term I borrowed from Andrew Gelman (cited in the preprint).

Don’t borrow incorrect terms. I still think there are so many more valuable projects to work on. And when you transform a problematic measure the transformation process will not fix the underlying issue. Finally, I think it is a grand mistake to think that interpretations will be improved by this. Signing off for now, glad to have had a chance to express my opinions.


This 2023 tutorial on Bayesian methods is written at a level that should be understandable for most physicians:

https://link.springer.com/article/10.1007/s43441-023-00515-3

Any statistical explanations more complicated than this are likely to lose the vast majority of a clinical audience.

There’s a long way to go in educating clinicians, but the process has to start somewhere. Arguably, the first order of business is to acknowledge that, given competing educational priorities, very few of us are going to become expert in understanding any statistical method:

Nobody likes to admit when they don’t understand something and the idea of “outsourcing” appraisal of the evidence base that underpins our daily practice will make some people anxious. But just as we regularly turn to our specialist clinical colleagues when we are out of our depth and discourage patients from self-diagnosing based on Internet searches, physicians need to stick to what we do best.

Physicians will be most helpful to patients by honing our expertise in history-taking, physical exam, differential diagnosis, logical and targeted investigation, and managing uncertainty. Once a diagnosis is made, the treatments that we offer preferentially should be those that have strong supportive evidence from trials that were designed and analyzed with input from expert applied statisticians AND which reflect the patient’s unique individual context and values. This is a lot to do…

For most physicians, a basic understanding of the differences between common statistical methods will suffice. With the exception of physician scientists, most will not benefit from acquiring a detailed understanding of how the Bayesian sausage is made. Importantly though, physicians would be justified, given countless examples of botched frequentist study design and interpretation, in worrying that even established applied statisticians might not accurately gauge their own expertise in applying Bayesian methods. To this end, clinicians will need to be able to easily access a pool of statisticians with deep proven expertise in Bayesian methods to help critically appraise any new wave of studies using these designs. Given the paucity of such experts, the most efficient places for their input would be as far “upstream” as possible in the knowledge creation and dissemination process: 1) in the design of clinical trials; and 2) in the vetting of Bayesian trials that get cited in commonly used physician resources (e.g., UpToDate; practice guidelines).


This paper (linked in the first post) has now been reviewed by a prominent internal medicine journal, and we had two reviewers - one positive and one negative. What is of interest to report here is that one of the reviewers felt that we were advocating dichotomy or trichotomy of p values as a way to help alleviate misconceptions regarding p values and “statistical significance”, and the way this was stated finally made it clear to us exactly what the problem was with the message we were delivering.

Yes, different intervals of p values have been named, but these are nothing more than qualitative labels for an interval. For example, a p < alpha has the qualitative label “statistically divergent” - this is just a name for a p value in this range and has nothing to do with us advocating that we should categorize or bin p values. We are just saying that, since qualitative labels are the norm, please use a correct label that cannot be incorrectly used. We are not advocating categorization - we are advocating renaming the labels.

As an example from clinical medicine, serum hemoglobin below normal has qualitative labels: mild anemia (hemoglobin 10.0 g/dL to the lower limit of normal), moderate anemia (8.0 to 10.0 g/dL), and severe anemia (6.5 to 7.9 g/dL). Do we then say that clinicians are guilty of dichotomizing or trichotomizing hemoglobin? These are just convenient terms that immediately allow inferences to be clear - shall I call the blood bank and say I need blood because the patient has a hemoglobin of 6, or because the patient has severe anemia? The 6 is already on the request - I do not need to say it, as the label is more efficient and conveys my meaning. Shall we stop labeling hemoglobin levels? That is a different question and not within the scope of our paper.

Anyway, the reviews have led us to a major understanding of what was unclear, and we will be updating the preprint soon to clarify all of this.

Pointing out silliness that occurs when clinicians use thresholds with continuous measurements does not alleviate criticism of your proposed method for doing the same. Any method that has built-in thresholds for an underlying continuous process is very problematic. This is especially true when the thresholds are on the randomness scale, and not on a real direct evidence scale such as what Bayesian posterior probabilities provide.
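
As an illustration of that contrast (a minimal sketch with hypothetical counts and flat Beta(1, 1) priors, not any particular trial), the quantity below is a direct statement about the treatment effect rather than about the behavior of a test statistic under a null model:

```python
# Minimal sketch, hypothetical counts: posterior probability that treatment
# lowers the event rate, computed from conjugate Beta posteriors.
import numpy as np

rng = np.random.default_rng(0)

events_t, n_t = 30, 200   # hypothetical treatment arm
events_c, n_c = 45, 200   # hypothetical control arm

# Posterior draws of each arm's event probability under a Beta(1, 1) prior.
p_t = rng.beta(1 + events_t, 1 + n_t - events_t, size=100_000)
p_c = rng.beta(1 + events_c, 1 + n_c - events_c, size=100_000)

print(f"Pr(treatment reduces event rate | data) = {np.mean(p_t < p_c):.3f}")
```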


I think we have been misunderstood - we are not proposing the thresholds, since they already exist and have names for the ranges induced by the threshold boundaries. All we are saying is that if threshold-based intervals are to be given a name, please give them a sensible name. If anyone can succeed in replacing such intervals with something continuous and acceptable to the end user, then I am with them.

Unfortunately this perpetuates bad practice IMHO.

Update: Let me say that better. Having thresholds for intervals is not optimal, but it is not as problematic as having thresholds for probabilities such as p-values. I need to keep in mind that you are emphasizing intervals.
