Time for a change in language: statistical divergence and the 95% non-divergent interval (95%nDI)

Dear Dr. Doi,

I am glad that you found my suggestions useful. I don’t fully agree with everything you wrote, but I completely share and support your intention to find the right balance between accuracy and effective communication.

In this regard, I feel compelled to emphasize one point: you write

“[…] what we want them to do is to ask the simple question “divergence from what?”

I use your argument, with which I agree, to reinforce the following point regarding less-divergence intervals: what we want them to do is to ask the simple question “less divergent than what?” (answer: less divergent - from the experimental data - than the hypotheses outside the interval, as assessed by the chosen model). And, subsequently, to ask the question “how much less divergent?” (answer: P ≥ 0.05 for the 95% case, P ≥ 0.10 for the 90% case, etc.)

Regarding the S-values, I understand the intent to avoid overcomplicating things. Nonetheless, I suggest reading the section “A less innovative but still useful approach” in a recent paper I co-authored with Professors Mansournia and Vitale (we have to thank @Sander , @Valentin_Amrhein , and Blake McShane for their essential comments on this point) [1]. Specifically, what concerns me is correctly quantifying the degree of divergence, which can be explained intuitively without the underlying formalization of the S-values***.

All the best and please keep us updated,
Alessandro

*** P.S. What is shown in Table 2 of the paper can easily be applied to individual P-values by considering their ratio with P = 1 (the P-value for the non-divergent hypothesis). For example, P = 0.05 → 1/0.05 = 20. Since 20 lies between 16 = 2^4 and 32 = 2^5, we know that the observed statistical result is more surprising than 4 consecutive heads when flipping a fair coin but less surprising than 5 consecutive heads (compared with the target hypothesis, conditional on the background assumptions).
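For readers who prefer to see the arithmetic spelled out, here is a minimal sketch (Python, purely illustrative and not taken from the paper) of the conversion from a P-value to its binary surprisal, S = log2(1/P), and of the coin-flip reading used above:

```python
import math

def s_value(p):
    """Binary surprisal (S-value) of a P-value: S = log2(1/p) bits."""
    return math.log2(1 / p)

p = 0.05
s = s_value(p)  # about 4.32 bits
print(f"P = {p}  ->  1/P = {1/p:.0f}  ->  S = {s:.2f} bits")
# Interpretation: the result is more surprising than 4 consecutive heads
# from a fair coin (S = 4) but less surprising than 5 (S = 5),
# relative to the target hypothesis and the background assumptions.
```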

REFERENCES

[1] https://doi.org/10.1016/j.gloepi.2024.100151


Thanks for the update. This is indeed a good way of thinking about the interpretation of the p value. What do you think about the location interpretation and saying that the test statistic from the data had a 5th percentile score under the target model? Do you see this as equally intuitive?

The other points you have raised are being discussed and we will see what the consensus is.

Dear Dr. Doi,

The point I wish to emphasize here is not only the intuitive aspect (which, obviously, has a margin of subjectivity) but the statistical information contained. Indeed, in the descriptive approach (highly advisable in the soft sciences), the interpretation of the P-value as “the probability of obtaining - in numerous future equivalent experiments in which chance is the sole factor at play - test statistics as or more extreme than the one observed in the experiment” falls away (since it is implausible - if not impossible - to guarantee that equivalence). All the P-value can ‘tell us’ in this descriptive sense is how surprised we should be by a numerical result compared with the model’s prediction.

In this regard, since information scales logarithmically rather than linearly, using the P-value ratio also distorts the information contained. For example, P1 = 0.02 might seem to indicate a much rarer result than P2 = 0.08 (since P2/P1 = 4); however, the ‘difference in surprise’ between these two events is equivalent to only 2 consecutive heads when flipping a fair coin.
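To make the contrast concrete (my own illustration, using the same S-value conversion as above), compare the P-value ratio with the difference in surprisal:

```python
import math

p1, p2 = 0.02, 0.08
ratio = p2 / p1                                # 4.0: looks like a large difference
s1, s2 = math.log2(1 / p1), math.log2(1 / p2)  # about 5.64 and 3.64 bits
print(f"P-value ratio: {ratio:.1f}")
print(f"S-values: {s1:.2f} vs {s2:.2f} bits; difference = {s1 - s2:.2f} bits")
# The difference in surprise is log2(p2/p1) = 2 bits,
# i.e. only 2 extra consecutive heads from a fair coin.
```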

Therefore, I don’t believe that the expression “the test statistic from the data had a 5th percentile score under the target model” is sufficiently intuitive from this standpoint (although it accurately expresses the location on the curve of the associated distribution). Indeed, the key question remains: how much refutational, statistical information is contained in that percentile?

A possible compromise could be to publish a pre-study protocol defining various ranges of compatibility/surprise (I recently published on this topic: https://doi.org/10.18332/pht/189530) for P-values (attached image). Obviously, like you and your colleagues, I am also experimenting to find a solution to the misuse of statistical significance.

Based on my knowledge and personal impressions, I find your idea valid, provided that the issues I raised are adequately addressed (not necessarily through my suggestions alone, but also through other, perhaps better, alternative approaches!).

All the best,
Alessandro

[Attached image: proposed ranges of compatibility/surprise for P-values]


Thanks Dr Rovetta,

These are valid points, and the dilemma now is where we go from here regarding two main issues: naming the interval in language that is unambiguous in terms of interpretation, and using correct language regarding p values that makes the interpretation explicit and meaningful, as has been attempted in your Table above. I have discussed these threads with methodologists in the group (especially around naming the interval, although this has flow-on implications for what we call statistically ‘non-divergent’ too), and their recommendations vary considerably - understandably so, because we each have our own cognitive biases. The group also includes a large number of Y3 and Y4 medical students as well as doctoral students and residents, and I am meeting with them later today. I think the most pragmatic plan is to give them the options and leave the choice to them, as they represent the next generation of users of these terms. We could of course push this in a particular direction, but since we remain uncertain about what will eventually work, it is best that an unbiased group pitch in now that all the discussions have happened both here and on RG - which they are also reading - and generate ideas that are convincing. Once this meeting is held, if any changes are adopted, I will update the results here as well as in an update to the preprint on RG.

I hope you’ll give them Bayesian alternatives that will be simpler for them to understand and that don’t require inference or hypothesis testing. You can just think of playing the odds in decision making. You don’t have to conclude that a treatment works; you might act as though it works and compute the probability that you are taking the wrong action, i.e., one minus the posterior probability of efficacy.
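As a toy illustration of this “playing the odds” framing (my own sketch; the data and the independent Beta(1, 1) priors are assumptions chosen purely for illustration), the posterior probability of efficacy and its complement - the probability of taking the wrong action if you act as though the treatment works - can be read directly off posterior draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical trial data (made up): responders / patients in each arm.
resp_t, n_t = 30, 50
resp_c, n_c = 20, 50

# Independent Beta(1, 1) priors give Beta posteriors for each response probability.
theta_t = rng.beta(1 + resp_t, 1 + n_t - resp_t, size=100_000)
theta_c = rng.beta(1 + resp_c, 1 + n_c - resp_c, size=100_000)

p_efficacy = np.mean(theta_t > theta_c)  # posterior Pr(treatment better than control)
p_wrong_action = 1 - p_efficacy          # risk taken if we act as though it works

print(f"Pr(efficacy)     = {p_efficacy:.3f}")
print(f"Pr(wrong action) = {p_wrong_action:.3f}")
```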

Hi Suhail

For what it’s worth (probably not much :slight_smile: ), as a family physician without extensive training in epi/stats, I had trouble understanding the new proposal. I did, however, find the explanations in the following article to be quite clear. S values seem at least somewhat intuitive to non-experts:

I suspect that any concepts more complicated than S values will be hard to teach to an audience without a graduate degree in epi or stats.

Bayesian presentations of trial results often, but not always (https://discourse.datamethods.org/t/multiplicity-adjustments-in-bayesian-analysis/2649), feel easier to grasp. But I’m not sure that there are enough instructors who are expert enough with these methods (“how the sausage is made”) to teach them correctly to a clinical audience (?) More worryingly, until a sufficient number of researchers deeply understand these methods, there’s a risk that they will produce studies that are ostensibly easier to interpret, yet clinically misleading (though there’s already plenty of bad practice in presenting frequentist results). Not an easy problem…


I feel that way of thinking is not helpful, because they misunderstand traditional statistics even worse. For example, many clinical trialists and NEJM editors still don’t understand that a large P value does not mean the treatment doesn’t work.


Hi Frank and Erin,
For now I am sticking with terminology around p values and CIs because it is so badly (mis)embedded in thinking about EBM that this is the priority. Perhaps the time will come to extend this to the ways people reason from data - once the terminology has become clear. I agree that the association of a large p value with no effect is very common in medicine, but it is more a matter of expectation than anything to do with reasoning from data. This is so deeply misunderstood that even saying it would be akin to saying that a normal serum sodium concentration is 120 mmol/L. People will just get up and walk out of the room. The difference is that, in the case of p values, they do not know that they don’t know, because a bad choice of terms continually reinforces what is wrong.

P-values and null hypothesis testing are defective concepts. That’s why we go round and round with workarounds, alternate wording, etc. Time to give up. Think of what a posterior probability P and 1-P give you - a built-in real error probability.


Whether these concepts are intrinsically defective seems to be a point of endless debate among experts. But there’s no arguing that any method/concept that is so convoluted/counterintuitive that it is still misapplied/misinterpreted by a large fraction of researchers and readers, even after decades of use, is practically defective.

Bayesian framing of study results often feels more straightforward and less vulnerable to misinterpretation by non-expert readers. For this reason, it’s heartening to see more publications using these methods. But I do wonder about the extent to which Bayesian methods themselves are vulnerable to misapplication by suboptimally trained researchers (?) Any advantage that Bayesian methods might have with regard to ease of interpretation of results could easily be nullified by inexpert Bayesian study design.

Questions:

  1. To what extent are Bayesian methods currently emphasized within undergraduate and graduate statistics programs?

  2. What proportion of already-credentialed professional applied statisticians would you estimate are competent at designing trials using Bayesian methods?

  3. Assuming that a sufficient proportion of professional statisticians becomes expert in Bayesian methods by a certain future date, do you foresee that this teaching could, in turn, be propagated correctly to those who usually provide instruction to a clinical audience? Often, those who instruct clinicians on critical appraisal have some epidemiology or stats training, but primarily focus on clinical practice.

  4. If the answer to question 3 above is “no,” then do you anticipate that clinicians will simply need to concede ignorance on certain methodologic aspects of critical research appraisal (arguably, many of us already act like we understand frequentist methods when we really don’t…)? Would it be best for clinicians to limit their role in the design and appraisal of studies using Bayesian methods to certain aspects of the process (e.g., prior elicitation?), and to forgo any attempt to understand the nittier/grittier parts?


Frank what do you think about this comment from Senn:

I am in perfect agreement with Gelman’s strictures against using the data to construct the prior distribution. There is only one word for this and it is ‘cheating’. But I would go further and draw attention to another point that George Barnard stressed and that is that posterior distributions are of little use to anybody else when reporting analyses. Thus if you believe (as I do) that far more use of Bayesian decision analysis should be made in clinical research it does not follow that Bayesian analyses should be used in reporting the results. In fact the gloomy conclusion to which I am drawn on reading (de Finetti 1974) is that ultimately the Bayesian theory is destructive of any form of public statistics. To the degree that we agreed on the model (and the nuisance parameters) we could communicate likelihoods but in practice all that may be left is the communication of data. I am thus tempted to adapt Niels Bohr’s famous quip about quantum theory to say that ‘anyone who is not shocked by the Bayesian theory of inference has not understood it’.

RE: Senn’s reported quote. Can you provide a source for this, or is it a private communication?

FWIW, I’d strongly disagree with it, and I’m surprised an experienced statistician such as Senn would express it.

The problem with this attitude is that the only frequentist procedures we care about are the “admissible” ones, which coincide with the Bayesian procedures.

Whether they like it or not, Frequentists (to the degree they are rational) are Bayesian. It is also true that Bayesians (when they are planning experiments) are also Frequentists.

As for the claim that “ultimately the Bayesian theory is destructive of any form of public statistics”:

I’d argue the existence of public options markets is an obvious and direct refutation of this claim. All sorts of intangible (aka “subjective”) information is incorporated into the price of assets and risk transfer mechanisms that permit the very scientific apparatus statisticians are a part of to be funded.

Seymour Geisser noted long ago in:

Geisser, S. (1975). Predictivism and sample reuse. University of Minnesota. (link)

The fundamental thesis of this paper is that the inferential emphasis of Statistics, theory and concomitant methodology, has been misplaced. By this is meant that the preponderance of statistical analyses deals with problems which involve inferential statements concerning parameters. The view proposed here is that this stress should be diverted to statements about observables. …

This stress on parametric inference made fashionable by mathematical statisticians has been not only a comfortable posture but also a secure buttress for the preservation of the high esteem enjoyed by applied statisticians because exposure by actual observation in parametric estimation is rendered virtually impossible. Of course those who opt for predictive inference i.e. predicting observables or potential observables are at risk in that their predictions can be evaluated to a large extent by either further observation or by a sly client withholding a random portion of the data and privately assessing a statistician’s prediction procedures and perhaps concurrently his reputation. Therefore much may be at stake for those who adopt the predictivistic or observabilistic or aparametric view.

Geisser points out in the introduction to his book Predictive Inference that:

…only the Bayesian mode is always capable of producing a probability distribution for prediction (my emphasis).

This implies to me that Bayesian methods (whether Subjective Bayesian that Senn is implicitly condemning or Objective Bayes, which is the best a Frequentist can hope for without good prior information) are more receptive to public challenge and verification than the limited frequentist methods taught in the B.C. era (Before Computing).

The caricature of frequentist methods taught in medical statistics is, at best, a funhouse image of legitimate non-Bayesian procedures (i.e., those which emphasize the generation of a valid likelihood function or a set of coverage intervals, which are isomorphic) that focus on protocols for data collection in the case of a priori uncertainty. An important step forward in presenting frequentist theory (via the construction of “confidence distributions”) is:

I recommend it highly, whether your statistical philosophy is Bayesian or not.

I’d like to know which parts of Senn’s quote you think are erroneous.

It is not clear that Bayesians have to be frequentists at any stage. As shown here, you can be fully Bayesian when designing experiments.

Back to the problem at hand, here are my hypotheses:

  • Assumptions of Bayesian inference are easier to teach than sampling theory
  • Interpretation of Bayesian analysis is far easier to teach than frequentist analysis
  • Bayesian mechanics is harder to teach, but the advances in software over the past 3 years are stunning and help here
  • Cheating with Bayes is easier to detect
  • Senn’s quote about Bayesian inference being truly shocking if you understand it is spot on. It is so shocking that I know many PhD statisticians with > 10 years of real experience who still confuse Pr(A|B) with Pr(B|A), not even realizing that they are engaging in transposed conditionals in their clinical trial interpretations. One very senior statistician said at a meeting that you have to know \alpha to know the chance that the approved drug doesn’t work. This is exactly what \alpha isn’t.
  • Once you accept Bayes as a standard for quantifying evidence for simple assertions (e.g. the treatment works in the right direction) you immediately gain the benefits of an easy way to answer what’s really needed, e.g., Pr(treatment benefits the majority of 5 clinical endpoints). That is impossible to do with a frequentist approach (a minimal sketch of such a calculation follows this list).
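For instance, here is a minimal sketch of that last calculation. The posterior draws are simulated here purely for illustration; in a real analysis they would come from the fitted Bayesian model’s MCMC output, and “benefit” is taken as an effect below zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws (4000 each) of the treatment effect on 5 endpoints,
# e.g. log hazard ratios where negative values favor treatment.
draws = rng.normal(loc=[-0.20, -0.10, -0.05, 0.02, -0.15],
                   scale=0.10, size=(4000, 5))

benefit = draws < 0                              # benefit on each endpoint, per draw
p_each = benefit.mean(axis=0)                    # Pr(benefit) for each endpoint separately
p_majority = (benefit.sum(axis=1) >= 3).mean()   # Pr(benefit on at least 3 of 5 endpoints)

print("Pr(benefit) per endpoint:", np.round(p_each, 3))
print("Pr(benefit on a majority of endpoints):", round(p_majority, 3))
```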

Quick update:
After the group’s meeting and consideration of the RG and datamethods discussions, the consensus reached was to call the interval a sub-divergent interval (sDI) in lieu of a non-divergent interval. In summary, the linguistic use is as follows: statistically divergent implies disagreement with the target model as the source of the data at the set divergence threshold, e.g. p = 0.05 or the 5th percentile location. Statistically sub-divergent implies that the set divergence threshold has not been met. Divergence refers to the degree to which the model diverges from the data and has usually been trichotomized into no, weak and strong divergence, with cutoffs at p = 0.1 (10th percentile location) and p = 0.01 (1st percentile location). The preprint linked in the first post has been updated.
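For concreteness only, a small sketch of how the proposed labels would map onto p-values, assuming a two-sided p-value, alpha = 0.05, and the cutoffs stated above (my own illustration, not part of the preprint):

```python
def divergence_labels(p, alpha=0.05):
    """Illustrative mapping of a p-value to the proposed qualitative labels.
    The cutoffs (alpha = 0.05; 0.1 and 0.01 for the trichotomy) are those
    stated in the post, not fixed features of the proposal."""
    dichotomy = "statistically divergent" if p < alpha else "statistically sub-divergent"
    if p < 0.01:
        trichotomy = "strong divergence"
    elif p < 0.10:
        trichotomy = "weak divergence"
    else:
        trichotomy = "no divergence"
    return dichotomy, trichotomy

for p in (0.004, 0.03, 0.20):
    print(p, divergence_labels(p))
```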

This is extremely problematic in my opinion. You are getting into dichotomania, adding confusing new terminology (what’s wrong with compatibility?), not solving the root problem, and inventing new words (the proper word is tritomy; think of the Greek root of dichotomy: dicho (two) + tomy (cut)). I personally would think deeper about how I want to spend my time.


Frank, none of this is of our making - everything already exists, including the arbitrary cutoffs, the 95% interval and the p value. All we are attempting is to change the language, and the only input we have is the removal of significance and confidence and the introduction of percentile interpretations. Also, this is a consensus view that leaves everything as similar as possible except the language, so as much as I would like to drop p values and bring in real probabilities, the group does not believe that will change anything for doctors. There is nothing wrong with compatibility at all, but the group believes that we need to unify the terminology. I have a different view and so does Paul Glasziou (he thinks ‘uncertainty’ is good, just as Gelman proposes), but I think the residents and students are who we should listen to, because the experts have had their say for over a decade and nothing has changed at all. Perhaps when significance has finally been retired we can come back to other aspects, but it won’t be retired by telling physicians that they need to move on, as that has been tried for more than a decade and nothing has changed. I don’t know if this will work, but it is different from what everyone else has tried. Also, it’s a very simple proposal, so not much time is required on this beyond the discussions, which have been very helpful. I agree that non-divergent was a step too far, even though it was exactly in line with the significant/non-significant divide we see so often today (I am not saying that it is right). Finally, trichotomized is a term I borrowed from Andrew Gelman (cited in the preprint).

Don’t borrow incorrect terms. I still think there are so many more valuable projects to work on. And when you transform a problematic measure the transformation process will not fix the underlying issue. Finally, I think it is a grand mistake to think that interpretations will be improved by this. Signing off for now, glad to have had a chance to express my opinions.


This 2023 tutorial on Bayesian methods is written at a level that should be understandable for most physicians:

https://link.springer.com/article/10.1007/s43441-023-00515-3

Any statistical explanations more complicated than this are likely to lose the vast majority of a clinical audience.

There’s a long way to go in educating clinicians, but the process has to start somewhere. Arguably, the first order of business is to acknowledge that, given competing educational priorities, very few of us are going to become expert in understanding any statistical method:

Nobody likes to admit when they don’t understand something and the idea of “outsourcing” appraisal of the evidence base that underpins our daily practice will make some people anxious. But just as we regularly turn to our specialist clinical colleagues when we are out of our depth and discourage patients from self-diagnosing based on Internet searches, physicians need to stick to what we do best.

Physicians will be most helpful to patients by honing our expertise in history-taking, physical exam, differential diagnosis, logical and targeted investigation, and managing uncertainty. Once a diagnosis is made, the treatments that we offer preferentially should be those that have strong supportive evidence from trials that were designed and analyzed with input from expert applied statisticians AND which reflect the patient’s unique individual context and values. This is a lot to do…

For most physicians, a basic understanding of differences between common statistical methods will suffice. With the exception of physician scientists, most will not benefit from acquiring a detailed understanding of how the Bayesian sausage is made. Importantly though, physicians would be justified, given countless examples of botched frequentist study design and interpretation, in worrying that even established applied statisticians might not accurately gauge their own expertise in applying Bayesian methods. To this end, clinicians will need to be able to easily access a pool of statisticians with deep proven expertise in Bayesian methods to help critically appraise any new wave of studies using these designs. Given the paucity of such experts, the most efficient places for their input would be as far “upstream” as possible in the knowledge creation and dissemination process: 1) in the design of clinical trials; and 2) in the vetting of Bayesian trials that get cited in commonly-used physician resources (e.g., UptoDate; practice guidelines).


This paper (linked in the first post) has now been reviewed by a prominent internal medicine journal and we had two reviewers - one positive and one negative. What is of interest to report here is that one of the reviewers felt that we were advocating dichotomy or trichotomy of p values as a way to help alleviate misconceptions regarding p values and “statistical significance” and the way this was stated finally made it clear to us what exactly was the problem with the message we were delivering.

Yes, different intervals of p values have been named, but these are nothing more than qualitative labels for an interval. For example, a p < alpha has the qualitative label “statistically divergent” - this is just a name for a p value in this range and has nothing to do with us advocating that we should categorize or bin p values. We are just saying that, since qualitative labels are the norm, please use a correct label that cannot be incorrectly used. We are not advocating categorization - we are advocating renaming the labels.

As an example from clinical medicine, serum hemoglobin below normal has qualitative labels: mild anemia, hemoglobin 10.0 g/dL to the lower limit of normal; moderate anemia, hemoglobin 8.0 to 10.0 g/dL; severe anemia, hemoglobin 6.5 to 7.9 g/dL. Do we then say that clinicians are guilty of dichotomizing or trichotomizing hemoglobin? These are just convenient terms that immediately make the inference clear - shall I call the blood bank and say I need blood because the patient has a hemoglobin of 6, or because the patient has severe anemia? The 6 is already on the request - I do not need to repeat it, as the label is more efficient and conveys my meaning. Should we stop labeling hemoglobin levels? That is a different question and not in the scope of our paper.

Anyway, the reviews have given us a much better understanding of what was unclear, and we will be updating the preprint soon to clarify all of this.

Pointing out silliness that occurs when clinicians use thresholds with continuous measurements does not alleviate criticism of your proposed method for doing the same. Any method that has built-in thresholds for an underlying continuous process is very problematic. This is especially true when the thresholds are on the randomness scale, and not on a real direct evidence scale such as what Bayesian posterior probabilities provide.
