Effect sizes for power and evidence for effectiveness

f2harrell · July 5, 2024, 2:43pm

The minimally clinically important difference (MCID) is frequently used to judge the value of medical treatments as well as to compute power in a frequentist design. It is reasonable to compute a sample size of a randomized trial so that there is a high probability (e.g., 0.9) that the null hypothesis of no treatment effect can be rejected when the true effect equals the MCID.

Bayesian effectiveness assessment is much different from frequentist assessments. One often captures evidence for benefit by computing the posterior probability that the treatment effect \Delta is in the right direction given the data and the prior, e.g., \Pr(\Delta > 0 | \text{data, prior}). For some studies, one may want to have evidence that the treatment effect exceeds the MCID, so it is reasonable to compute \Pr(\Delta > \delta | \text{data, prior}) where MCID=\delta. But what if a standard prior is used, the sample size is huge, and the posterior mean/median/mode for \Delta equals \delta? The posterior probability of \Delta > \delta will be about 0.5, whereas many clinicians would consider the treatment worthwhile.
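
To make the 0.5 problem concrete, here is a minimal sketch. It assumes the posterior for \Delta is normal and centered exactly at \delta; the MCID value, the posterior SDs, and the helper name `prob_exceeds` are all hypothetical choices for illustration.

```python
from statistics import NormalDist

def prob_exceeds(post_mean, post_sd, threshold):
    """Pr(Delta > threshold | data, prior) under an assumed normal posterior."""
    return 1.0 - NormalDist(post_mean, post_sd).cdf(threshold)

delta = 5.0  # hypothetical MCID in arbitrary outcome units

# Shrinking posterior SDs stand in for growing sample sizes.
rows = []
for post_sd in (2.0, 0.5, 0.1):
    p_any = prob_exceeds(delta, post_sd, 0.0)     # Pr(Delta > 0)
    p_mcid = prob_exceeds(delta, post_sd, delta)  # Pr(Delta > delta)
    rows.append((post_sd, p_any, p_mcid))
    print(f"SD={post_sd}: Pr(Delta>0)={p_any:.3f}  Pr(Delta>delta)={p_mcid:.3f}")
```

However small the posterior SD becomes, \Pr(\Delta > \delta) stays at one half when the posterior is centered on \delta, while \Pr(\Delta > 0) approaches 1.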

For Bayesian power calculations, it is often best to allow for uncertainty in MCID by examining evidence over a distribution of MCID. But it may be reasonable to desire a probability greater than 0.9 (Bayesian power) that the posterior probability will exceed a decision threshold (e.g., 0.95) were \Delta to exactly equal \delta.
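
A rough simulation sketch of this kind of Bayesian power follows. It rests on assumptions not stated in the post: a two-arm trial with a normal outcome, known SD `sigma`, a flat prior (so the posterior for \Delta is normal around the observed difference), and the posterior probability of interest taken to be \Pr(\Delta > 0). The function name `bayes_power` and all numeric inputs are hypothetical.

```python
import random
from statistics import NormalDist

def bayes_power(delta, sigma, n_per_arm, threshold=0.0, cutoff=0.95,
                n_sim=20000, seed=1):
    """Fraction of simulated trials in which Pr(Delta > threshold | data)
    exceeds `cutoff` when the true effect equals `delta`.

    Assumes a flat prior and known sigma, so the posterior for Delta is
    Normal(observed difference, se^2)."""
    rng = random.Random(seed)
    se = sigma * (2.0 / n_per_arm) ** 0.5   # SE of a difference in two means
    hits = 0
    for _ in range(n_sim):
        d_hat = rng.gauss(delta, se)        # simulated observed difference
        post_prob = 1.0 - NormalDist(d_hat, se).cdf(threshold)
        hits += post_prob > cutoff
    return hits / n_sim

power = bayes_power(delta=5.0, sigma=10.0, n_per_arm=70)
print(f"Estimated Bayesian power: {power:.3f}")
```

With a flat prior and `threshold=0`, this reduces to ordinary one-sided frequentist power; an informative prior, or a threshold such as \epsilon, would change the answer. Allowing for uncertainty in the MCID would mean drawing `delta` from a distribution inside the loop rather than fixing it.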

Back to evidence for treatment effectiveness, it would be reasonable for one to compute \Pr(\Delta > \epsilon) where \epsilon < \delta. The value \epsilon could be called a threshold for a trivial treatment effect (TTTE) or a minimal noticeable (to the patient) or measurable treatment effect (MNTE). It may be reasonable to set \epsilon = \frac{\delta}{k} for some k such as 2 or 3. When the treatment effect is stated as an effect ratio (e.g., hazard or odds ratio), the division by k would be on the log ratio scale.

Once \epsilon is chosen one may compute \Pr(\Delta > \epsilon) to measure evidence for a non-trivial treatment effect. In this way, an observed difference at MCID=\delta will not result in an unimpressive posterior probability.
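
As an illustration of the effect of choosing \epsilon = \delta / k, the sketch below assumes a normal posterior centered at the MCID with a moderate posterior SD. The numbers and the helper name `trivial_threshold` are hypothetical.

```python
from statistics import NormalDist

def trivial_threshold(delta, k):
    """epsilon = delta / k, on a scale where effects are additive."""
    return delta / k

delta = 5.0                                  # hypothetical MCID
posterior = NormalDist(mu=delta, sigma=1.5)  # hypothetical posterior centered at the MCID

p_mcid = 1.0 - posterior.cdf(delta)          # Pr(Delta > delta)
results = {}
for k in (2, 3):
    eps = trivial_threshold(delta, k)
    results[k] = 1.0 - posterior.cdf(eps)    # Pr(Delta > epsilon)
    print(f"k={k}: epsilon={eps:.2f}, Pr(Delta > epsilon)={results[k]:.3f}")
print(f"Pr(Delta > delta)={p_mcid:.3f}")
```

The same posterior that gives only one half for exceeding \delta gives a much larger probability of exceeding \epsilon, and larger k (smaller \epsilon) raises it further.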

I hope that we can formalize these ideas, coming up with general definitions for MCID and TTTE/MNTE that are useful when using Bayes to get direct evidence for treatment benefit.

As an aside, it would be a mistake to couch any of these measures in terms of how much patients improve from baseline. They should be couched in terms of what a parallel-group design does: estimates the difference in outcomes were the patient to get treatment A vs. were she to get treatment B.

Pavlos_Msaouel · July 7, 2024, 2:05am

This is great. I would add that in addition to full Bayes, which has the most mature literature (example webapp for RCT re-analysis here and datamethods thread here), the need to define the MCID and ε is also important for empirical Bayes (example webapp for oncology RCT re-analysis here and related preprint here), frequentist confidence distributions, as well as fiducial probability distributions and their more modern generalizations.

From a practical standpoint, in oncology we often define the MCID as hazard ratio (HR) = 0.8 as noted for example in this American Society of Clinical Oncology working group statement. This is a reasonable choice because, using the assumptions described here and the corresponding web calculator here, this will yield a milestone absolute risk reduction (ARR) of 5-10% under various baseline risks, e.g., in the adjuvant renal cell carcinoma scenario used in the related practical examples here and here.
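
For a rough sense of where the 5-10% range comes from, here is a sketch under a plain proportional hazards assumption, where milestone survival on treatment equals control survival raised to the power HR. This is an assumption made for illustration; the linked calculator may use different or additional assumptions. The function name `milestone_arr` and the control survival values are hypothetical.

```python
def milestone_arr(control_survival, hazard_ratio):
    """Absolute risk reduction at a milestone time, assuming proportional
    hazards so that S_treated = S_control ** HR."""
    treated_survival = control_survival ** hazard_ratio
    return treated_survival - control_survival

arrs = {}
for s0 in (0.3, 0.5, 0.7):  # hypothetical milestone survival in the control arm
    arrs[s0] = milestone_arr(s0, hazard_ratio=0.8)
    print(f"control survival {s0:.0%}: ARR = {arrs[s0]:.1%}")
```

Control-arm survival much closer to 0 or 1 gives a smaller ARR for the same HR, so the range depends on the baseline risks considered.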

Thus, on the logHR scale δ = ln(0.8). I favor calling ε the threshold for trivial treatment effect (TTTE) on the logHR scale and obtaining it by dividing δ by 2 and then exponentiating, which gives TTTE ≈ 0.894, pretty close to 0.9. One may also divide δ by 3, which yields TTTE ≈ 0.928 and may be useful for some purposes, e.g., interim analyses for futility. But I like using MCID = 0.8 and TTTE = 0.9 because the MCID nicely corresponds to the less stringent 0.8 and 1/0.8 = 1.25 bioequivalence limits often used by the WHO and the FDA, whereas the TTTE corresponds to the also commonly used more stringent 0.90 and 1/0.9 = 1.11 bioequivalence interval.
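
The arithmetic can be checked with a few lines. The function name `ttte_hazard_ratio` is made up for this sketch; the calculation is just exp(ln(MCID) / k).

```python
import math

def ttte_hazard_ratio(mcid_hr, k):
    """Divide the log hazard ratio MCID by k, then back-transform to the HR scale."""
    return math.exp(math.log(mcid_hr) / k)

ttte_k2 = ttte_hazard_ratio(0.8, 2)
ttte_k3 = ttte_hazard_ratio(0.8, 3)
print(f"k=2: TTTE = {ttte_k2:.3f}")
print(f"k=3: TTTE = {ttte_k3:.3f}")
```

Dividing on the log scale is the same as taking the k-th root of the hazard ratio, so k = 2 gives the square root of 0.8.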

f2harrell · July 7, 2024, 12:19pm

Thanks for adding so much to the discussion, Pavlos. I’d like to take this further by having the community define TTTE in more detail, perhaps stratified by continuous outcome vs. event outcomes. For continuous outcomes, I’m thinking of the triviality threshold as being either the smallest difference that can be reliably measured, or the smallest difference that the patient can notice or for which a physiologic measure can detect an improvement (e.g., in a behavioral intervention in patients with mild-moderate renal dysfunction how much change in eating behavior results in any reduction in serum creatinine). Systolic blood pressure (SBP) may be a good test case as we know more about that response variable than most. I think that SBP can be measured to within about 4 mmHg. Perhaps shifts in SBP can also be measured to within 4 mmHg, in which case TTTE may be defined as 4 mmHg.

Having a good systematic default way to tie TTTE to MCID is very useful, but more specificity would also help.

The bottom line is we need to have an agreeable threshold for evidence that a treatment effect is better than trivial that, coupled with say a 0.95 highest posterior density uncertainty interval and where MCID is positioned within that interval, we would be comfortable saying that a treatment will improve clinical practice. It’s important to note that the current state of practice completely divorces clinical from statistical significance except during the moment that a power calculation is being done.

Pavlos_Msaouel · July 7, 2024, 1:22pm

Note here the rule of thumb, often used in psychology and patient-reported outcomes, to define the MCID for continuous outcomes as the SD / 2. Counterarguments here. Many such outcomes would more appropriately be treated as ordinal but empirical strategies can be used to convert them to interval.

Note also the connection in this setting of the TTTE with the patient acceptable symptom state (PASS), defined as ‘the value beyond which patients consider themselves well’. This minimum change from average could potentially be used to define the TTTE in these scenarios and aligns with your definition as the smallest noticeable difference.

f2harrell · July 7, 2024, 4:56pm

I think of PASS as something that is maybe even larger than MCID. And I resist using SD as a scale. Scaling by SD means we scale by how patients disagree with each other (i.e., biologic variability) not by the absolute impact on one patient.

R_cubed · July 7, 2024, 7:18pm

Before accepting the claims of psychometricians at face value, I recommend a careful study of the scholarship of Joel Michell, a psychologist and critic of psychometric methodology as “pathological science”:

Michell, J. (2008). Is Psychometrics Pathological Science? Measurement: Interdisciplinary Research and Perspectives, 6(1–2), 7–24. https://doi.org/10.1080/15366360802035489

Pretty much all of his papers and responses to psychometric proponents are worth reading. Here is a response in Michell’s defense by mathematical psychologist Paul Barrett:

Barrett, P. (2008). The Consequence of Sustaining a Pathology: Scientific Stagnation—a Commentary on the Target Article “Is Psychometrics a Pathological Science?” by Joel Michell. Measurement, 6, 78-123. (PDF)

As Michell notes in his target article, the response from psychologists and psychometricians alike has been largely silence. From my own attempts to relay his thesis in papers, conferences, and academic seminars, I have been met with a mixture of disbelief, disinterest, anger, accusations of intellectual nihilism, ridicule, and generally a steadfast refusal to seriously question the facts. And, make no mistake, Michell deals with facts about measurement. The only coherent responses I have read where the author made a passable attempt at really engaging with Michell’s work was in the book Measurement: Theory and Practice by Hand (2004) a mathematical statistician and machine learning expert…

There are real-world consequences to his thesis, and we are living through them right now. The most important of these is that while psychometricians have advanced their thinking and technical sophistication in leaps and bounds over the past 40 years or so, the practical consequences have been almost nonexistent except in the domain of educational testing and various examination scenarios.

Outside of Barrett’s citations, the only other attempt I’ve seen to deal with this problem of needing interval/ratio quantities on inherently ordinal data is the work of Thomas Saaty – a mathematician who worked on multi-objective decision problems, and developed the analytic hierarchy process:

Saaty, T. L. (2008). Relative Measurement and Its Generalization in Decision Making Why Pairwise Comparisons are Central in Mathematics for the Measurement of Intangible Factors The Analytic Hierarchy/Network Process. Estadística, 102(2), 251-318.

More sources can be found in my old thread on treating ordinal as metric. Rasch is more sophisticated, but still has problems, that Michell addresses in other papers. From a stats POV, I don’t think these scores have the statistical properties the inventors think they do outside of very specific contexts.

https://discourse.datamethods.org/t/preprint-analysis-of-likert-type-data-using-metric-methods/5536

For a reasonable example of AHP in a health context, the following paper describes the creation of a Quality of Life questionnaire:

Prusak, A., & Sikora, T. (2017). THE APPLICATION OF THE ANALYTIC HIERARCHY PROCESS IN ASSESSING THE QUALITY OF LIFE. Center for Quality. (PDF)

davidcnorrismd · July 8, 2024, 11:51am

Just looking at this, and thinking it might be the way to formulate a larger problem within which to address the issues motivating this thread.

SimonKrenn · July 8, 2024, 3:40pm

Thank you for this thought-inspiring thread. Sorry for only asking (perhaps banal) questions. Maybe somebody more adept than myself can fill in some answers.

Why wouldn’t this measure of clinical relevance be a continuum from MNTE to MCID and beyond as opposed to thresholds?
If it were continuous, why wouldn’t it be non-linear?
Shouldn’t multiple treatments and trials be directly comparable by where their evidence falls on such a respective relevance curve?
Shouldn’t a cost-to-benefit ratio be considered beyond just effectiveness for one outcome to make comparisons more grounded in an economic reality?
Shouldn’t all these factors be allowed to change over time and by new evidence?

f2harrell · July 8, 2024, 8:16pm

You raise a good point. We need to distinguish between estimation on the clinical scale, for which everything is on an effect magnitude continuum, reporting the whole posterior distribution, which is also on a continuum, and quantifying evidence for effects that are large enough to change clinical practice. More than one threshold may be used for this, and you can show this as a continuous graph with y=\Pr(\text{effect exceeds threshold}) and x= threshold. But what started this topic is a need to settle on one threshold for emphasis, i.e., the main take-home message about evidence for (non-trivial) effectiveness.
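
A sketch of how the points for such a graph could be computed, assuming a normal posterior with hypothetical mean and SD (the function name `exceedance_curve` is also made up). In practice the probabilities would more likely come from posterior draws.

```python
from statistics import NormalDist

def exceedance_curve(post_mean, post_sd, thresholds):
    """Return (x, y) pairs with x = threshold and y = Pr(effect > x),
    under an assumed normal posterior."""
    post = NormalDist(post_mean, post_sd)
    return [(x, 1.0 - post.cdf(x)) for x in thresholds]

thresholds = [i * 0.5 for i in range(17)]          # 0.0, 0.5, ..., 8.0
curve = exceedance_curve(4.0, 1.5, thresholds)     # hypothetical posterior
for x, y in curve:
    print(f"threshold={x:4.1f}  Pr(effect > threshold)={y:.3f}")
```

Reading the curve at x = 0, x = \epsilon, and x = \delta gives the evidence for any effect, a non-trivial effect, and a clinically important effect from one display.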

kiwiskiNZ · July 8, 2024, 9:20pm

hmmm… defining TTTE as a proportion of MCID is arbitrary. Therefore, could this just end up causing problems as an arbitrary p-value threshold has?

If we are to define it, could it be related somehow to the precision with which we can measure outcomes? e.g. for surrogate outcomes related to the precision we can measure the surrogates, or for hard outcomes like mortality to how accurate we believe our data collection sources and methods are.

davidcnorrismd · July 9, 2024, 1:09am

If patients can notice and judge the effect for themselves, or if the effect can be measured objectively, then an N-of-1 trial of the medical treatment may well be a rational choice where either the objective treatment effects or patients’ evaluations of them are heterogeneous. The freedom to undertake such a trial could thus improve clinical practice, even if widespread, uncritical use of the drug would be harmful or wasteful.

My point would be that a perceived “need” to dichotomize on a threshold invariably places itself at odds with the individual patient.

Indeed, the ‘need’ for FDA OCE’s Project Optimus to steer clear of “the practice of medicine, that the FDA does not regulate” and confine itself to “advocating a [single] dose for a population” unraveled in an analysis [1] that took inter-individual heterogeneity into account, and had to be walked back by CDER.

Any doctrine in pharmaceutical biostats that supposes it necessary to abstract very far from the individual patient or to adopt a black-box model of clinical practice is on shaky ground.

[1] Norris DC. How large must a dose‐optimization trial be? CPT Pharmacom & Syst Pharma. Published online September 4, 2023:psp4.13041. doi:10.1002/psp4.13041

Pavlos_Msaouel · July 9, 2024, 6:05pm

As an update, here is now the preprint of the manuscript I referred to on the X / Twitter thread that started this topic. It uses k = 3 and δ = ln(0.8) to set ε and evaluate whether Bayesian early stopping rules using these effect sizes for oncology RCTs would be more efficient than the original frequentist analyses.

f2harrell · July 9, 2024, 10:35pm

Great to see this Pavlos. In the simulated trials I did here I used a slightly different setup where I stopped early for efficacy if there was a very high Pr(any effect) and a moderately high Pr(more than trivial effect). I also added a precision criterion. I stopped early for inefficacy if there was a very high probability of the true effect being trivial or worse. And I think posterior probability cutoffs should perhaps be bigger than what you used.
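
To show the shape of such a rule, here is a sketch assuming a normal posterior at the interim look. All cutoffs, the precision limit `max_post_sd`, and the function name `interim_decision` are hypothetical placeholders, not the values used in the simulations mentioned above.

```python
from statistics import NormalDist

def interim_decision(post_mean, post_sd, eps,
                     p_any_cut=0.95, p_nontrivial_cut=0.80,
                     p_trivial_cut=0.95, max_post_sd=2.0):
    """Apply efficacy / inefficacy stopping rules to a normal posterior.

    eps is the triviality threshold; every cutoff is a hypothetical default."""
    post = NormalDist(post_mean, post_sd)
    p_any = 1.0 - post.cdf(0.0)          # Pr(any effect)
    p_nontrivial = 1.0 - post.cdf(eps)   # Pr(more than trivial effect)
    p_trivial_or_worse = post.cdf(eps)   # Pr(effect is trivial or worse)

    precise_enough = post_sd <= max_post_sd  # precision criterion
    if p_any > p_any_cut and p_nontrivial > p_nontrivial_cut and precise_enough:
        return "stop for efficacy"
    if p_trivial_or_worse > p_trivial_cut:
        return "stop for inefficacy"
    return "continue"

print(interim_decision(post_mean=5.0, post_sd=1.0, eps=2.5))
print(interim_decision(post_mean=0.0, post_sd=1.0, eps=2.5))
print(interim_decision(post_mean=2.5, post_sd=1.5, eps=2.5))
```

Raising the posterior probability cutoffs makes early stopping less frequent; the operating characteristics of any particular choice would need to be checked by simulation.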

davidcnorrismd · July 12, 2024, 11:29am

I note the Working Group Statement you mention [Sci-Hub link] says the following:

The conclusions reached by the working groups are not intended to set standards for regulatory approval or insurance coverage but rather to encourage patients and investigators to demand more from clinical trials.

In that spirit, and amid the search for a ‘natural scale’ for TTTE that underlies many of the comments in this thread, I’d suggest that the issue of treatment burden should take primacy over whatever accidents of history or physiology have determined the current state of our outcome measurement technology.

Thus, the TTTE question might be an ideal point at which to involve patient advocates in trial design, and not a mere technical problem to be resolved ex ante by statisticians.

f2harrell · July 12, 2024, 11:33am

It’s a good point; on the other hand I think of TTTE as sometimes being related to noticeable physiologic response and to measurement error, whereas MCID might be more related to patient burden?

davidcnorrismd · July 12, 2024, 11:45am

“Trivial” sounds to me like a value judgment: “That sounds like a trivial benefit to me, Doctor, for all that driving back-and-forth to the infusion center.”

Maybe “Unmeasurable” better captures the more purely technical character of our current measurement capabilities? Maybe this all needs to begin with a taxonomy of various concerns, which could be rigorously analyzed and separated.

f2harrell · July 12, 2024, 2:54pm

I want to say “unnoticeable” but there is a better term out there somewhere.

davidcnorrismd · July 12, 2024, 4:33pm

Undetectable has an appropriately technical connotation, and concordance with usage in phrases like “lower limit of detection” etc.

Pavlos_Msaouel · July 12, 2024, 5:25pm

Good point re: detectable / undetectable.

kiwiskiNZ · July 14, 2024, 9:58pm

hmmm… While that is something like I was trying to get at previously, we are still arbitrary (the definition of LoD (& LoB) depends on an arbitrary value - see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2556583/ for those not familiar with it).
My 2c:
Unmeasurable - I don’t like this as everything can be measured, even if imprecisely.
Unnoticeable/Undetectable - must be relative to something.

Maybe “Uncertain Treatment Effect” which is of not obvious clinical benefit. Uncertain, because we are dealing with measurement uncertainty, and the qualification because it is always possible that our measurements are so poor, they can hide a clinical benefit. An example would be that prior to the advent of high-sensitivity troponin, the (contemporary as they are now called) troponin assays could not detect changes in troponin concentration below their LoDs that we now recognise as of clinical importance.