Dichotomization

I’m writing a paper about “An Empirical Assessment of the Cost of Dichotomization” together with Frank Harrell (@f2harrell) and Stephen Senn (@Stephen). In this paper, we quantify the information loss due to dichotomization in (clinical) trials, which is also called “responder analysis”. It’s not submitted yet, so we would very much appreciate your comments!

Many statisticians, including Frank and Stephen, have objected to dichotomization, see Categorizing Continuous Variables and Statistical Errors in the Medical Literature: Dichotomania. Unfortunately, so far, this has not had the desired result, as the majority of clinical trials continue to have low-information binary outcomes (Figure 2 below). We hope to add some new ways to dissuade clinical researchers from dichotomizing continuous outcomes.

We make use of the following mathematical fact (which is not new):

Suppose the continuous outcome has the normal distribution. Then both the standardized mean difference (SMD) and the probit transformation of the dichotomized outcome are estimators of Cohen’s d.

You can see this equivalence in Figure 1 below. We use it to calculate by how much the sample size may be reduced if the outcome were not dichotomized. We hope that this will motivate researchers during the planning phase not to dichotomize.

We also provide a method to calculate the loss of information after a responder analysis has been done. We hope that this will motivate researchers to abandon dichotomization in future trials.

Finally, we will use 21,435 unique trials from the Cochrane Database of Systematic Reviews to study the loss of information due to dichotomization empirically. We can clearly see that researchers do tend to increase the sample size to compensate for the low information content of binary outcomes, but not sufficiently. We show the loss of statistical power in Figure 5.

Cohen’s d, the SMD and the probit transformation

The Figure below is meant to explain the equivalence between Cohen’s d and the probit transformation. The top panel of the Figure shows two normal distributions with unit standard deviations. The distributions are shifted by some distance d (Cohen’s d). The shaded areas are the probabilities when we dichotomize at some cut-off. The bottom panel shows the associated quantiles of the standard normal distribution. The distance between these quantiles is the probit transformation, which is equal to d. This correspondence does not depend on the choice of the cut-off.

Figure 1 Cohen’s d and the probit transformation.

With the continuous outcome, we can estimate Cohen’s d with the standardized difference of means, which is SMD = \frac{m_1 - m_2}{s}, where m_1 and m_2 are the sample averages and s is the pooled sample standard deviation.

With the dichotomized outcome, we can estimate Cohen’s d with the probit transformation of the responder proportions, which is PBIT = \Phi^{-1}(p_2) - \Phi^{-1}(p_1), where \Phi^{-1} is the quantile function of the standard normal distribution.
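To see the equivalence numerically, here is a small simulation sketch (my own illustration, not code from the paper). We draw two large normal samples a distance d = 0.5 apart, dichotomize at an arbitrary cut-off, and compute both estimators; the SMD and the PBIT both land close to 0.5.

set.seed(1)
d = 0.5                          # true Cohen's d
n = 1e5                          # large samples, so sampling noise is small
x1 = rnorm(n, mean = 0, sd = 1)  # control arm
x2 = rnorm(n, mean = d, sd = 1)  # treated arm

# SMD from the continuous outcome
s = sqrt((var(x1) + var(x2))/2)  # pooled standard deviation
SMD = (mean(x2) - mean(x1))/s

# PBIT from the dichotomized outcome; the cut-off (here 0.8) is arbitrary
p1 = mean(x1 > 0.8); p2 = mean(x2 > 0.8)
PBIT = qnorm(p2) - qnorm(p1)

c(SMD = SMD, PBIT = PBIT)        # both approximately 0.5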

Information loss

It is clear that dichotomization causes some loss of information. The extent of this loss depends on the cut-off, with more imbalance between responders and non-responders resulting in a greater loss of information.

If the continuous outcome has the normal distribution, then both PBIT (=the probit transformation of the responder proportions) and SMD are estimating the same population parameter, namely Cohen’s d. Therefore, we can quantify the loss of information by comparing their (sampling) variances. The ratio of the sampling variances is known as the relative efficiency, which we denote by R.

The relative efficiency may be interpreted as the proportion of information that is retained after dichotomization. Hence, 100\% \times (1-R) is the percentage of information lost. R may also be interpreted in terms of a reduction of the sample size. Recall that the sampling variance is inversely proportional to the sample size; if we double the sample size, then the sampling variance is halved. Therefore, the factor by which the sample size would have to be increased to compensate for the loss of information is equal to 1/R.

R is maximal when the numerical observations are dichotomized into groups of equal size, in which case it equals 2/\pi, which is approximately 0.64. So, even in the most favorable case, the loss of information due to dichotomization would have to be compensated by an increase in sample size by a factor of 1/0.64 = 1.57.
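As a quick numerical check (a sketch only, assuming equal arm sizes and a negligible effect size, so that the sampling variance formulas used in the code below simplify), the relative efficiency at a cut-off with responder proportion p reduces to dnorm(qnorm(p))^2/(p*(1-p)), which peaks at p = 0.5:

# relative efficiency as a function of the responder proportion p
# (equal arm sizes, effect size close to zero)
R = function(p) dnorm(qnorm(p))^2/(p*(1 - p))

R(0.5)                            # 0.6366..., i.e. 2/pi
2/pi                              # 0.6366198
1/R(0.5)                          # 1.57: required inflation of the sample size
R(c(0.1, 0.25, 0.5, 0.75, 0.9))   # the further the cut-off is from the median, the worse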

We do not need the original, continuous outcome to assess the information loss due to dichotomization. The responder proportions in both arms are enough to make an approximation. For example, the CHEST-1 trial had a continuous endpoint. In a follow-up analysis, this endpoint was dichotomized for a responder analysis. After dichotomization, there were 92 responders among 173 subjects in the treated group, and 21 responders among 92 subjects in the control group.

The following R code computes the probit transformation PBIT, its sampling variance, the approximate sampling variance of the SMD and the relative efficiency.

rel_eff = function(events1, events2, n1, n2){
  # responder proportions and their probit transforms
  p1 = events1/n1; z1 = qnorm(p1)
  p2 = events2/n2; z2 = qnorm(p2)
  # probit transformation of the responder proportions (estimates Cohen's d)
  PBIT = z2 - z1
  # delta-method sampling variance of PBIT
  v_PBIT = 2*pi*p1*(1-p1)*exp(z1^2)/n1 + 2*pi*p2*(1-p2)*exp(z2^2)/n2
  # approximate sampling variance of the SMD
  v_SMD = (n1+n2)/(n1*n2) + PBIT^2/(2*(n1+n2))
  # relative efficiency: proportion of information retained after dichotomization
  rel_eff = v_SMD/v_PBIT
  data.frame(PBIT, v_PBIT, v_SMD, rel_eff)
}

Running this code, we get
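For the CHEST-1 counts above, taking the control arm as group 1, the call and its approximate output are:

rel_eff(events1 = 21, events2 = 92, n1 = 92, n2 = 173)
# approximately: PBIT 0.82, v_PBIT 0.030, v_SMD 0.018, rel_eff 0.60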

So, the dichotomization effectively reduced the sample size by a factor of 0.6. In other words, the information loss due to dichotomization would have to be compensated by increasing the sample size by a factor of 1/0.6 = 1.67.

Sample Size Calculations

Suppose that somebody is planning a trial where a continuous outcome is to be dichotomized to determine the responders and non-responders to some treatment. In that case, the sample size calculation will be based on the assumed responder probabilities in the two groups.

As an example, suppose that we expect the responder probabilities to be p_1=0.3 and p_2=0.5. We calculate the required sample size to have 80% power.

No additional information is needed to also compute the required sample size if the outcome were not dichotomized. We start by applying the probit transformation to the responder probabilities. Assuming normality, this is the same as Cohen’s d. Therefore we can now run a sample size calculation for a two-sample t-test, setting the difference of means equal to Cohen’s d and the standard deviation equal to 1. We find
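A sketch of both calculations with base R’s power functions (using the default two-sided 5% significance level; n is the per-group sample size):

# sample size for comparing the responder proportions 0.3 and 0.5
power.prop.test(p1 = 0.3, p2 = 0.5, power = 0.8)  # n is approximately 93 per group

# probit-transform the responder probabilities to get Cohen's d under normality
d = qnorm(0.5) - qnorm(0.3)                       # approximately 0.52

# sample size for a two-sample t-test with this difference of means and sd = 1
power.t.test(delta = d, sd = 1, power = 0.8)      # n just over 58, i.e. 59 per group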

So, in this example, the t-test requires only 59/93 = 0.63 times as many subjects as the proportions test.

Cochrane Database of Systematic Reviews

The Cochrane Database of Systematic Reviews (CDSR) is arguably the most comprehensive collection of evidence on medical interventions. We use the primary efficacy outcomes of randomized controlled trials (RCTs) from the CDSR. We removed trials with fewer than 10 or more than 1000 participants. Other filtering steps, which may be inspected in the online supplement of the paper, resulted in the primary efficacy results of 21,435 unique RCTs. Of these trials, 7,224 (34%) have a continuous (numerical) outcome and 14,211 (66%) have a binary outcome. The proportion of binary outcomes changes over time, but has always been very high. It seems to have reached a minimum around 2010 and then stayed more or less constant (Figure 2).

Figure 2: The proportion of trials with a binary outcome.

The median sample size of the RCTs with a continuous outcome is 58 (IQR 33 to 104). For binary trials, the median sample size is 96 (IQR 51 to 197). Among trials with a continuous outcome, 38% are statistically significant. Among those with a binary outcome, 25% are statistically significant. Thus, despite their much larger sample sizes, the binary trials reach statistical significance much less often. This suggests that researchers do increase the sample size to account for the fact that binary outcomes are less informative than continuous ones, but not sufficiently. We will now study this in more detail.

We used natural regression splines with three degrees of freedom to regress the standard error on the square root of the sample size (Figure 3, left panel). We see that for any given sample size, trials with a binary outcome have less information than those with a continuous outcome. We also regressed a binary indicator of whether a trial was statistically significant on the square root of the sample size (Figure 3, right panel). We conclude from Figure 3 that for any given sample size, RCTs with continuous outcomes carry more information and have greater statistical power than those with binary outcomes.

Figure 3: Left panel: The standard error versus the square root of the sample size. Right panel: The proportion of significant trials versus the square root of the sample size.
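For readers who want to set up this kind of spline regression on their own data, here is a minimal sketch; the data frame cdsr and its column names (se, n, sig, type) are hypothetical stand-ins, not the actual variables of the paper.

library(splines)

# one row per trial: standard error (se), sample size (n),
# significance indicator (sig), and outcome type (type: "binary"/"continuous")

# left panel: standard error versus the square root of the sample size
fit_se = lapply(split(cdsr, cdsr$type), function(dat)
  lm(se ~ ns(sqrt(n), df = 3), data = dat))

# right panel: probability of a significant result versus sqrt(sample size)
fit_sig = lapply(split(cdsr, cdsr$type), function(dat)
  glm(sig ~ ns(sqrt(n), df = 3), family = binomial, data = dat))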

Next, we used natural regression splines with three degrees of freedom to regress the square root of the sample size on the magnitude of the estimated effect (Figure 4, left panel). This effect is the SMD for the continuous trials, and the PBIT (=probit transformation of the responder proportions) for the binary trials. Similarly, we also regressed the standard error on the magnitude of the estimated effect (Figure 4, right panel). At any given effect size, we see that the average sample size of the binary trials is larger than the average sample size of the continuous trials. However, the increased sample size is not enough; the standard error of the binary trials is still larger on average than that of the continuous trials.

Figure 4: Left panel: The square root of the sample size versus the magnitude of the estimated effect. Right panel: The standard error versus the magnitude of the estimated effect. For trials with a continuous outcome, the effect is the SMD. For trials with a binary outcome, the effect is the PBIT.

Figure 4 demonstrates that researchers do increase the sample size to compensate for the fact that binary outcomes are less informative than continuous outcomes, but not sufficiently. Consequently, trials with binary outcomes have lower power than trials with continuous outcomes. This is demonstrated in Figure 5, where we regressed a binary indicator of whether a trial was statistically significant on the estimated effect size (SMD or PBIT).

Figure 5: The proportion of significant results versus the magnitude of the effect. For trials with a continuous outcome, the effect is the SMD. For trials with a binary outcome, the effect is the PBIT.

Finally, we used the rel_eff() function from the first part of this blog post to approximate the relative efficiency of all the RCTs with a binary outcome (Figure 6). We find that many trials have a relative efficiency near the statistically optimal value of 0.64. This may indicate that the cut-off is sometimes not chosen for clinical relevance, but rather to limit the loss of statistical power. There are also many trials where the responders and non-responders are not well balanced, and consequently the loss of information is much greater.

Figure 6: The distribution of the relative efficiency of the trials with a binary outcome from the CDSR.

The information loss we see in Figure 6 is all the more serious as two thirds of the randomized trials in the CDSR have a binary outcome as the primary efficacy endpoint. Of course, not all trials with a binary outcome result from dichotomization of a continuous outcome. However, if a trial does involve dichotomization, then the loss of information is avoidable.

Discussion

It is believed by some that categorization of noisy continuous variables reduces measurement error. In fact, the opposite is true. For example, a systolic blood pressure (SBP) threshold of 140 mmHg leads to classification of an observed SBP of 141 mmHg as hypertensive. If the true SBP were actually 139 mmHg, the resulting misclassification represents an error of the worst kind. A continuous analysis would also be affected, but not nearly as much (just by 141 versus 139).

The magnification of the noise around the cut-off is part of the reason why dichotomization leads to a loss of information. The other part is that all measurements above (or below) the cut-off are lumped together. For example, the difference between 141 mmHg and 180 mmHg is lost. The latter is referred to as a “hypertensive crisis” which is a medical emergency.

Some have argued that dichotomization leads to effect sizes (such as risk differences, numbers needed to treat, risk ratios and odds ratios) that are clinically more relevant or more easily interpreted than the effect sizes associated with continuous outcomes (such as difference of means or standardized difference of means). We do not find this argument compelling, but ultimately the question of clinical relevance and ease of interpretation may remain a matter of opinion. The empirical evidence from the CDSR, however, is unambiguous. It is quite clear that trials with binary endpoints have larger sample sizes on average than trials with continuous endpoints, while a lower proportion reaches statistical significance.


A few comments from a family physician with very little formal training in statistics [so take them with a big grain of salt…:slight_smile: ]:

“Unfortunately, so far, this has not had the desired result as the majority of clinical trials continue to have low-information binary outcomes (Figure 2 below). We hope to add some new ways to dissuade clinical researchers from dichotomizing continuous outcomes.”

The first question I would ask is “WHY have previous educational efforts failed to discourage the widespread practice of dichotomization?” Possible reasons:

  • Insufficient expertise among trial statisticians- maybe many/most are unaware of the pitfalls of dichotomization (?); and/or
  • Disagreements among expert statisticians that dichotomization has a meaningful negative impact on trial power (in which case, maybe an empiric demonstration of the impact could help to convince “hold-out” statisticians, if similar demonstrations haven’t been published previously); and/or
  • General agreement among trial statisticians that dichotomization is a suboptimal practice, but inability of statisticians to convince trial clinicians, in a language that they can understand, that it is a major problem.

Once you have decided which of the above explanations is most likely, then you can decide how best to adjust the content of your article. In other words, it’s important to be clear about your target audience.

If your goal is to convince as-yet-unconvinced trial statisticians not to dichotomize when they design a trial, then a complex, math- and symbols-heavy presentation (like the one you’ve outlined above) would likely be fine. But if your goal is to convince clinicians not to dichotomize (i.e., that they need to listen to their trial statistician’s advice :slight_smile: ), you will need to…how do I say this without offending esteemed colleagues…“dumb things down” considerably- to be clear, I’m using myself as a benchmark here…

If you try to convince clinicians of the wrongness of their approach using terms like “probit transformation,” nearly all will reflexively curl up into the fetal position. When they eventually uncurl, they will, feeling completely overwhelmed, insist that things be done the way they’ve always been done (according to other publications in their field). If you want to convince them to change their ways, you will need to speak to them in terms they can understand. Many will fear admitting they don’t understand and many will pretend they do understand in order to save face. To this end, I wonder if there’s a way that you could explain the pitfalls of dichotomization in a purely narrative way (without graphs), using simple everyday analogies (“explain it to me like I’m five years old”…).

A final option, if you’re not sure who your target audience should be, would be to present two versions of your article- one for an expert statistical audience (perhaps the version you have presented here) and the other for a lay clinical audience.

“In this paper, we quantify the information loss due to dichotomization in (clinical) trials, which is also called ‘responder analysis’.”

Maybe I just don’t understand these concepts deeply enough (the most likely explanation), but this is the first time I’ve seen the terms “dichotomization” and “responder analysis” used synonymously. I always thought these concepts were separate, yet somewhat related to each other…The only way I can make sense of your conflation of these terms is by considering that the act of “classifying” the level of a patient’s biomarker at the end of a trial (i.e., defining where the result lies relative to some defined cut-point) and comparing it to the level of his biomarker at the beginning of the trial implicitly promotes the mistaken idea that any category change that occurred for that patient during the trial was “caused” by the treatment being tested (?) But, as noted in this other thread Causal inferences from RCTs- could “toy” clinical examples promote understanding?, causality at the level of an individual patient can NOT be assessed given the way most RCTs are designed.

“It is clear that dichotomization causes some loss of information. The extent of this loss depends on the cut-off, with more imbalance between responders and non-responders resulting in a greater loss of information….

If the continuous outcome has the normal distribution, then both PBIT (=the probit transformation of the responder proportions

…Suppose that somebody is planning a trial where a continuous outcome is to be dichotomized to determine the responders and non-responders to some treatment…

…suppose that we expect the responder probabilities to be…

…There are also many trials where the responders and non-responders are not well balanced, and consequently the loss of information is much greater.”

General comment- I’m not sure that it’s a good idea to use the terms “responder” and “non-responder” throughout your article, since you would be promoting the idea that use of these terms is acceptable, when it’s not actually acceptable (?) My understanding is that it’s not possible to distinguish “responders” from “non-responders” in an RCT, since RCTs are virtually never designed in a way that allows us to infer causality for individual trial participants (?)


decoster2009.pdf (263.0 KB)
This article provides some reasons researchers dichotomize even when they know it is not optimal. I don’t find their reasons compelling.


I’d agree that “dichotomisation” (UK spelling!) and “responder analysis” aren’t synonymous, though related. I would use dichotomisation to mean something broader than responder analysis - for example, dichotomising covariates or results (into “significant” and “non-significant”) would also be dichotomisation, but aren’t responder analysis. According to me anyway!


IMO Europe should take the lead in everything now, spelling included.


It might be good in the paper to give some concrete examples of entrenched bad practices that involve dichotomisation (or data reduction more generally). Often this happens because a particular way of measuring outcomes has become accepted in a field. Sometimes lots of effort has been put into promoting a particular type of outcomes by “opinion leaders” - this becomes very hard to push back against.

One of the things that I get annoyed about (maybe unreasonably) is “best response” in cancer trials. This seems an abuse of data to me, involving multiple levels of classification and dichotomisation. First, we collect measurements of tumours over time; then we classify them according to arcane rules (which are essentially just made up) into a small number of categories; then we just pick the best as the outcome. This seems far removed from what is actually happening for the patient.


ESMD, Thank you so much for your comments - exactly what we hoped for by posting!

You’re right that dichotomization (or dichotomisation) and responder analysis are not the same. The paper is really about the cost of “responder analysis”, that is dichotomization of the outcome in clinical trials. We should probably change the title to “An Empirical Assessment of the Cost of Responder Analysis”.

You also make a good point by asking why clinical researchers continue to do responder analysis. I’m a statistician at a university medical center, and my impression is that they feel continuous outcomes are noisy and dichotomization removes that noise. So, we wanted to show in various ways that the opposite is true.

You suggest we explain the loss of information as if to a 5-year-old. I understand your point, and in fact this paper is our best attempt to explain the loss of information as simply as we can, to provide easy methods of calculation and to show clear figures. Now, I work with doctors every day, and I truly have great respect for their knowledge, skill and commitment. But they must understand that data analysis is not for 5-year-olds.

Thanks again for your comments and suggestions. We’ll surely make some changes to the paper!


trumanfrancis, thanks for the reference!

Simon, I agree. The paper is really about the cost of “responder analysis”, that is dichotomisation of the outcome in clinical trials.

You wrote: “Often this happens because a particular way of measuring outcomes has become accepted in a field.” We cite a review of neurological trials with ordinal outcomes (such as the mRS and GOS-E) where almost every cut-off has been used!


Or change it to “An empirical assessment of the cost of dichotomization of the outcome of clinical trials”. I as a clinician wouldn’t have known that responder analysis means using dichotomized outcomes in clinical trials. I thought responder analysis meant analyzing only those who had a positive outcome in only the treatment arm of the trial.


Thanks for responding Dr. van Zwet

“But they must understand that data analysis is not for 5-year-olds.”

I agree with you. I don’t dabble in the stock market because I would have no clue what I’m doing. But, with utmost respect, ignoring the irrationality of human decision-making is not going to produce the change you want to see:

  • Multiple attempts to address this problem, to date, have failed. As noted in the link provided in the third post above, people have already assessed the reasons for these failures;
  • Statisticians complain frequently that they get “overruled” by clinicians on important methodologic design decisions;
  • Unless the implied hierarchy in decision-making for clinical trial design suddenly changes, to give statisticians the final say on these matters, they will have to reflect on why they haven’t been able to convince clinicians to change their ways;
  • Clearly, the phrase “trust me, I’m a statistician” isn’t going to cut it when the goal is large-scale behaviour change (though this phrase would cut it for me personally);
  • If I’m being completely frank, statisticians, as a group, are not good written communicators. Many are clearly brilliant- they have to be. Their subject matter is a cognitively demanding mix of math, philosophy, and science. But if nobody can ever understand the principles underlying their recommendations (because they’re written in hieroglyphics that virtually nobody with decision-making power understands), progress in their field will be stymied. To date, I haven’t read a single article on responder analysis, written by a statistician, that isn’t eye-wateringly confusing to me as a physician. As a family doctor, I toil every day to convert medical lingo into terms that my patients can understand. This is an essential part of my job- if I don’t take the time to ensure that I’m being understood, the patient is incapable of making an informed decision about therapies I recommend. And, more often than not, a patient who doesn’t understand is biased toward inaction. Use this analogy as you see fit.

Dear Erin (I just noticed that I can see your name if I click on the icon)

This is a very interesting and important question: What kind of explanation may clinicians expect from statisticians, and what level of statistical sophistication may statisticians expect from clinicians? Maybe worth a separate Datamethods topic?

As I mentioned, I’m a statistician at a university hospital. I usually meet face to face with my clinical collaborators, and then I can explain what they need to know. In a methods paper, that’s much harder to do.

For example: With this paper about dichotomization / responder analysis, the key is to compare like with like. The probit transformation (PBIT) of the responder probabilities is directly comparable with the standardized mean difference (SMD) of the continuous outcome. I know that this PBIT is not so familiar to many clinicians, so I tried to explain with a picture (Figure 1 above).

In a face to face consultation, I would take my time to explain the Figure, and maybe calculate a little example. This would take half an hour, and my clinical collaborator would listen to me and maybe ask questions. If I did the same in my methods paper, it would be twice as long and no clinician would have the patience for that. So, what can I do?

Erik


I like that, because many binary outcomes such as death are not what most people mean by responder analysis.


What a fantastic effort and comments! This empirical evaluation is a sorely needed evidence-based justification why outcome dichotomization is generally a bad idea.

I agree that the term “responder analysis” is confusing and should not be used in the title. In oncology, responder analysis is often used, e.g., to describe features of patients who responded versus those that did not (see example Figure 4 in our recent phase I trial paper here). While it does typically involve dichotomization (or some type of categorization), such responder analysis is orthogonal (and far more exploratory) to the main purpose of RCTs which is to compare outcomes between treatment groups (tremendously enjoyed BTW this just published overview of randomization).

Categorizing outcomes is probably more information-losing than categorizing a predictor variable. For the latter, we have noted here that 35% of oncology phase 3 RCTs converted a continuous predictor to a categorical variable for the purposes of stratified regression.

As a counterpoint, here is my favorite recent defense of dichotomization. The arguments there do not change the fact that outcome categorization in the design and analysis of RCTs is information-losing.


Pavlos: Yes, dichotomizing p-values into “significant” and “non-significant” is a whole other can of worms! It has its uses, as Tunc, Tunc and Lakens argue, but there is much more one can get out of a p-value, see A New Look at P Values for Randomized Clinical Trials


Hi Erik

“So, what can I do”?

Since I don’t know what I’m talking about, statistically-speaking, I can’t give you any content-related suggestions. But I will flag this excellent slideshow that I just came across on Dr. Harrell’s “Statistical Thinking” site:

https://hbiostat.org/talks/bthink.html

His goal with this talk is clear- to de-mystify Bayesian thinking for a non-expert audience. Very little jargon, extensive and very effective use of simple analogies. This is a great example of effective communication with clinicians/lay audiences. I don’t know whether this helps…

Kind regards,
Erin


Thanks Erin, that does help. I think you’re right that I should look into other forms of communication, like Frank’s slide show or even make some sort of animation. I know there are some businesses that do stuff like that, but that will cost money. Maybe I can do something myself.


“In oncology, responder analysis is often used, e.g., to describe features of patients who responded versus those that did not (see example Figure 4 in our recent phase I trial paper here 2). While it does typically involve dichotomization (or some type of categorization), such responder analysis is orthogonal (and far more exploratory) to the main purpose of RCTs which is to compare outcomes between treatment groups.”

Pavlos- these statements highlight just how much nuance is needed when discussing the pitfalls of “responder analysis.” It seems really important to explicate these nuances in language that clinicians will understand. Understanding why this concept is problematic requires not just statistical intuition but also the ability to reason clinically.

As noted in the “Causal inferences from RCTs” thread (linked in the second post above), attempting to assess causality/treatment “response” at the level of individual patients enrolled in an RCT is often a phenomenologically invalid exercise. Reasons include the fact that 1) many medical conditions (e.g., asthma) have a waxing/waning natural history; and 2) many clinical events (e.g., compression fracture) are not amenable to assessment of the effects of drug dechallenge/rechallenge. In these types of clinical scenarios, causality either can’t be determined at the level of individual RCT participants unless the effect is replicated at the level of the individual (e.g., via dechallenge/rechallenge in a crossover or N-of-1 design)- as in the asthma example, or can’t be determined with any certainty AT ALL because the outcome is permanent- as in the compression fracture example.

However, as you note, there IS a subset of clinical phenomena for which clinicians will be able to assign causality at the level of an individual patient, even though we have NOT been able to witness the effect of either drug dechallenge or rechallenge. Malignancies have a directionally predictable trajectory in the absence of treatment- they will only get worse over time (not fluctuate, like asthma does). Sometimes deterioration is quick (e.g., pancreatic cancer) and sometimes deterioration is very slow (e.g., indolent lymphomas or some prostate cancers). The key point is that malignancies don’t spontaneously improve over time. We don’t see them melt away on imaging in the absence of treatment; if this occurs, the diagnosis of malignancy was likely incorrect. Therefore, if we see evidence of tumour burden lessening over time in a patient who has been exposed to a therapy, it’s very reasonable to conclude that it must have been the therapy that caused that improvement, even though we haven’t tested the patient’s response to treatment dechallenge and then rechallenge (criteria which are considered essential for assessing individual-level causality in the context of medical conditions for which a waxing/waning, rather than steadily deteriorating clinical course is the norm). If a malignant tumour shrinks over time after an intervention, it would be valid to infer that the patient had an “objective response” to the therapy. Of course, inferring treatment efficacy in individual patients in the context of a single-arm early phase cancer study requires that the techniques we use to assess tumour burden serially/over time are reliable.

You are pointing out that while certain clinical scenarios (especially oncology) DO allow us to distinguish “responders” from “non-responders” to a therapy, the study design context in which this is done MATTERS. Specifically, Phase 1 is not Phase 3. While “responder analysis” might be reasonable in the context of a single arm Phase 1 oncology study, it is NOT going to be useful in the context of a Phase 3 RCT, for which the main goal is between-arm comparison. Researchers who try to apply an individual-level “responder analysis” in the context of trials with multiple arms (e.g., Phase 3 RCTs) are betraying a fundamental misunderstanding about the entire purpose of Phase 3 trials. In order to understand the purpose of Phase 3 trials, researchers must internalize (deeply) the purpose of concurrent control.

Let’s be explicit about the reason(s) why we would criticize the notion of calling a patient in a multi-arm randomized trial (e.g., pivotal Phase 3 RCT) a “responder” but not criticize the idea of calling a single arm Phase 1 oncology study patient a “responder.” In other words, why is it considered nonsensical to run “randomized non-comparative trials” (Randomized non-comparative trials: an oxymoron?) but okay to assess “Objective Response Rate” in a single arm Phase 1 study of a cancer treatment (?) At first glance, criticizing the former process but not the latter process seems hypocritical. Let’s be absolutely clear about why these two positions are not actually in conflict.

If we agree that tumour shrinkage can, phenomenologically, be causally attributed to therapy at the level of an individual patient in a single arm Phase 1 oncology study, why should we avoid the temptation to “drill down” to individual patients enrolled in a Phase 3 oncology RCT? For that matter, if tumour shrinkage over time can reliably signal that a therapy is “biologically active,” why don’t we just approve ALL such therapies after Phase 1 and completely skip Phases 2 and 3? Why not just approve all drugs for which we can document tumour shrinkage on imaging after exposure? Of course, the answer is that, when approving new therapies of any kind, biological activity (e.g., tumour-shrinking ability) is not the ONLY important feature of the treatment that regulators need to consider. In most disease areas, RCTs are not comparing a new therapy with NO therapy (or inert placebo), but rather a NEW therapy against “standard of care” therapy. Over time, we want: 1) our treatments to become more efficacious than existing therapies so that patient outcomes will improve; AND 2) our treatments to become less toxic (physically and financially), so that 3) benefit/harm ratios for our therapeutic arsenal become more positive. If a new oncology drug shrinks patients’ tumours dramatically over a short time (as initially noted in a Phase 1 study), but, as noted during a Phase 3 pivotal trial, does so at a similar rate to the standard of care drug (an assessment that requires a between-arm/comparative analysis), yet with much higher rates of intolerable side effects (also a comparative analysis), then we might not want to approve that therapy (or at least not as a “first-line” treatment).

In short, while a demonstration of “biologic response” might be a valid goal of a Phase 1 oncology study (and is an important step in the search for therapies with enough promise in humans to advance to later phase studies), it will be an insufficient bar for judging whether or not to approve an oncology drug (except, perhaps, in the case of highly aggressive diseases with, universally, very poor prognoses). By the time we get to Phase 3, we are past the point of being concerned only with demonstrating biological activity of the therapy (and therefore “drilling down” to the level of individual patients in an attempt to assess tumour response)- this is NOT the primary goal of a Phase 3 trial. And this is why “responder analysis,” as conducted using a “randomized non-comparative” design, makes no sense at all. Why have more than one study arm if you’re not going to compare arms in some way, but rather just focus on individual responses within arms? Such designs betray a researcher’s failure to understand the purpose/value of concurrent control. Once researchers have enough information about a new therapy to enrol patients in an RCT (i.e., a 2-arm study), they are saying that their primary goal is no longer the study of individual patients- it is the weighing of risks and benefits comparatively, at the level of groups, to determine whether or not the drug should be approved, and in which clinical context(s).

As you know, I’m not an oncologist OR a statistician, so it’s possible I have all this wrong- happy to be corrected :slight_smile:


Pretty much agree with the above. Note that in our manuscript we use response as a continuous variable (e.g., in Figure 4d) and generally do not categorize until later in the analyses. We are always bound of course by problematic conventions such as the limitations of the RECIST criteria used to assess response by imaging in oncology etc.

Note also that our “responder analysis” is not done in a vacuum. We have plenty of functional experimental data and other preclinical and clinical analyses (observational via correlatives and experimental via trials) that guide our analyses. Even right now I am brainstorming these pathways with my collaborators via text messages. Without that context, any such attempts are likely to be lost in a hopeless maze of possibilities. Basing inferences using clinical data alone is typically a bad idea outside of comparative inferences from RCTs, and even then there are caveats. Essentially: don’t try this at home :slight_smile:

BTW, your connection with randomized non-comparative trials (RNCTs) is extremely insightful. Indeed, collaborators focused on the type of responder analyses in our manuscript do not intuitively find anything wrong with RNCTs. Part of the reason why this design is so insidious is that it takes advantage of cognitive blind spots in highly accomplished scientists not familiar with the mechanics of randomization.


This is fascinating & important. If we can use this with my physician friends to get the message across that if you dichotomise you will need to increase the sample size by (at least) X, then it will be very valuable.

A few thoughts. Reading:

and the comment by Frank, I wondered if you included necessary binary outcomes (like death) in the analysis of the Cochrane studies? Would it be better just to exclude those as they can’t be continuous variables?

Rather than say above “are statistically significant” just state that the p-value was below some pre-defined (magical) value :slight_smile: I know what you are getting at, and physicians may respond to it, but I cringe at some of the wording.

I found this statement fascinating, and it is somewhat surprising that so many trials have outcome groups of approximately the same size. Your comment suggests to me that the dichotomisation was not chosen a priori, but only after data were available. Did you look at some of these studies to see what was going on?

Finally, thank you. I appreciate the work and thought gone into this.
