I like that, because many binary outcomes such as death are not what most people mean by responder analysis.
What a fantastic effort and comments! This empirical evaluation is a sorely needed, evidence-based justification for why outcome dichotomization is generally a bad idea.
I agree that the term “responder analysis” is confusing and should not be used in the title. In oncology, responder analysis is often used, e.g., to describe features of patients who responded versus those who did not (see example Figure 4 in our recent phase I trial paper here). While it does typically involve dichotomization (or some type of categorization), such responder analysis is orthogonal (and far more exploratory) to the main purpose of RCTs, which is to compare outcomes between treatment groups (BTW, I tremendously enjoyed this just-published overview of randomization).
Categorizing outcomes is probably more information-losing than categorizing a predictor variable. For the latter, we have noted here that 35% of oncology phase 3 RCTs converted a continuous predictor to a categorical variable for the purposes of stratified regression.
As a counterpoint, here is my favorite recent defense of dichotomization. The arguments there do not change the fact that outcome categorization in the design and analysis of RCTs is information-losing.
Pavlos: Yes, dichotomizing p-values into “significant” and “non-significant” is a whole other can of worms! It has its uses, as Tunc, Tunc and Lakens argue, but there is much more one can get out of a p-value; see A New Look at P Values for Randomized Clinical Trials.
Hi Erik
“So, what can I do?”
Since I don’t know what I’m talking about, statistically speaking, I can’t give you any content-related suggestions. But I will flag this excellent slideshow that I just came across on Dr. Harrell’s “Statistical Thinking” site:
https://hbiostat.org/talks/bthink.html
His goal with this talk is clear- to demystify Bayesian thinking for a non-expert audience. Very little jargon, and extensive, very effective use of simple analogies. This is a great example of effective communication with clinicians/lay audiences. I don’t know whether this helps…
Kind regards,
Erin
Thanks Erin, that does help. I think you’re right that I should look into other forms of communication, like Frank’s slide show or even make some sort of animation. I know there are some businesses that do stuff like that, but that will cost money. Maybe I can do something myself.
“In oncology, responder analysis is often used, e.g., to describe features of patients who responded versus those who did not (see example Figure 4 in our recent phase I trial paper here). While it does typically involve dichotomization (or some type of categorization), such responder analysis is orthogonal (and far more exploratory) to the main purpose of RCTs, which is to compare outcomes between treatment groups.”
Pavlos- these statements highlight just how much nuance is needed when discussing the pitfalls of “responder analysis.” It seems really important to explicate these nuances in language that clinicians will understand. Understanding why this concept is problematic requires not just statistical intuition but also the ability to reason clinically.
As noted in the “Causal inferences from RCTs” thread (linked in the second post above), attempting to assess causality/treatment “response” at the level of individual patients enrolled in an RCT is often a phenomenologically invalid exercise. Reasons include the fact that 1) many medical conditions (e.g., asthma) have a waxing/waning natural history; and 2) many clinical events (e.g., compression fracture) are not amenable to assessment of the effects of drug dechallenge/rechallenge. In these types of clinical scenarios, causality either can’t be determined at the level of individual RCT participants unless the effect is replicated at the level of the individual (e.g., via dechallenge/rechallenge in a crossover or N-of-1 design)- as in the asthma example, or can’t be determined with any certainty AT ALL because the outcome is permanent- as in the compression fracture example.
However, as you note, there IS a subset of clinical phenomena for which clinicians will be able to assign causality at the level of an individual patient, even though we have NOT been able to witness the effect of either drug dechallenge or rechallenge. Malignancies have a directionally predictable trajectory in the absence of treatment- they will only get worse over time (not fluctuate, like asthma does). Sometimes deterioration is quick (e.g., pancreatic cancer) and sometimes deterioration is very slow (e.g., indolent lymphomas or some prostate cancers). The key point is that malignancies don’t spontaneously improve over time. We don’t see them melt away on imaging in the absence of treatment; if this occurs, the diagnosis of malignancy was likely incorrect. Therefore, if we see evidence of tumour burden lessening over time in a patient who has been exposed to a therapy, it’s very reasonable to conclude that it must have been the therapy that caused that improvement, even though we haven’t tested the patient’s response to treatment dechallenge and then rechallenge (criteria which are considered essential for assessing individual-level causality in the context of medical conditions for which a waxing/waning, rather than steadily deteriorating clinical course is the norm). If a malignant tumour shrinks over time after an intervention, it would be valid to infer that the patient had an “objective response” to the therapy. Of course, inferring treatment efficacy in individual patients in the context of a single-arm early phase cancer study requires that the techniques we use to assess tumour burden serially/over time are reliable.
You are pointing out that while certain clinical scenarios (especially oncology) DO allow us to distinguish “responders” from “non-responders” to a therapy, the study design context in which this is done MATTERS. Specifically, Phase 1 is not Phase 3. While “responder analysis” might be reasonable in the context of a single arm Phase 1 oncology study, it is NOT going to be useful in the context of a Phase 3 RCT, for which the main goal is between-arm comparison. Researchers who try to apply an individual-level “responder analysis” in the context of trials with multiple arms (e.g., Phase 3 RCTs) are betraying a fundamental misunderstanding about the entire purpose of Phase 3 trials. In order to understand the purpose of Phase 3 trials, researchers must internalize (deeply) the purpose of concurrent control.
Let’s be explicit about the reason(s) why we would criticize the notion of calling a patient in a multi-arm randomized trial (e.g., pivotal Phase 3 RCT) a “responder” but not criticize the idea of calling a single arm Phase 1 oncology study patient a “responder.” In other words, why is it considered nonsensical to run “randomized non-comparative trials” (Randomized non-comparative trials: an oxymoron?) but okay to assess “Objective Response Rate” in a single arm Phase 1 study of a cancer treatment? At first glance, criticizing the former process but not the latter process seems hypocritical. Let’s be absolutely clear about why these two positions are not actually in conflict.
If we agree that tumour shrinkage can, phenomenologically, be causally attributed to therapy at the level of an individual patient in a single arm Phase 1 oncology study, why should we avoid the temptation to “drill down” to individual patients enrolled in a Phase 3 oncology RCT? For that matter, if tumour shrinkage over time can reliably signal that a therapy is “biologically active,” why don’t we just approve ALL such therapies after Phase 1 and completely skip Phases 2 and 3? Why not just approve all drugs for which we can document tumour shrinkage on imaging after exposure? Of course, the answer is that, when approving new therapies of any kind, biological activity (e.g., tumour-shrinking ability) is not the ONLY important feature of the treatment that regulators need to consider. In most disease areas, RCTs are not comparing a new therapy with NO therapy (or inert placebo), but rather a NEW therapy against “standard of care” therapy. Over time, we want: 1) our treatments to become more efficacious than existing therapies so that patient outcomes will improve; AND 2) our treatments to become less toxic (physically and financially), so that 3) benefit/harm ratios for our therapeutic arsenal become more positive. If a new oncology drug shrinks patients’ tumours dramatically over a short time (as initially noted in a Phase 1 study), but, as noted during a Phase 3 pivotal trial, does so at a similar rate to the standard of care drug (an assessment that requires a between-arm/comparative analysis), yet with much higher rates of intolerable side effects (also a comparative analysis), then we might not want to approve that therapy (or at least not as a “first-line” treatment).
In short, while a demonstration of “biologic response” might be a valid goal of a Phase 1 oncology study (and is an important step in the search for therapies with enough promise in humans to advance to later phase studies), it will be an insufficient bar for judging whether or not to approve an oncology drug (except, perhaps, in the case of highly aggressive diseases with, universally, very poor prognoses). By the time we get to Phase 3, we are past the point of being concerned only with demonstrating biological activity of the therapy (and therefore “drilling down” to the level of individual patients in an attempt to assess tumour response)- this is NOT the primary goal of a Phase 3 trial. And this is why “responder analysis,” as conducted using a “randomized non-comparative” design, makes no sense at all. Why have more than one study arm if you’re not going to compare arms in some way, but rather just focus on individual responses within arms? Such designs betray a researcher’s failure to understand the purpose/value of concurrent control. Once researchers have enough information about a new therapy to enrol patients in an RCT (i.e., a 2-arm study), they are saying that their primary goal is no longer the study of individual patients- it is the weighing of risks and benefits comparatively, at the level of groups, to determine whether or not the drug should be approved, and in which clinical context(s).
As you know, I’m not an oncologist OR a statistician, so it’s possible I have all this wrong- happy to be corrected.
Pretty much agree with the above. Note that in our manuscript we use response as a continuous variable (e.g., in Figure 4d) and generally do not categorize until later in the analyses. We are, of course, always bound by problematic conventions such as the limitations of the RECIST criteria used to assess response by imaging in oncology, etc.
Note also that our “responder analysis” is not done in a vacuum. We have plenty of functional experimental data and other preclinical and clinical analyses (observational via correlatives and experimental via trials) that guide our analyses. Even right now I am brainstorming these pathways with my collaborators via text messages. Without that context, any such attempts are likely to be lost in a hopeless maze of possibilities. Basing inferences on clinical data alone is typically a bad idea outside of comparative inferences from RCTs, and even then there are caveats. Essentially: don’t try this at home.
BTW, your connection with randomized non-comparative trials (RNCTs) is extremely insightful. Indeed, collaborators focused on the type of responder analyses in our manuscript do not intuitively find anything wrong with RNCTs. Part of the reason why this design is so insidious is that it takes advantage of cognitive blind spots in highly accomplished scientists not familiar with the mechanics of randomization.
This is fascinating & important. If we can use this with my physician friends to get the message across that if you dichotomise you will need to increase the sample size by (at least) X, then it will be very valuable.
A few thoughts. Reading:
and the comment by Frank, I wondered if you included necessary binary outcomes (like death) in the analysis of the Cochrane studies? Would it be better just to exclude those as they can’t be continuous variables?
Rather than say above “are statistically significant”, just state that the p-value was below some pre-defined (magical) value. I know what you are getting at, and physicians may respond to it, but I cringe at some of the wording.
I found this statement fascinating and somewhat surprising that so many trials have outcome groups of approximately the same size. Your comment suggests to me that the dichotomisation was not chosen a priori, but only after data was available. Did you look at some of these studies to see what was going on?
Finally, thank you. I appreciate the work and thought that has gone into this.
This is the framing I would use. When dealing with clinicians, I find that sample sizes are one of the only statistical considerations they make before asking me (a stat-curious health economist, not a statistician, granted) how to begin structuring their grant application or experiment.
In all honesty, I think the main reason for dichotomisation of any variable that I have experienced is just a lack of knowledge that you can do flexible non-linear transformations or just interpret the variable as-is, without having some label like obese/non-obese.
Edit: this was meant to be in agreement with John Pickering above - not sure the quotes came through.
If we can use this with my physician friends to get the message across that if you dichotomise you will need to increase the sample size by (at least) X, then it will be very valuable.
The comments on this blog made me realize that I should make a (shiny) app to do the calculations:
- Ex post: compute the loss of information from the observed responder proportions;
- Ex ante: compute the required sample size, with and without dichotomizing the outcome, from the hypothesized responder probabilities (a rough sketch of this calculation is below).
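Something like this for the ex ante part, assuming a normal outcome with a common SD, the usual normal-approximation sample-size formulas, and made-up responder probabilities (Python here rather than the eventual shiny app):

```python
# Sketch of the ex ante calculation: per-arm sample size when the outcome is analysed
# on the continuous scale vs. after dichotomizing at the cutpoint that defines "response".
from scipy.stats import norm

def n_continuous(delta, alpha=0.05, power=0.80):
    """Per-arm n for a two-sample comparison of means (standardized difference delta)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z / delta) ** 2

def n_dichotomized(p1, p2, alpha=0.05, power=0.80):
    """Per-arm n for a two-sample comparison of 'responder' proportions p1 vs p2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

# Hypothetical inputs: responder probability 0.50 in the control arm, 0.65 on treatment,
# i.e. the same cutpoint applied to two normal distributions shifted by delta.
p1, p2 = 0.50, 0.65
delta = norm.ppf(p2) - norm.ppf(p1)          # implied standardized mean difference
nc, nd = n_continuous(delta), n_dichotomized(p1, p2)
print(f"continuous: {nc:.0f}/arm, dichotomized: {nd:.0f}/arm, inflation: {nd/nc:.2f}x")
```

With these particular inputs the dichotomized analysis needs roughly 1.6 times as many patients per arm, in line with the familiar efficiency loss from splitting near the median.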
I wondered if you included necessary binary outcomes (like death) in the analysis of the Cochrane studies? Would it be better just to exclude those as they can’t be continuous variables?
For every trial in the Cochrane database, I included the efficacy endpoint which is reported first. Some are binary and some are continuous. The database does have short descriptions of the endpoints, so I could match for strings like “death”, “survival”, “mortality”, etc.
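The matching step itself would be simple; a small hypothetical sketch (the data frame, column name and strings are illustrative only, not the actual Cochrane fields):

```python
# Illustrative only: flag endpoint descriptions that look like mortality/survival endpoints.
import pandas as pd

endpoints = pd.DataFrame({"description": ["All-cause mortality at 1 year",
                                          "Change in FEV1 from baseline",
                                          "Overall survival"]})
is_mortality = endpoints["description"].str.contains(r"death|survival|mortality",
                                                     case=False, na=False)
print(endpoints[is_mortality])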
I found this statement fascinating and somewhat surprising that so many trials have outcome groups of approximately the same size. Your comment suggests to me that the dichotomisation was not chosen a priori, but only after data was available. Did you look at some of these studies to see what was going on?
Maybe I could spot-check some of those studies, but it would be quite a lot of work to go back to the protocols to see what happened. More generally, the shape of Figure 6 above is not easy to interpret, because we don’t know what it would look like if the cut-offs were really only decided on clinical relevance.
I’m sure there’s a lack of knowledge, but I believe that there is also a very strong intuitive sense that continuous measurements are noisy while binary ones are crisp. Dichotomization then feels like data cleaning or “de-noising”. Unfortunately, the opposite is true: dichotomization amplifies noise.
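Here is a toy illustration of that last point (my own sketch with arbitrary numbers): measure a latent true value twice with error, and compare how well the two measurements agree on the continuous scale versus after dichotomizing at a cutpoint.

```python
# Dichotomizing a noisy continuous measurement does not "de-noise" it: repeat
# measurements disagree about the binary label for anyone near the cutpoint.
import numpy as np

rng = np.random.default_rng(3)
truth = rng.normal(size=100_000)                 # latent true value
m1 = truth + rng.normal(0, 0.5, truth.size)      # two independent noisy measurements
m2 = truth + rng.normal(0, 0.5, truth.size)

print(np.corrcoef(m1, m2)[0, 1])                 # continuous-scale reliability, about 0.80
agree = np.mean((m1 > 0) == (m2 > 0))            # do the dichotomized labels agree?
print(2 * agree - 1)                             # chance-corrected agreement, about 0.59
```

The chance-corrected agreement of the binary labels falls well below the continuous-scale reliability, because any measurement error near the cutpoint flips the label.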
I am a clinician like Erin. I think that there are two issues here:
- Statisticians minimising the sample size required to achieve a specified P value under a specified null hypothesis with a specified power, indicating a high probability of a ‘real effect’.
- Clinicians interpreting the above information to arrive at clinical decisions (ideally in conjunction with a patient).
Statistical considerations
If the predictor variable (e.g., age) and outcome variable (e.g., BMI) are both numerical, it seems to me that the optimum P value would be arrived at by calculating a correlation coefficient and a P value based on a t-test. I did this and got a P value of 0.000000008. I then divided the (predictor) ages into two groups, up to 40 years and over 40 years (thus losing the details of actual age). A t-test of the difference in the numerical BMI outcome between the two groups gave a P value of 0.0000004. I then also divided the outcome variable into BMIs of up to 25 and over 25 and got a P value of 0.00001. So, as expected intuitively, the P value is lowest when both the predictor and outcome variables are numerical and highest when both are dichotomised, obscuring the data. The corollary is that the required sample size will be smallest when both predictor and outcome variables are numerical and largest when they are both dichotomised.
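For anyone who wants to reproduce this pattern, here is a rough simulation in the same spirit (the age/BMI data and effect sizes are invented, not the ones I used above):

```python
# P values from (1) continuous age vs continuous BMI, (2) dichotomized age vs
# continuous BMI, and (3) dichotomized age vs dichotomized BMI (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 300
age = rng.uniform(20, 60, n)
bmi = 22 + 0.08 * age + rng.normal(0, 3, n)     # weak positive association

# (1) both continuous: test the correlation
r, p_cont = stats.pearsonr(age, bmi)

# (2) dichotomize the predictor at 40 years, keep BMI continuous: two-sample t-test
old = age > 40
_, p_pred_dich = stats.ttest_ind(bmi[old], bmi[~old])

# (3) dichotomize both: 2x2 table of age group vs BMI > 25
high_bmi = bmi > 25
table = np.array([[np.sum(old & high_bmi), np.sum(old & ~high_bmi)],
                  [np.sum(~old & high_bmi), np.sum(~old & ~high_bmi)]])
_, p_both_dich, _, _ = stats.chi2_contingency(table)

# In most runs these increase in this order, reflecting the information lost at each step
print(p_cont, p_pred_dich, p_both_dich)
```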
Clinical decisions
In order to make any decision there is usually a trigger. One of these triggers is identifying the action that leads to the greatest expected utility (i.e. the best calculated bet). The clinician must therefore consider the options addressed by an RCT and the results arising from it. By making various assumptions and applying various reasoning processes, guesswork or calculations (ideally in conjunction with the patient), a decision is made. If the RCT results are probability densities in the form of distributions of some surrogate outcome on treatment and control (e.g. a highly statistically significant change in BMI due to a new weight-reducing drug), then their application can be difficult. I gave an outline of the complex rationale required in a previous post (https://discourse.datamethods.org/t/some-thoughts-on-uniform-prior-probabilities-when-estimating-p-values-and-confidence-intervals/7508/45?u=huwllewelyn ).
I have painful memories of a long discussion about fixed responders and fixed non-responders in regard to counterfactuals when they were classified into 4 groups of always recover responders, never recover responders, always benefit responders and always harm responders, without recognising that for stochastic reasons, individuals tend to move from one group to another in different studies (https://discourse.datamethods.org/t/individual-response/5191/226?u=huwllewelyn ). For this reason, I would agree that the term ‘responder’ should be avoided!
Great discussion of an important topic. Emphasizing the cost in terms of sample size seems like a very promising approach.
For that I would encourage moving well beyond the scenarios considered in the draft paper. Of most concern to me are the many situations in which effects are concentrated at extreme values; in those cases use of dichotomies or percentile categories (as is the norm) can bury effects, and the information loss can be far greater than what is portrayed in the usual statistical investigations (which have focused too much on effects on normal means).
For alternatives to categorized outcomes, it seems underappreciated that some ordinal regression methods require no categorization at all, and can be used for semiparametric regression analysis of continuous outcomes, without constraints on the residuals such as normality. An example is continuation-ratio regression, which can be seen as proportional-hazards regression applied to outcomes other than failure time. It is easily conducted using ordinary survival or logistic-regression software, e.g., see
Greenland S (1994). Alternative models for ordinal logistic regression. Statistics in Medicine, 13, 1665-1677
although I expect there are newer descriptions available (perhaps within textbooks).
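To make the mechanics concrete, here is a minimal sketch, not taken from the 1994 paper and using simulated data, of a forward continuation-ratio fit with ordinary logistic-regression software: expand the data so that each subject contributes one record for every level at which they were still “at risk,” then fit a logistic model with level-specific intercepts and a common covariate effect. A handful of ordered levels is used only to keep the expansion small; in principle every distinct value of a continuous outcome can serve as its own level.

```python
# Forward continuation-ratio regression via data expansion plus ordinary logistic regression.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)                                   # covariate (e.g., exposure)
latent = x + rng.normal(size=n)
y = np.digitize(latent, np.quantile(latent, [0.25, 0.5, 0.75]))   # ordered levels 0..3

# Expand: one record per subject per level k at which they were still "at risk",
# with an indicator of whether they "stopped" there (Y = k); the final level itself
# carries no information and contributes no record of its own.
top = int(y.max())
rows = [{"x": xi, "level": k, "stop": int(yi == k)}
        for xi, yi in zip(x, y)
        for k in range(min(int(yi), top - 1) + 1)]
d = pd.DataFrame(rows)

# Logistic regression of "stop" on level-specific intercepts plus x models
# P(Y = k | Y >= k, x): a discrete hazard-type (continuation-ratio) model.
X = pd.get_dummies(d["level"], prefix="lvl").astype(float).join(d[["x"]])
fit = sm.Logit(d["stop"], X).fit(disp=0)
print(fit.params)
```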
That brings up how outcomes which at first seem intrinsically binary aren’t necessarily so. Take death: Upon complete lifetime follow-up the death indicator is always 1; thus it contains no information about anything. A more informative analysis thus asks how long this inevitable event is forestalled, i.e., what are effects on survival time, a continuous outcome. Common survival analyses (e.g., logrank, proportional hazards) use the time ranks but not the full continuous information; hence Cox himself often opined that those methods were over-used, and that flexible but parametric survival models should be deployed to recover information beyond the ranks. Those models can be applied to any positive continuous outcome, and allow flexible continuous modeling of covariates as well, e.g., as described in
Greenland S (1995). Dose-response and trend analysis: Alternatives to category-indicator regression. Epidemiology, 6, 356-365.
In sum, alternatives to categorization of outcomes and covariates have been widely available in ordinary software for generations. I suggest that the persistence of unnecessary categorization and fallacious rationales for it reflects an educational problem synergized with a psychosocial problem of methodological conservatism (not unlike the continued defenses and misuses of NHST, which have persisted despite critiques dating back over a century and now numbering in the hundreds).
Wonderful thoughts Sander. My only suggestion is that the relative efficiency lost by not using actual survival times is almost unnoticeable.
Thanks Frank, but (surprise): Sorry, I must disagree (as I think would have Cox) about the inefficiency and other losses from not using the actual times. I think the lore about the loss being small stems from looking only at first-order asymptotic relative efficiencies for hazard ratios under proportional-hazards models. Cox himself expressed concern about over-reliance on such results and models: The loss depends heavily on the underlying form of the survival distributions, on what measure is targeted as the comparative summary, and on the smaller sample (higher-order) behavior of the chosen estimator.
While I don’t have the cites on hand (they are from 40+ years ago), with heavy censoring in time-dependent difference comparisons, the finite-sample efficiency gains from using realistic smooth hazard forms could exceed 50% due to their use of smoothness constraints and exact censoring-or-failure times. This may be no surprise in light of results on efficiency loss in nonparametric comparisons of continuous outcomes based on ranks. Nor should they surprise a calibrated Bayesian who views prior constraints as most important when there are concerns about the breakdown of ordinary asymptotics.
I would further caution against generalizing observations under proportional-hazards models to comparisons based on other models. For example, in the accelerated failure-time models used in g-estimation (e.g., in adjusting for nonadherence), accounting for covariates and censoring is much more complex (requiring inversion of probability models) and ordinary asymptotics may not be as good a guide to finite-sample behavior as they are in proportional-hazards regression (where the model creates a huge simplification via likelihood factorization).
In case there’s any chance that Dr. Van Zwet’s article might mention “responders,” I thought it could be useful to highlight the type of evidence that’s needed to actually label a patient a “responder.”
If we call someone a “responder,” we are implying that their therapy has “caused” them to experience a certain effect of interest. In other words, this label requires demonstration of a causal effect of therapy at the level of an individual patient. The purpose of this post is to show that the type of evidence needed to call a patient a “responder” will depend on 1) the UNtreated natural history/trajectory of the disease in question; and 2) the expected impact of the therapy being tested.
The solid lines in the five clinical scenarios below show typical UNtreated disease trajectories for various diseases. We’ll imagine what might happen if we were to apply a therapy with intrinsic efficacy/biological activity. The arrow shows the time at which the therapy is applied. The highlighted dotted line after each arrow shows the potential effect of the therapy (with regard to sign/symptom severity/disease activity). The solid line extending past the arrow shows the patient’s expected trajectory if the therapy hadn’t been applied. Examples of conditions that conform to each trajectory are provided, with therapeutic options noted in brackets.
Scenario 1:
“Waxing and waning” disease course- e.g., asthma (inhalers), chronic pain (analgesics), depression (antidepressants), mild to moderate autoimmune disease (e.g., IBD, RA)
Requirement for demonstrating causality at the level of an individual patient: EITHER RCT with multiple crossover periods OR N-of-1 trial is needed. Observing a single period of therapy exposure is INSUFFICIENT to infer therapeutic efficacy for the particular patient in question.
Rationale: With a waxing/waning natural history, the causal effect of the intervention, in a specific patient, can only be disentangled from spontaneous improvement/natural fluctuation by observing REPLICATIONS of the effect via therapy dechallenge then rechallenge. Dechallenge/rechallenge helps to isolate the causal effect of the intervention from other (often unknown) factors that can lead to natural fluctuation in disease course.
Scenario 2:
Temporary “slowing” of disease progression, underlying disease is relentlessly progressive- e.g., Alzheimer’s disease (cholinesterase inhibitors)/other neurodegenerative diseases
Requirement for demonstrating causality at the level of an individual patient: Very challenging
Rationale: Therapy doesn’t generally cause net improvement in patient’s clinical state, but rather might temporarily slow the rate of deterioration for some subset of patients. Identifying “responders” hinges on valid and highly granular mapping of disease trajectory before and during treatment but is confounded by heterogeneous rates of disease progression/undulating deterioration. Scores on cognitive/functional tests for some diseases can vary from day to day or hour to hour for unclear reasons.
Scenario 3:
Rapid improvement in signs/symptoms very soon after therapy is started, for a disease that has been highly symptomatic for a prolonged period of time- e.g., steroid-dependent autoimmune diseases (asthma/RA/IBD/psoriasis) (biologics)
Requirement for demonstrating causality at the level of an individual patient: Causal effect of therapy is easy to identify clinically for individual patients- crossover/N-of-1 design not necessarily needed
Rationale: Abrupt improvement in the signs/symptoms of a disease that has been highly symptomatic for many years provides clinically compelling evidence that the therapy has caused the patient’s improvement. Therapies that cause such rapid/dramatic improvement are ones that tend to be highly efficacious.
Scenario 4:
Reduction in disease burden (e.g., tumour burden) soon after start of therapy, for a disease that is known, otherwise, to be relentlessly progressive- e.g., a tumour “melting away” on imaging after starting a new cancer therapy.
Requirement for demonstrating causality at the level of an individual patient: Causal effect of therapy might be easy to identify clinically for individual patients (“objective response”), provided techniques for measuring disease burden serially are granular/reliable.
Rationale: Tumour burden is NOT expected to improve “spontaneously.” Therefore, reduction in tumour burden in a patient after application of a therapy indicates that the effect was caused by the therapy.
Scenario 5:
Rapid resolution of signs/symptoms soon after the start of a therapy for an acute, highly symptomatic medical condition- such therapies are considered “curative.” e.g., epinephrine for anaphylaxis, primary PCI for STEMI, naloxone for opioid overdose.
Requirement for demonstrating causality at the level of an individual patient: Causal effect of therapy is easy to identify clinically for individual patients.
Rationale: UNtreated clinical course for many acute conditions is often stereotypical/well-known. Clinically rapid reversal of such conditions is not expected to occur spontaneously. Therefore, abrupt reversal after applying the therapy provides clinically compelling evidence that the therapy caused the reversal/cure.
@ESMD I think this very thoughtful latest reply deserves its own thread for future referencing purposes, if @f2harrell agrees.
Done- thanks for the suggestion.
Really great explanation!
Sander, from simulations I ran eons ago, and papers I’ve read, I have seen trivial losses of efficiency from using semiparametric models. For example if you have two exponential distributions and optimally parametrically estimate the hazard ratio, the standard error of its log is extremely close to that from the Cox model. The only example I remember seeing where there was a non-tiny difference in efficiency is precision in \hat{S}(t|X) for Cox vs. a parametric S(t) where the choice of the parametric model was correct.
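A quick sketch of the kind of comparison I mean (freshly simulated here, not the original runs; two-arm exponential survival data without censoring, where the two standard errors should agree closely):

```python
# Compare SE of the log hazard ratio: closed-form exponential MLE vs. a Cox fit
# (statsmodels' PHReg is used for the semiparametric model).
import numpy as np
from statsmodels.duration.hazard_regression import PHReg

rng = np.random.default_rng(7)
n = 200                                     # patients per arm
arm = np.repeat([0, 1], n)
lam = np.where(arm == 1, 1.5, 1.0)          # true hazard ratio 1.5
t = rng.exponential(1 / lam)
status = np.ones_like(t)                    # no censoring in this sketch

# Exponential MLE: hazard = events / person-time per arm; SE(log HR) = sqrt(1/d0 + 1/d1)
d0 = d1 = n
log_hr_exp = np.log((d1 / t[arm == 1].sum()) / (d0 / t[arm == 0].sum()))
se_exp = np.sqrt(1 / d0 + 1 / d1)

# Semiparametric Cox fit
cox = PHReg(t, arm.reshape(-1, 1), status=status).fit()

print(f"exponential: log HR = {log_hr_exp:.3f}, SE = {se_exp:.3f}")
print(f"Cox:         log HR = {cox.params[0]:.3f}, SE = {cox.bse[0]:.3f}")
```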