Responder Analysis: Loser x 4

Clinician-scientists involved in clinical trials who have inadequate statistical input often design clinical studies to use a binary responder analysis. They seem to believe that because a binary treatment decision is required (it’s not), the outcome should be binary (it doesn’t). Decisions can be deferred, doses changed, a treatment can be launched on a trial basis, etc., and blood pressure trials work quite well in mmHg rather than using the proportion of patients who achieved a certain blood pressure threshold.

Responder analysis is a 4-fold loser. It fails to give the needed clinical interpretation, has poor statistical properties, increases the cost of studies because binary outcomes require significantly higher sample sizes, and raises ethical issues because more patients are required to be randomized than would be needed were the endpoint and analysis to be fully efficient.

First consider the best (though still not good) case for dichotomizing a continuous or ordinal outcome. Suppose that there is an ordinal scale with 8 levels, scored 1-8, and that there is a clinically well accepted break between levels 2 and 3, so that patients having a score of 1 or 2 are clearly in remission of their disease and those at levels 3-8 are not in remission. Dichotomizing the scale as 3-8 vs. 1-2 could be used to estimate the proportion in remission. But if patients at level 1 are better off than those at level 2, or those at level 8 are worse off than those at level 3, there is a distinct loss of clinical information. The dichotomizing clinician-scientist asked the question “what’s the chance of achieving remission” but forgot to ask “how far out of remission were the patients?”.

As an example, a recent JAMA article by Costello et al contains the following description of the primary outcome in this ulcerative colitis trial:

The primary outcome was steroid-free remission of UC, defined as a total Mayo score of ≤2 with an endoscopic Mayo score of 1 or less at week 8. Total Mayo score ranges from 0 to 12 (0 = no disease and 12 = most severe disease).

Dealing with the clinical interpretation of the trial, would clinical investigators not want to know by how many levels the ordinal scale was typically improved by the new treatment? And now there is a new problem: a perceived issue with clinical interpretation of an ordinal scale has been replaced by a different interpretation problem. Suppose that 3/10ths of patients responded to placebo and 4/10ths responded to a new drug. Is the population difference between 0.3 and 0.4 clinically significant for an individual patient? Wouldn’t the amount of disability, pain, quality of life, exercise time, etc. that the drug improved be of more interest and actually be easier to interpret regarding patient benefit?

Consider a different example where the original outcome scale was carefully derived after choosing a series of 20 appropriate questions about symptoms and function, scoring each from 0-4, and using appropriate weights to create a 0-80 scale. Such a scale can have rich statistical information, maximizing power. But if a clinician-scientist dichotomized the scale, she is saying in effect that “This scale stinks. I don’t trust it.” Why go to all the trouble to ask 20 questions and create and validate the scoring system only to do this?

Section 18.3.4 of BBR details the statistical power loss from dichotomizing a continuous or ordinal outcome. A key paper referenced there is Fedorov, Mannino, and Zhang who showed that if you dichotomize a continuous variable at the optimum point (the true population median), the needed sample size for the study might be 158 patients for the dichotomized endpoint but only 100 for the original continuous variable. And that is the best case for dichotomization. Were the scale dichotomized at 2 standard deviations above the true mean, the sample size would need to be increased by a factor greater than 5!
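
The sample size arithmetic above can be sketched with the standard large-sample approximation for a small shift in a normally distributed outcome. This is an illustration under that assumption, not necessarily the exact calculation used by Fedorov, Mannino, and Zhang; the function names are made up for this sketch.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dichotomization_efficiency(cut_sd):
    """Asymptotic efficiency of comparing proportions beyond a cutpoint
    relative to comparing means, for a small shift in a normal outcome.
    cut_sd is the cutpoint in SD units from the true mean (assumption:
    equal variances, two equal-sized arms)."""
    p = normal_cdf(cut_sd)
    return normal_pdf(cut_sd) ** 2 / (p * (1.0 - p))

def required_n(n_continuous, cut_sd):
    """Sample size needed after dichotomizing to match the power obtained
    with n_continuous patients on the original scale."""
    return math.ceil(n_continuous / dichotomization_efficiency(cut_sd))

for cut in (0.0, 1.0, 2.0):
    eff = dichotomization_efficiency(cut)
    print(f"cut at {cut:.0f} SD: efficiency {eff:.3f}, "
          f"n needed per 100 continuous: {required_n(100, cut)}")
```

Under this approximation a cut at the median has efficiency 2/π, which reproduces the 158 vs. 100 comparison, and a cut 2 standard deviations from the mean inflates the sample size by a factor well above 5.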

What is even worse than simply dichotomizing an ordinal or continuous scale? Using change from baseline and/or combining absolute change with relative change and not understanding the math. The BBR section referenced above includes a classic example from Stephen Senn reproduced below.

[Figure: the example from Stephen Senn, reproduced from the BBR section referenced above]

By analyzing the situation and plotting the probability of response as a function of the baseline value, one can see that what makes a patient a “responder” is largely the patient himself rather than the treatment. Clinicians and clinician-scientists can avoid much of this problem by remembering the simple fact that what matters most to the patient, and what is easiest to model statistically, is the post-treatment status. Baselines should be accounted for (to increase power) using covariate adjustment.

A more recent example of this latter sort of dichotomania combined with contortion is the paper by Rutgeerts et al which uses this primary outcome variable:

A response was defined as a decrease in the Mayo score of at least 3 points and at least 30 percent, with an accompanying decrease in the subscore for rectal bleeding of at least 1 point or an absolute rectal-bleeding subscore of 0 or 1.

It is true that special considerations are needed for endpoints that combine multiple components, and this can probably be done with an ordinal scale. But note that it is very difficult to equate a certain change in one variable with a certain change in another variable. It is far more justifiable to combine absolutes and covariate adjust for baseline values. The bad math identified by Senn is in place in this NEJM article also, and the Mayo score is not one that works by percent change.
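
To see how the combined absolute and percent rule plays out, here is a small sketch that tabulates, for each baseline total Mayo score, the smallest whole-point decrease that satisfies both “at least 3 points” and “at least 30 percent”. It deliberately ignores the rectal-bleeding component of the definition, assumes whole-point scores on the 0 to 12 scale, and uses a hypothetical function name.

```python
MAYO_MAX = 12  # total Mayo score ranges from 0 to 12

def min_points_to_respond(baseline):
    """Smallest whole-point decrease meeting both 'at least 3 points' and
    'at least 30 percent' of baseline, or None if no decrease can qualify.
    The rectal-bleeding subscore condition is ignored in this sketch."""
    pct_rule = -(-3 * baseline // 10)  # ceil(0.30 * baseline) using integers
    needed = max(3, pct_rule)
    return needed if needed <= baseline else None

print("baseline  points needed  highest qualifying post-treatment score")
for baseline in range(MAYO_MAX + 1):
    needed = min_points_to_respond(baseline)
    if needed is None:
        print(f"{baseline:8d}  cannot qualify")
    else:
        print(f"{baseline:8d}  {needed:13d}  {baseline - needed:5d}")
```

Under this reading the percent rule changes the requirement only at baselines of 11 and 12, and the post-treatment status that counts as a “response” depends heavily on where the patient started: a patient starting at 3 must reach 0, while a patient starting at 10 qualifies at 7.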

As discussed above, dichotomization to create a “responder” has profound sample size implications. This means that the study budget must be increased to pay for the “dumbing down” of the endpoint. That’s where the ethical problem comes in. Section 18.7 of BBR is a detailed case study of a treatment for Crohn’s disease that used the Crohn’s disease activity index (CDAI). In the case study I reanalyzed the data in the original CDAI article and demonstrated that there is no possible cutoff in CDAI, and were a cutoff to exist, it would certainly not be the one used in the clinical trial being discussed (150). See the graph below.

[Figure: graph from the CDAI case study in Section 18.7 of BBR]

We convened an ethics-biostatistics clinic in which three of our esteemed biomedical ethicists participated. These ethicists concluded that the proposed new trial was not ethical, for these reasons:

  1. Patients are not consenting to be put at risk for a trial that doesn’t yield valid results.
  2. A rigorous scientific approach is necessary in order to allow enrollment of individuals as subjects in research.
  3. Investigators are obligated to reduce the number of subjects exposed to harm and the amount of harm to which each subject is exposed.

As a final note, once an optimal covariate-adjusted analysis of an ordinal or continuous endpoint, using the outcome scale in its original form, is completed, the statistical model can be used to translate the result to any clinical scale desired. For example, from either a linear model or an ordinal response model one can easily compute the probability that a patient will achieve an outcome below any given threshold. So there is no reason to dichotomize the outcome variable for the primary endpoint.
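
As a rough sketch of that translation step, suppose a linear model with normally distributed residuals has been fitted to the post-treatment score, adjusted for baseline. All coefficients below are hypothetical, chosen only to illustrate the arithmetic, and the function names are invented for this example.

```python
import math

def prob_below(threshold, predicted_mean, residual_sd):
    """P(outcome < threshold) for a patient whose outcome is modeled as
    normal with the given predicted mean and residual SD."""
    z = (threshold - predicted_mean) / residual_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical fitted coefficients (not from any real trial):
# post = INTERCEPT + BASELINE_SLOPE * baseline + TREATMENT_EFFECT * treated
INTERCEPT = 40.0
BASELINE_SLOPE = 0.6
TREATMENT_EFFECT = -30.0
RESIDUAL_SD = 60.0

def predicted_outcome(baseline, treated):
    """Predicted post-treatment score from the hypothetical linear model."""
    return INTERCEPT + BASELINE_SLOPE * baseline + (TREATMENT_EFFECT if treated else 0.0)

# The same fitted model answers the question for any threshold of interest.
for threshold in (100, 150, 200):
    p_control = prob_below(threshold, predicted_outcome(300, False), RESIDUAL_SD)
    p_treated = prob_below(threshold, predicted_outcome(300, True), RESIDUAL_SD)
    print(f"P(score < {threshold}): control {p_control:.3f}, treated {p_treated:.3f}")
```

The point of the sketch is that the threshold enters only at the reporting stage. The analysis itself uses the full scale, and an ordinal response model would supply the same kind of probabilities without the normality assumption.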

Addendum

Under what conditions is a designation of responder or non-responder useful clinically? Clinicians will certainly have additional thoughts but here’s a start. If all of these conditions are satisfied, a binary characterization may be useful clinically:

  • The definition of responder uses only current patient status, or it deals with change from a specific time point and it can be verified that the change score means the same thing for each patient, e.g., that change from baseline is independent of baseline and there are no floor or ceiling effects in the score or measurement.
  • Patients designated as responders are indistinguishable from each other in terms of the quality of their clinical outcome.
  • Patients designated as non-responders are indistinguishable from each other in terms of the quality of their clinical outcome.

Of course if the clinical measure originated as a truly “all or nothing” assessment with no other information available, that measure may be used as the responder definition. But for continuous or ordinal measurements, responder dichotomizations should in my view be replaced by matters of degree, e.g., the original scale.

What is an example of the most useful prognostication to convey to a patient? “Patients such as yourself who are at disability level 5 on our 10-point scale tend to be at disability level 2 after physical rehabilitation. Here are the likelihoods of all levels of disability for patients starting at level 5: (show a histogram with 10 bars).”

What are the criteria for a responder definition to have good statistical properties? The conditions are the same as the clinical conditions. If a continuous or ordinal measure is dichotomized to create “responders”, it is necessary that the responders be homogeneous in their outcomes and likewise for the non-responders. For communication purposes it is also necessary that the cutpoint used to create the dichotomization be the same one used in previous studies.

Is it a major breach to treat an ordinal scale with inconsistent intervals as a continuous outcome? In my field there is a five point ordinal scale that is used to measure a subjective clinical outcome and I frequently see mean outcomes compared between treatment and control arms.

Much has been written on that. It might go all the way from a minor to a major breach. Why take the chance when we have wonderful ordinal regression models that make no assumptions about inter-category spacings?

Spectacular piece. So much content and so clear.

Please consider adding a bit on using post-randomization responder analysis to identify groups to continue treating as Ridker has proposed in CANTOS?

Here is the paper describing the results on which the idea of only continuing treatment in those with early drops in hsCRP is based: https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)32814-3/fulltext

There are special cases where we accept responder analysis (troponin, d-dimer, etc.).

Are there criteria we can use to decide when this approach may be reasonable?

Thanks, wonderfully educational write up!

Venk, would you provide the link?

In neither troponin nor d-dimer was a definition of ‘responder’ or a threshold justified. If you know of a paper that makes such a claim please pass it along. I’m going to edit the original post to add the conditions necessary for a threshold to be used for ‘responder’.

Apologies, I meant along the lines that MI/not-MI is an acceptable outcome measure as opposed to troponin values. In contrast, diabetes/not-diabetes is not an acceptable outcome measure; we would want A1c/glucose values.

Oh - good question. MI is accepted only because it is not questioned. The current definitions of MI are circular, i.e., they use troponin, then researchers unsurprisingly show that troponin has high diagnostic value. But a bigger problem is that the size of the infarct, e.g., area of myocardium at risk, is all-important. There are small MIs, large MIs, and everything in between. Not only is calling it just MI vs. no MI less helpful to the patient or physician, it is holding back clinical research. Small MIs are more random, larger MIs more predictable, and some treatments prevent larger MIs more than they prevent smaller ones. We are in desperate need of a severity of MI measure to use in clinical trials. We have to quit wondering why cardiovascular trials require 6,000-10,000 patients.

If every patient could get an LVEF measured on the day of clinical symptoms of MI we would make progress. But there should be a way to grade MIs on the basis of troponin, CKMB, and EKG. Here’s work we did at Duke in 1988 that gives one example of the style of analysis I’m alluding to.

Of course! Should have included it right off. I’ve modified my post above with the link.

Thanks for the link. The CANTOS analysis is physiologically irrelevant, arbitrary, and fails to provide the information we need. The authors confused the sample distribution of the marker with the shape of effect of the marker on clinical outcome. The threshold of 2.0 was not validated. Information was lost, and improper conditioning was used. Other than that it’s OK :smile:.

CANTOS is crying out for a landmark analysis where you enter into an outcome model the baseline and the landmark levels of the marker. In similar situations it is rare that the initial baseline is very powerful in predicting outcomes, and likely that the last baseline (landmark point) is the most powerful. What the reader needs to see is a 3-D plot of the two marker levels vs. relative log hazard of outcome. If you used spline functions and even put a knot at 2.0 and saw any evidence of a discontinuity at that point, I’d faint.

To take a continuous marker and create “responder” out of it with no biological rationale for a discontinuity in nature is a very bad idea.

These analyses are analyses of internal time-dependent covariates. Such time-varying covariates are both predictors and responses to therapy. Interpretations, especially ones with a connotation of causation, are frequently very, very difficult.

Makes sense. So is most of your concern about the dichotomania? What about the idea of subgroup selection based on post-randomization variables? Placebo subjects were also not similarly stratified, raising the possibility of regression to the mean effects in one arm only.

In other words, is this sort of post-randomization “run-in” a fair way to analyze an RCT?

Whoops forgot to comment on that all-important point. Adding to my post above now.

I often have survival data, e.g., mortality in large admin databases, where censoring is so high it makes sense to treat it as dichotomous (event/no event). Is there some threshold where you would begin to consider logistic regression as the most sensible approach? Likewise with reducing recurrent events to the first event when very few patients have multiple lesions or multiple readmissions or whatever.

The amount of censoring is not what is truly important. It is the length of follow-up and the absolute number of uncensored observations that are more important. And to use logistic regression, the earliest censoring time would have to be after the latest event time. Logistic regression doesn’t know how to handle variable follow-up time. An example where logistic regression may be appropriate is in-hospital mortality or sometimes 30d mortality where everyone alive was followed at least 30d.
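
A minimal sketch of that follow-up check, using a hypothetical helper and made-up data: before coding a binary “died within the horizon” outcome, confirm that every patient who was alive at last contact was followed for at least the horizon.

```python
def binary_outcome_at_horizon(followup_days, died, horizon):
    """Code 'died within horizon' as 0/1, refusing to proceed if any patient
    alive at last contact was followed for less than the horizon."""
    outcomes = []
    for days, event in zip(followup_days, died):
        if event and days <= horizon:
            outcomes.append(1)      # death observed within the horizon
        elif days >= horizon:
            outcomes.append(0)      # known to be alive at the horizon
        else:
            # Censored before the horizon: status at the horizon is unknown,
            # so a binary outcome is not defined for this patient.
            raise ValueError(
                f"patient censored at day {days}, before the {horizon}-day horizon"
            )
    return outcomes

# Made-up follow-up times (days) and death indicators for four patients.
print(binary_outcome_at_horizon([5, 30, 45, 60], [True, False, True, False], 30))
```

If the check fails, the variable follow-up has to be handled with a time-to-event method rather than by dropping or recoding the censored patients.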

After much personal study, I ended up asking myself that very question, and came to the conclusion that nonparametrics are to be preferred, if you don’t want to go the Bayesian route.
There have been Monte Carlo studies going back to the 1960s documenting the sometimes large power advantages of rank tests over their parametric counterparts in all cases except for a small (5%) loss of efficiency under strict normality, and about a 14% loss with tails thinner than a normal distribution.

Shlomo Sawilowsky and many of his students have published similar results in recent times. An interesting historical POV is in the essay devoted to his colleague R Clifford Blair, who came to similar conclusions after comparing the textbook recommendations to his own Fortran simulations. (Pages 5 to 15 are most relevant). Both worked in the educational assessment realm, IIRC.

The emphasis in the textbooks on normal theory suggests such a strong prior belief in the validity of parametric assumptions that the assumptions should be incorporated into the model formally, using a Bayesian analysis. But if one does not want to go the Bayesian route, ordinal regression seems to be the way to go.

In terms of using parametric techniques on Likert scales, there is a huge amount of question-begging literature attempting to justify the practice, along with improper appeals to the central limit theorem. Proponents basically assume what needs to be proved – that there is a finite variance to the scale. It is truly amazing to me that the practice not only has gone on for decades, but continues.

Update: I was able to find a draft of the paper linked to – it should be required reading for anyone involved in data analysis.

Dear Professor Harrell,

I read your discussions with much interest, as I have recently come across this paper. What are your thoughts?

“Moving beyond threshold‐based dichotomous classification to improve the accuracy in classifying non‐responders” (DOI: https://doi.org/10.14814/phy2.13928)

Furthermore, what are your thoughts on this questionnaire?
Living with cancer: “good” days and “bad” days–what produces them? Can the McGill quality of life questionnaire distinguish between them?
(DOI: https://doi.org/10.1002/1097-0142(20001015)89:8<1854::AID-CNCR28>3.0.CO;2-C)

Thank you.

Kind regards,
Kevin

The first paper you reference looks pretty reasonable. I haven’t studied the other stuff.