Responder Analysis: Loser x 4

Spectacular piece. So much content and so clear.

Please consider adding a bit on using post-randomization responder analysis to identify groups in which to continue treatment, as Ridker has proposed in CANTOS.

Here is the paper describing the results on which the idea of only continuing treatment in those with early drops in hsCRP is based: https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)32814-3/fulltext

There are special cases where we accept responder analysis (troponin, d-dimer, etc.).

Are there criteria we can use to decide when this approach may be reasonable?

Thanks, wonderfully educational write-up!

1 Like

Venk, would you provide the link?

In neither the troponin nor the d-dimer case was a definition of ‘responder’ given or a threshold justified. If you know of a paper that makes such a claim, please pass it along. I’m going to edit the original post to add the conditions necessary for a threshold to be used for ‘responder’.

2 Likes

Apologies, I meant it along the lines that MI/not-MI is an acceptable outcome measure, as opposed to troponin values. In contrast, diabetes/not-diabetes is not an acceptable outcome measure; we would want A1c/glucose values.

1 Like

Oh - good question. MI is accepted only because it is not questioned. The current definitions of MI are circular, i.e., they use troponin, and then researchers unsurprisingly show that troponin has high diagnostic value. But a bigger problem is that the size of the infarct, e.g., the area of myocardium at risk, is all-important. There are small MIs, large MIs, and everything in between. Not only is calling it just MI vs. no MI less helpful to the patient or physician, it is also holding back clinical research. Small MIs are more random, larger MIs more predictable, and some treatments prevent larger MIs more than they prevent smaller ones. We are in desperate need of an MI severity measure to use in clinical trials. We have to quit wondering why cardiovascular trials require 6,000-10,000 patients.

If every patient could get an LVEF measured on the day of clinical symptoms of MI we would make progress. But there should be a way to grade MIs on the basis of troponin, CKMB, and EKG. Work we did at Duke in 1988 gives one example of the style of analysis I’m alluding to.

3 Likes

Of course! Should have included it right off. I’ve modified my post above with the link.

Thanks for the link. The CANTOS analysis is physiologically irrelevant, arbitrary, and fails to provide the information we need. The authors confused the sample distribution of the marker with the shape of the marker’s effect on the clinical outcome. The threshold of 2.0 was not validated. Information was lost, and improper conditioning was used. Other than that it’s OK :smile:.

CANTOS is crying out for a landmark analysis in which you enter into an outcome model both the baseline and the landmark levels of the marker. In similar situations it is rare that the initial baseline value is very powerful in predicting outcomes, and likely that the most recent value (at the landmark point) is the most powerful. What the reader needs to see is a 3-D plot of the two marker levels vs. relative log hazard of the outcome. If you used spline functions, even put a knot at 2.0, and saw any evidence of a discontinuity at that point, I’d faint.
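A minimal sketch of that kind of landmark model, on simulated data with hypothetical variable names (hscrp0, hscrp_lm, time, event), not the CANTOS data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
hscrp0 = rng.lognormal(1.0, 0.6, n)               # baseline marker level
hscrp_lm = hscrp0 * rng.lognormal(-0.2, 0.4, n)   # landmark (e.g., 3-month) level
hazard = 0.02 * np.exp(0.15 * np.log(hscrp_lm))   # outcome driven smoothly by the landmark level
time = rng.exponential(1.0 / hazard)
event = (time < 5).astype(int)                    # administrative censoring at 5 years
time = np.minimum(time, 5)
df = pd.DataFrame(dict(time=time, event=event, hscrp0=hscrp0, hscrp_lm=hscrp_lm))

# Cox model with flexible B-spline terms (patsy bs()) for both marker levels;
# the fitted surface of log relative hazard over (hscrp0, hscrp_lm) is what
# the reader would want to see plotted, rather than a 2.0 cut point.
mod = sm.PHReg.from_formula("time ~ bs(hscrp0, df=4) + bs(hscrp_lm, df=4)",
                            status=df["event"], data=df)
print(mod.fit().summary())
```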

To take a continuous marker and create “responder” out of it with no biological rationale for a discontinuity in nature is a very bad idea.

These analyses are analyses of internal time-dependent covariates. Such time-varying covariates are both predictors and responses to therapy. Interpretations, especially ones with a connotation of causation, are frequently very, very difficult.

1 Like

Makes sense. So is most of your concern about the dichotomania? What about the idea of subgroup selection based on post-randomization variables? Placebo subjects were also not similarly stratified, raising the possibility of regression-to-the-mean effects in one arm only.

In other words, is this sort of post-randomization “run-in” a fair way to analyze an RCT?

Whoops, forgot to comment on that all-important point. Adding to my post above now.

1 Like

I often have survival data, e.g., mortality in large admin databases, where censoring is so high it makes sense to treat the outcome as dichotomous (event/no event). Is there some threshold where you would begin to consider logistic regression the most sensible approach? Likewise with reducing recurrent events to the first event when very few patients have multiple lesions or multiple readmissions or whatever.

The amount of censoring is not what is truly important. It is the length of follow-up and the absolute number of uncensored observations that matter more. And to use logistic regression, the earliest censoring time would have to come after the latest event time; logistic regression doesn’t know how to handle variable follow-up time. An example where logistic regression may be appropriate is in-hospital mortality, or sometimes 30-day mortality where everyone alive was followed for at least 30 days.
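A tiny helper illustrating that condition, with hypothetical column names time and event; it only says a binary-outcome analysis is defensible when the earliest censoring time comes after the latest event time:

```python
import pandas as pd

def logistic_ok(df: pd.DataFrame, time="time", event="event") -> bool:
    """True if logistic regression on 'any event' is defensible:
    the earliest censoring time is no earlier than the latest event time."""
    censored = df.loc[df[event] == 0, time]
    events = df.loc[df[event] == 1, time]
    if censored.empty or events.empty:
        return True
    return censored.min() >= events.max()

# Example: 30-day mortality where every survivor was followed at least 30 days
df = pd.DataFrame({"time": [3, 12, 30, 30, 30], "event": [1, 1, 0, 0, 0]})
print(logistic_ok(df))   # True: a binary 30-day outcome is well defined here
```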

1 Like

After much personal study, I ended up asking myself that very question and came to the conclusion that nonparametrics are to be preferred if you don’t want to go the Bayesian route.
There have been Monte Carlo studies going back to the 1960s documenting the sometimes large power advantages of rank tests over their parametric counterparts in all cases, except for a small (5%) loss of efficiency under strict normality and about a 14% loss with tails thinner than a normal distribution.
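As a rough illustration of the kind of Monte Carlo comparison referred to above (an illustrative simulation, not a reproduction of those studies):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, shift, nsim, alpha = 50, 0.5, 2000, 0.05

def power(sampler):
    """Empirical power of the two-sample t-test and Wilcoxon rank-sum test
    for a location shift, under the given error distribution."""
    rej_t = rej_w = 0
    for _ in range(nsim):
        x, y = sampler(n), sampler(n) + shift
        rej_t += stats.ttest_ind(x, y).pvalue < alpha
        rej_w += stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha
    return rej_t / nsim, rej_w / nsim

print("normal errors (t, Wilcoxon):", power(lambda m: rng.standard_normal(m)))
print("t(3) errors   (t, Wilcoxon):", power(lambda m: rng.standard_t(3, m)))
```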

Shlomo Sawilowsky and many of his students have published similar results in recent times. An interesting historical POV is in the essay devoted to his colleague R. Clifford Blair, who came to similar conclusions after comparing the textbook recommendations to his own Fortran simulations (pages 5 to 15 are most relevant). Both worked in the educational assessment realm, IIRC.

The emphasis in the textbooks on normal theory suggests such a strong prior belief in the validity of parametric assumptions that those beliefs ought to be incorporated into the model formally via a Bayesian analysis. But if one does not want to go the Bayesian route, ordinal regression seems to be the way to go.

In terms of using parametric techniques on Likert scales, there is a huge amount of question-begging literature attempting to justify the practice, along with improper appeals to the central limit theorem. Proponents basically assume what needs to be proved: that the scale has a finite variance. It is truly amazing to me that the practice not only has gone on for decades, but continues.

Update: I was able to find a draft of the paper linked to – it should be required reading for anyone involved in data analysis.

2 Likes

Dear Professor Harrell,

I read your discussions with much interest, as I have recently come across this paper. What are your thoughts?

" Moving beyond threshold‐based dichotomous classification to improve the accuracy in classifying non‐responders” (DOI: https://doi.org/10.14814/phy2.13928)

Furthermore, what are your thoughts on this questionnaire?
Living with cancer: “good” days and “bad” days–what produces them? Can the McGill quality of life questionnaire distinguish between them?
(DOI: https://doi.org/10.1002/1097-0142(20001015)89:8<1854::AID-CNCR28>3.0.CO;2-C)

Thank you.

Kind regards,
Kevin

1 Like

The first paper you reference looks pretty reasonable. I haven’t studied the other stuff.

If the outcome is a mix of a change in score and a dichotomization (responder), i.e., having a certain amount of change and reaching a given score threshold, all the assumptions described in BBR ch. 14 would still apply. For example, what if the distribution of the baseline score is bimodal or multimodal? Different disease processes, measurement errors, or differential ceiling/floor effects would probably undermine the outcome.

New/revised FDA guideline on weight reduction:

Responder analyses on continuous variables (i.e., the proportion of subjects achieving prespecified thresholds of weight reduction compared with the control arm; for example, greater than or equal to 5%, 10%, or 15% of baseline body weight or BMI) should be interpreted with caution. Specifically, responder analyses may inappropriately exaggerate treatment effects (Abugov et al. 2023). For example, consider a hypothetical case in which treatment benefit exceeds risk if the mean treatment effect is 5% or greater reduction in weight, and in which the means of normally distributed percentage reductions in weight for treatment and control are 6% and 4%, respectively, each with a standard deviation of 1%. Analysis on the continuous outcomes would indicate an inadequate treatment effect equal to 2% weight reduction, while analysis based on a responder threshold of 5% would suggest a large treatment effect, with the difference in response rate between treatment and control equal to 68%. For this reason, responder analyses for the evaluation of weight reduction are generally not recommended. A sponsor intending to pursue a responder analysis as an endpoint should consult with the Agency.

This passage must be new(?), because I know this guideline has previously been used to defend and insist upon responder analyses.
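A quick check of the arithmetic in the quoted passage: the 68% figure is just the difference between two normal tail probabilities.

```python
from scipy.stats import norm

p_treat = norm.sf(5, loc=6, scale=1)     # P(reduction >= 5%) on treatment, N(6, 1)
p_control = norm.sf(5, loc=4, scale=1)   # P(reduction >= 5%) on control,   N(4, 1)
print(round(p_treat, 3), round(p_control, 3), round(p_treat - p_control, 3))
# ~0.841, ~0.159, ~0.683 -> a "68%" responder difference from a 2-point mean difference
```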

1 Like

Hopefully the irony of using % reduction as the example isn’t lost, given all the issues with that outcome. I wonder, though, dichotomania aside, isn’t this example just showing a mismatch between the utility function and the analysis? If the mean treatment effect really is the relevant scale for making treatment decisions, then of course you shouldn’t be basing anything on thresholds.

At the risk of being kicked off the island, wouldn’t a semi-parametric ordinal model just be testing the treatment effect at all thresholds simultaneously? It could certainly be reported in terms of mean differences from an efficiency standpoint, but is there anything wrong per se with being interested in increasing exceedance of some threshold, using a PO model to assess that efficiently and provide a more transportable summary treatment effect, and then summarizing differences in the numbers of patients meeting that threshold?
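A rough sketch of that idea on simulated data, with hypothetical effect sizes and arbitrary cut points: fit a proportional-odds model, report the summary odds ratio, and derive the exceedance probability P(loss >= 5%) for each arm from the same fit.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 2000
treat = rng.integers(0, 2, n)
pct_loss = rng.normal(4 + 2 * treat, 4, n)     # assumed 2-point mean effect, SD 4

# Ordered categories of % loss; cut points are illustrative only
y = pd.Series(pd.cut(pct_loss, [-np.inf, 0, 2.5, 5, 7.5, 10, np.inf]))
X = pd.DataFrame({"treat": treat})

res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
print("treatment log odds ratio:", round(res.params["treat"], 3))

# Category probabilities by arm; the last three categories are >= 5% loss
probs = np.asarray(res.predict(np.array([[0.0], [1.0]])))  # rows: control, treated
p_ge5 = probs[:, 3:].sum(axis=1)
print("P(loss >= 5%) control, treated:", np.round(p_ge5, 3))
```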

I guess from a decision-theory perspective we’d be interested in maximizing the expected value of utility, but is there a reason to assume it has to be linear in the mean difference?

3 Likes

Why not look at the whole study period, e.g., random-coefficients modelling, and look at slopes? Or a summary statistic like the max or the AUC? I think what clinicians are interested in is how long the patient will have sustained weight loss. And they like maximum weight loss too; they try to read these values off published plots of means over time (which can’t be done). Why not a time-to-event analysis then, i.e., time to cessation of weight loss? But then compliance becomes more relevant. Maybe it’s so messy that they fall back on simple linear regression, although the mean of % change seems the most difficult to translate into something meaningful.
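A minimal sketch of the random-coefficients idea, on simulated data with hypothetical variable names: a mixed model with random intercepts and slopes for weight over time, plus a per-subject trapezoidal AUC as a simple summary measure.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_subj, times = 200, np.arange(0, 13, 3)            # visits every 3 months
rows = []
for i in range(n_subj):
    treat = i % 2
    b0 = 90 + rng.normal(0, 8)                      # subject-specific baseline weight (kg)
    b1 = -0.2 - 0.15 * treat + rng.normal(0, 0.1)   # subject-specific slope (kg/month)
    for t in times:
        rows.append(dict(id=i, treat=treat, month=t,
                         weight=b0 + b1 * t + rng.normal(0, 1)))
df = pd.DataFrame(rows)

# Random-intercept, random-slope model; the treatment effect is on the slope
m = smf.mixedlm("weight ~ month * treat", df, groups=df["id"],
                re_formula="~month").fit()
print(m.summary())

# Per-subject AUC of weight over the study period, averaged by arm
auc = df.groupby("id").apply(lambda g: np.trapz(g["weight"], g["month"]))
print(auc.groupby(df.groupby("id")["treat"].first()).mean())
```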

2 Likes

% change does not have the right mathematical properties to be used in statistical analysis.

2 Likes