Dichotomization

What I was alluding to was something akin to your Scenario #4 here: a relentlessly progressive cancer whose natural history is well known and in which tumor shrinkage is unheard of.

In such cases, suppose a substantial minority of patients in the treatment arm, say a third, have their tumors shrink greatly. No shrinkage occurs in the other two thirds. In the control arm, no shrinkage occurs in anyone.

In these cases, it is probably reasonable to speak of that third of the treatment arm as "responders", in the sense that they exhibited an otherwise unheard-of course after their treatment. It is also probably fair to say that the treatment caused a third of the patients in the treatment arm to respond, where causality is implied by the fact that nothing else is known to cause such shrinkage.

Now, in said trial, we may not necessarily know why these patients exhibited this incredible course after their treatment. We therefore wouldn't be able to pre-specify a subgroup analysis of any sort (as we would not know which patient characteristics could possibly have interacted with the treatment to cause this unusual response). Nevertheless, we would probably be justified in thinking that there is some hitherto undiscovered feature (or features) in that third of patients that caused the treatment to elicit such an unusually dramatic shrinkage of the tumor burden.

Again, this is completely different from heart failure, where 5-point changes are well within the range of what one would expect from the disease's natural history. In such cases I agree with you that comparing the proportions of "responders" is not at all advisable.

Pavlos is a lot more qualified than I am to comment on your post, but I'm inclined to agree with this statement. In this particular context, the "change from baseline" (i.e., reduction in tumour burden) can validly be attributed, clinically, to the treatment at the level of the individual patient, since such a change would not be expected to occur in the absence of intervention. But, as Pavlos noted in post #13 above, such individual-level and within-arm assessments aren't the main goal of RCTs, even in oncology, where individual causality is sometimes assessable:

Outside the context of diseases with natural histories compatible with scenarios #3-5 in post #28 above, I maintain that continuing use of the terms “response” and “responder” will cause mass causal confusion and perpetuate many bad statistical practices by inexpert clinicians.

2 Likes

Yes, agreed. I think the word “responder” carries an inherently causative undertone (and, in the context of a trial aiming to assess a given treatment, this “causative” effect will invariably be attributed to the treatment in question without due consideration).

2 Likes

One way of understanding disease is that it is due to failure of a control mechanism (usually many of them). These failures can be upward or downward displacements of metabolic variables (e.g. blood sugar) or physiological variables (e.g. BP), where the corrective feedback is known as homeostasis. Damaged tissue (e.g. a wound) is corrected by feedback re-growth, and unwanted tissue growth (e.g. infection or tumour) is removed by the immune system, which is itself a feedback process. Mostly, these things happen imperceptibly. In disease states, however, the 'elasticity' of a feedback mechanism is reduced or overcome, leading to failure of homeostasis or tissue repair, with unpleasant outcomes, or death if the essential feedback mechanism fails completely and cannot be substituted. There are also social feedback mechanisms.

[Figure 1: sigmoid curves modelling these feedback processes and their response to treatment (image not reproduced)]

Figure 1 shows sigmoid curves that model these processes and their response to treatment. For example, if there is a small tumour with a low malignancy score (e.g. a fictitious score below 25), the feedback removal of abnormal tissue will be effective and the probability of death within some specified time will be very low. A treatment (with, say, an odds ratio of 0.2) will therefore create very little absolute risk reduction. If the tumour is bulky with a high malignancy score, then the feedback removal will be weak and the probability of death may be 0.99 without treatment; at this level, the odds ratio of 0.2 will also have very little effect on lowering absolute risk. However, there may be a 'sweet' mid-range where the feedback mechanism can be helped, so that the treatment odds ratio has a useful effect on risk reduction (e.g. at a score of 40, the risk of death is reduced from 0.7 to 0.3, an absolute risk reduction of 0.4).
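To make the arithmetic behind this concrete, here is a minimal sketch (in Python; the baseline risks are invented for illustration and are not taken from Figure 1) that applies a constant odds ratio of 0.2 across a range of baseline risks:

```python
# Sketch: effect of a constant odds ratio (0.2) applied across baseline risks.
# The baseline risks below are illustrative only, not taken from Figure 1.

def treated_risk(p0, odds_ratio=0.2):
    """Risk on treatment implied by a baseline risk and a constant odds ratio."""
    odds0 = p0 / (1 - p0)
    odds1 = odds_ratio * odds0
    return odds1 / (1 + odds1)

for p0 in [0.05, 0.30, 0.70, 0.95, 0.99]:
    p1 = treated_risk(p0)
    print(f"baseline risk {p0:.2f} -> treated risk {p1:.2f}, ARR {p0 - p1:.2f}")
```

With these invented numbers the absolute risk reduction is small at both extremes and largest in the mid-range (a baseline risk of 0.70 falls to about 0.32, close to the 0.7 to 0.3 example above).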

Patients in the very high and very low malignancy score ranges will be poor responders on average, and those in the sweet mid-range will be good responders on average. These various ranges represent treatment heterogeneity; I suspect that disease severity is by far the most important cause of treatment heterogeneity in medicine. It will not be possible to identify the 'individual responders' or non-responders with certainty unless someone had a probability of 1 of dying on placebo (estimated from an RCT) and survives on treatment (as pointed out by Erin). Thus if, in an RCT, 0% survive on placebo and 50% survive on treatment, then survival will have been caused by treatment with a probability of 1. However, if 1% survive on placebo, what is the probability that someone's survival will have been caused by the treatment?

2 Likes

I will propose an answer to this question. If in some RCT 0% survive on placebo and 60% survive on treatment, then 60 - 0 = 60% will have been caused to survive by treatment. Therefore, of the 60% in the trial surviving on treatment, all 60 percentage points were due to treatment, and the probability that the survival of a patient on treatment in front of us was caused by the treatment is 60/60 = 1.0, as Erin suggests.

If in some other RCT 30% survive on placebo and 60% survive on treatment, then 60 - 30 = 30% will have been caused to survive by treatment. Therefore, of the 60% surviving on treatment, 30 percentage points were due to treatment, and the probability that the survival of a patient on treatment in front of us was caused by the treatment is 30/60 = 0.5.

If in some other RCT 1% survive on placebo and 60% survive on treatment, then 60 - 1 = 59% will have been caused to survive by treatment. Therefore, of the 60% surviving on treatment, 59 percentage points were due to treatment, and the probability that the survival of a patient on treatment in front of us was caused by the treatment is 59/60 = 0.98.
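The same arithmetic as a minimal sketch (the explicit monotonicity caveat is my addition; the calculation implicitly assumes the treatment never causes the death of someone who would have survived on placebo):

```python
# Sketch: probability that a surviving treated patient's survival was caused by
# treatment, computed as (S_t - S_c) / S_t, where S_t and S_c are the survival
# proportions on treatment and on placebo. Assumes treatment never kills anyone
# who would have survived on placebo (monotonicity).

def prob_caused_by_treatment(surv_placebo, surv_treated):
    return (surv_treated - surv_placebo) / surv_treated

for s_c, s_t in [(0.00, 0.60), (0.30, 0.60), (0.01, 0.60)]:
    p = prob_caused_by_treatment(s_c, s_t)
    print(f"placebo survival {s_c:.0%}, treated survival {s_t:.0%} "
          f"-> P(caused by treatment) = {p:.2f}")
```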

It is only when there is 0% survival on placebo that we can be sure that a surviving treated patient is a responder. If survival on placebo is >0%, then all we can do is estimate the probability that the patient was a responder. What do you think?

1 Like

I think this is just the relative risk reduction (1 - RR), right?

In the first paragraph, the risk ratio for death is 0.4/1 = 0.4, so the relative risk reduction is 100(1 - 0.4) = 60%. The probability that treatment was the cause of survival when the patient survived is 0.6/0.6 = 1.

In the second paragraph, the risk ratio for death is 0.4/0.7 = 0.57 and the relative risk reduction is 100(1 - 0.57) = 43%. The probability that treatment was the cause of survival when the patient survived is 0.3/0.6 = 0.5.

In the third paragraph, the risk ratio for death is 0.4/0.99 = 0.404 and the relative risk reduction is 100(1 - 0.404) = 59.6%. The probability that treatment was the cause of survival when the patient survived is 0.59/0.6 = 0.983.

In other words, the RRR is the absolute risk reduction in an adverse event due to treatment divided by the risk of that event without treatment, whereas the probability of treatment being the cause of avoiding the adverse event, in someone who has avoided it, is the absolute risk reduction divided by the probability of avoiding the adverse event on treatment.
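One way to write the two quantities side by side, using $p_c$ and $p_t$ for the risk of the adverse event without and with treatment (notation mine):

$$
\text{RRR} = \frac{p_c - p_t}{p_c},
\qquad
P(\text{treatment caused avoidance} \mid \text{event avoided on treatment}) = \frac{p_c - p_t}{1 - p_t}.
$$

Both share the numerator (the absolute risk reduction); they differ only in the denominator, which is the risk without treatment for the RRR and the probability of avoiding the event on treatment for the causal probability.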

1 Like

Sorry, I mistook survival for death and flipped it when doing the calculations. Thanks for clarifying!

2 Likes

This is a wonderful idea, and I support efforts to curb needless information reduction.

I have been thinking about why dichotomization occurs, not necessarily from a trialist perspective, but from a clinical perspective, which may then influence trials. Possibly, if we understand why dichotomization occurs, we can then better curb it.

Possibly others have been getting at this too, including post #24, which suggests there might be some difficulty in thinking about continuous random variables in a clinical setting, and post #10, which gets at it more from a trialist perspective but still reflects the fact that part of the issue might be the complexities that suddenly appear when one starts thinking about continuous random variables. In fact, it may be rational in some ways to reduce this complexity (time constraints, volume overload, etc.).

I was going to suggest a few possible explanations, building on these, if it would be relevant to this discussion. As it stands, I think it's important to have a paper that rigorously shows the informational consequences of dichotomization, so perhaps thinking more about the etiology is outside the scope. That said, I think dichotomization is ultimately not just a matter of statistical optimality but of healthcare culture and healthcare politics; it is perhaps almost an adaptation to those things, hence the difficulty in curbing it.

5 Likes

This is great - thank you. Are you OK with me sharing this with others including on bluesky?

Old topic, but I thought I would share that this discussion got me thinking about recurrent frustrations I have had with clinicians insisting that continuous outcomes, rather than predictors, are categorised or dichotomised. I thought a good way to address it would be to demonstrate explicitly, using simulation, how throwing that information away has an opportunity cost, not least because people then have to waste precious time reconstructing their priorities from their gestalt.

Paper for those interested: https://authors.elsevier.com/a/1m1D83BcJQMFcZ

5 Likes

Your paper applies directly to this one in my field, where an algorithm developed from wearable sensors was used to create 6 "risk" categories for fatal musculoskeletal injuries (FMI). Thank you for so much valuable information. The paper: "Thoroughbreds deemed to be most at risk by inertial measurement unit sensors suffered a fatal musculoskeletal injury at a higher rate than other racehorses", Journal of the American Veterinary Medical Association, ahead of print.

2 Likes

Thanks! It is always a pleasure to know not only that one’s work is appreciated but also that it is applicable to an important problem.

The paper is now out! https://onlinelibrary.wiley.com/doi/10.1002/sim.70402

8 Likes

Congratulations on the article. The app you developed to show the effects of dichotomization on sample size should help many researchers.

Arguably, the only way to eradicate entrenched bad practices is to trace them back to their origins and then pull them out by the root. The “Context” section of your paper describes how and why responder analysis became a mainstream practice. I looked up some of the papers you cited (e.g., Kieser). It seems like the people who first proposed responder analysis were trying to satisfy regulators’ demands for an analysis that could help them to gauge the clinical relevance of the effects shown by new drugs in RCTs.

Kieser notes (boldface is mine):

“A number of regulatory guidelines propose that clinical relevance should be assessed by considering the rate of responders, that is, the proportion of patients who are observed to achieve an apparently meaningful benefit…”

This paper, found through my own research on the history of responder analysis, represented early advocacy for the technique (which was closely linked to promotion of the concept of "number needed to treat"):

https://pmc.ncbi.nlm.nih.gov/articles/PMC1112685/#B13

Discussion of a fictional scenario might serve to highlight a key issue.

Hypothetical scenario:

An RCT shows a mean between-arm difference of 6 points, where the outcome is a continuous variable measured on a 100-point scale (the higher the score, the better the patient’s clinical state). The mean baseline-adjusted final score for those in the new drug arm was 6 points better than the final score in the other arm. The trial was considered positive because the sponsor and regulator had agreed, prior to conducting the trial, that a 5-point mean between-arm difference would be considered “clinically meaningful.”

The researchers then performed an additional analysis, during which they “drilled down” into each arm of the trial, plotting how the score of each patient had changed from the beginning of the trial to the end. They found that the scores of 50% of patients in the new drug arm had changed by only 1 point, 2 points, 3 points, or 4 points over the course of the trial, while the scores of 10% of patients changed by 10 points or more following exposure to the new drug.

Key conceptual question that lies at the heart of the “responder analysis” controversy: Does this observation mean that we can infer that the drug “worked exceptionally well” in 10% of patients and “barely at all” for 50% of patients in the trial? Why or why not?

I have my own opinion on the answer to this question and on the paper linked above. I’d be interested to hear the opinions of others on this site (with rationale presented).

1 Like

@Stephen has discussed that many times. I don’t think you can conclude that. But you can predict the distribution of the final response for an individual patient conditional on her starting point and do that for both treatments.

The irony of responder analysis is that it fails at its original goal, and the mean 6-point difference is more clinically relevant than any single-number responder analysis summary. The reason it fails is exemplified by this: suppose that for a "responder" definition you find that 45% of control and 55% of treated patients have a satisfactory response. Is the difference between 45% and 55% clinically significant? You're just replacing one metric with another one, and losing power and precision all along the way.
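To illustrate the power loss concretely, here is a minimal simulation sketch (all numbers, including the 10-point "responder" cut-off, are invented for illustration and are not from any trial discussed here):

```python
# Sketch: the same simulated trials analysed two ways - on the continuous outcome
# and after dichotomizing into "responders". All numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_arm, n_sims = 100, 2000
true_effect, sd, cutoff = 6.0, 15.0, 10.0   # hypothetical effect, SD, "responder" cut-off

hits_continuous = hits_responder = 0
for _ in range(n_sims):
    control = rng.normal(0.0, sd, n_per_arm)
    treated = rng.normal(true_effect, sd, n_per_arm)

    # Continuous analysis: two-sample t-test on the raw scores
    if stats.ttest_ind(treated, control).pvalue < 0.05:
        hits_continuous += 1

    # "Responder" analysis: compare the proportions above the cut-off
    table = [[(treated >= cutoff).sum(), (treated < cutoff).sum()],
             [(control >= cutoff).sum(), (control < cutoff).sum()]]
    chi2, p, _, _ = stats.chi2_contingency(table)
    if p < 0.05:
        hits_responder += 1

print(f"power, continuous analysis: {hits_continuous / n_sims:.2f}")
print(f"power, responder analysis:  {hits_responder / n_sims:.2f}")
```

With these invented numbers the responder comparison has noticeably lower power than the continuous analysis; the exact gap depends on where the cut-off sits relative to the outcome distributions.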

4 Likes

“I don’t think you can conclude that…The irony of responder analysis is that it fails at its original goal.”

Yes - I've read Dr. Senn's articles and know he's been screaming into the void about all this for many years:

The article linked in post #75 above seems, conceptually, horribly muddled to me, yet I fear that its impact might have been substantial…

Senn’s response and related publications:

https://link.springer.com/article/10.1177/009286150303700103#citeas

https://pmc.ncbi.nlm.nih.gov/articles/PMC524113/

https://errorstatistics.com/wp-content/uploads/2016/07/senn-2003-pharmaceutical_statistics.pdf

If you were to re-run the hypothetical trial described in post #75, you might obtain the same between-arm difference of 6 points. But this time, drilling down to see what happened to each individual patient's score, you might see a completely different distribution of point-score changes over the course of the trial. If you were to conclude, as a believer in "responder analysis" and based on your first trial, that "10% of patients exposed to this drug will respond exceptionally well," how would you react when you repeat the trial, obtain the same between-arm difference of 6 points, but this time observe that nearly all patients' scores changed by the same number of points over the course of the trial? If you had run the second trial before the first trial, you would NOT have concluded that the drug "worked exceptionally well" in 10% of patients, but rather that all patients "respond similarly." This simple example illustrates the folly of the responder-analysis approach and the importance of acknowledging the stochasticity in patients' ostensible "response" to treatment from one treatment episode to another.
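To make the stochasticity point concrete, here is a minimal simulation sketch (all numbers invented; it is not taken from the trial described in post #75 or from any of the linked papers) in which every patient receives exactly the same true benefit and all the apparent "responder" variation comes from noise:

```python
# Sketch: observed "change from baseline" when the true drug effect is identical
# for every patient; all spread comes from measurement error and natural fluctuation.
import numpy as np

rng = np.random.default_rng(1)
n, true_effect, noise_sd = 200, 6.0, 8.0   # invented values

for replicate in (1, 2):
    observed_change = true_effect + rng.normal(0.0, noise_sd, n)
    small = np.mean(np.abs(observed_change) < 5)   # "barely changed"
    large = np.mean(observed_change >= 10)         # "changed by 10+ points"
    print(f"replicate {replicate}: {small:.0%} changed < 5 points, "
          f"{large:.0%} changed >= 10 points")
```

Even though every simulated patient benefits identically, a naive responder analysis would label some as dramatic "responders" and others as "non-responders", and the proportions drift from one replicate to the next.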

Most importantly, the fact that a patient's score changed over the course of the trial does NOT allow us to infer that the treatment he received caused that change, EVEN IF the treatment is one with established group-level efficacy. It's valid to infer that the new drug "caused" the between-group/arm difference of 6 points (i.e., that the new drug caused one group's score to wind up 6 points different from the other group's score; we can say that the drug has meaningful intrinsic efficacy). But it's NOT valid to "translate" that established group-level inference of efficacy to the level of individual patients enrolled in the trial (for the purpose of labelling them as "responders" or "non-responders"). For diseases with waxing/waning natural histories, replication (otherwise known as "crossover" or "positive dechallenge and/or rechallenge") at the level of the individual is needed to establish causality at the level of individual patients. And since dechallenge/rechallenge/crossover is NOT a feature of most parallel-group RCT designs, most trials do NOT allow us to make inferences of causality at the level of individual patients. Unless this erroneous, highly pernicious, and deeply entrenched conflation of group-level and individual-level causality is acknowledged and loudly criticized by statisticians, "responder analysis" will persist, and so will the practice that serves it: dichotomization of continuous endpoints.

5 Likes

Beautifully stated. This motivates me to add that an RCT estimates patient tendencies, and unless it's a 6-period randomized crossover (or similar) design, it cannot estimate effects of treatments on individual patients. The best we can do is to covariate-adjust down to the level of detail that current measurements allow, so that we can best predict what the treatment effect will be in certain types of patients in total (group-level estimates for very small groups, e.g., 73-year-old males with stage 2 disease at baseline).

3 Likes