Dichotomization

I made a shiny app to do the calculations:

  1. Before the trial: compute the required sample size, with and without dichotomizing the outcome.

  2. After the trial: compute the loss of information from dichotomizing the outcome.

I hope this is helpful. Let me know! @ESMD @Pavlos_Msaouel
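For readers who want a back-of-the-envelope version of calculation 1, here is a rough sketch (my own illustration, not the app’s code), assuming a normal outcome with SD 1, equal allocation, and the usual normal-approximation power formulas:

```python
# Rough sketch of calculation 1 (not the app's code): per-arm sample size for a
# two-sample comparison of means vs. the same trial with the outcome dichotomized.
# Assumes a normal outcome (SD = 1), equal allocation, and standard
# normal-approximation power formulas.
from scipy.stats import norm

alpha, power = 0.05, 0.90
delta = 0.5                                   # treatment effect in SD units
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)

# Continuous outcome: n per arm for comparing two means
n_continuous = 2 * (z / delta) ** 2

# Dichotomized outcome: "responder" = outcome above a cutpoint c
c = 0.0                                       # cutpoint at the control-group mean
p0 = 1 - norm.cdf(c)                          # responder probability, control
p1 = 1 - norm.cdf(c - delta)                  # responder probability, treatment
p_bar = (p0 + p1) / 2
n_binary = 2 * p_bar * (1 - p_bar) * (z / (p1 - p0)) ** 2

print(f"n per arm, continuous outcome:   {n_continuous:.0f}")
print(f"n per arm, dichotomized outcome: {n_binary:.0f}")
print(f"inflation factor: {n_binary / n_continuous:.2f}")
```

With these illustrative settings the dichotomized analysis needs roughly 1.6 times as many patients per arm, close to the familiar asymptotic $\pi/2 \approx 1.57$ inflation for a split at the median.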

8 Likes

Amazing - every trialist should take a look at this app!

3 Likes

This is a great idea! Show clinicians, concretely and simply, the consequences of bad outcome definitions for sample size; this seems like the kind of presentation that could make a real impact.

3 Likes

Frank, WK 1986 points to such examples and the ones I have in mind. Look carefully at their Table 1: There, using a simple linear spline Q3 for the hazard can reduce the variance and RMSE for the survival function (which they label F) by huge amounts compared to the nonparametric KM estimator. Their Figure 1 pretty much explains why that happens. The only exception in their simulation has no censoring and is at the extreme of 10% survival; there, Q3 looks slightly worse but the difference is within their simulation error (and I think it reflects that they used suboptimal splines). In many other cases the improvement is a one-third or even two-thirds variance reduction, with large RMSE reductions as well.

Imagine now the task is to estimate (say) 1, 3 and 5 year survival under the different treatments, as well as the differences in these survivals, not just hazard ratios. A misleading statistical study would only compare semi-parametric PH vs. some arbitrary rigid parametric model under the assumption that the hazards are proportional. The statistical comparisons I want to see will cover a range of cases in which the hazards are not close to proportional over time, to allow for the fact that such an assumption is never certain in real human trials and usually has no basis whatsoever in data (claims it does have a basis arise from the fallacy of inferring it must hold because the P-value for it exceeds 0.05, failing to notice that the test has very little power in common trials and that many nonproportional models would fit just as well). The comparisons will contrast at least 3 approaches: nonparametric vs. the standard PH semi-parametric vs. flexible parametric, where the latter include time in a form flexible enough to allow both treatment and control hazards to get close to anything reasonable, but not overparameterized (e.g., using simple quadratic or penalized time splines in both the baseline hazard and the hazard-ratio function of a PH model).

I think results like WK 1986 show that such flexible parametric approaches are preferable to nonparametric and semi-parametric approaches for estimating survival probabilities and differences, in that in practical terms they will always be competitive and will often gain a lot in typical finite-sample trials over the conventional semiparametric model. I attribute that to the fact that the latter unrealistically places all its constraints on the hazard ratios and none on the baseline hazard, so that the conventional model is underparameterized for hazard ratios and overparameterized for the baseline hazard; the overparameterization is however hidden by looking only at hazard ratios.

3 Likes

Agreed that responder and non-responder are not ideal terms. Unfortunately they are commonly used in clinical trials. Users, if pushed, will admit that responder means ‘is observed to be beyond some threshold’ rather than ‘is caused by’ but will then frequently lapse automatically into causal non-sequiturs. Quite apart from the loss in efficiency due to dichotomising change from baseline to classify individuals as responders or non-responders, the causal confusion is responsible for much misunderstanding (and exaggeration) of the scope for personalised medicine. See Responder Despondency and Mastering Variation
However, your point is well-taken. Creation of responder/non-responder labels requires dichotomisation, but not all dichotomisation is carried out to award responder/non-responder labels.

6 Likes

I agree with Erin that examples for clinicians would be the best way to present this. I provide a highly illustrative example here of bands (late immature neutrophils, a form of the most common white blood cell) as a biomarker for severe infection.

https://discourse.datamethods.org/t/statistics-applied-using-threshold-science-fails-to-deliver-progression-toward-truth/4940?u=lawrence_lynn

[image: time course of the WBC/ANC and band counts during worsening infection]

An understanding of the pathophysiology of the relational time series of bandemia and the white blood cell count (WBC) or absolute neutrophil count (ANC) is pivotal to understanding the AUC of these biomarkers and the information lost by dichotomizing them.

This is a beautiful example because the WBC or ANC often rises first and then falls as infection worsens: the WBC is a depletable biomarker, since the flux of cells exiting the vascular system to the site of infection may exceed bone-marrow release. The bands, having an immature and less flexible nucleus, cannot exit between the endothelial cells as easily, so their numbers and their percentage of total neutrophils in the blood rise.

Here you see how dichotomization can render the wrong result: the threshold is crossed on the rising side of the WBC curve and then lost on the more severe falling side (see graph). Meanwhile the bands often rise progressively, making them the only elevated marker late in the infection.

A lack of understanding of the AUC resulted in the abandonment of the measurement of bands in many hospitals around the turn of the century. The comparison of the AUCs of the absolute neutrophil count (ANC) and the bands showed a greater AUC for the ANC, and this was misinterpreted as meaning that bands were not needed.

Here you see the incorrect view that one can discover the “best biomarker” using only the instant value and comparing AUCs. Based on this thinking, workers concluded that the ANC or WBC was superior to, and could replace, bands. Many centers abandoned bands based on this false interpretation of the AUC.


But the ANC is high early and the band count is high late, so this reflected a misunderstanding and a lack of consideration of time-series information by the AUC, which does not include the time domain. In other words, these AUCs were dependent on the severity of infection at the time of measurement.

The ill-advised abandonment of bands as a biomarker in many hospitals has been a major loss of bedside information, because bandemia is an excellent and almost free marker for mortality. In contrast, the proprietary replacement, procalcitonin (PCT), is a poor marker for mortality. The sensitivity of bandemia is low, but its presence in high numbers is a very useful marker, particularly in adults.

So showing the examples, especially this one where the information of an important, inexpensive, and non-proprietary biomarker was lost to many hospitals due to dichotomization and failure to consider the time course, would go a long way for clinicians who may become lost in more complex analyses.

Some human-predating bacteria, like group A streptococcus and N. meningitidis (meningococcus), generate a massive ANC and band response and then severe ANC depletion. The WBC (or ANC) threshold may not be breached if the patient arrives late to the hospital, because the value has already fallen back through the threshold as the infection worsened.

Here, a lack of informed consideration of the band count and of the time-series relationships can result in diagnostic delay and death.

This is an excellent reference which discusses the pathophysiology of bandemia.

At Univ of Missouri, where this article is from, I recall a young recruit from Fort Leonard Wood. The boy hailed from Hawaii to serve his country and died of meningococcal infection. Such a harsh climate, and an enclosed environment shared with the other recruits.

This is the desolate back country of western Missouri, the land of the bushwhackers, the allies of Jesse James in the Civil War. This young man, who could have stayed and enjoyed the warm North Shore, gave his life for his country like so many young Missouri boys who hurried to defend Hawaii in WWII.

These recruits need us to spend the time to have a deep understanding of these time series relationships. They deserve more than simplistic dichotomy based thinking.

——-
Note: Places like U of Mo., juxtaposed to the army base which trains recruits, are the perfect places to see enough of these cases to provide the lessons physicians and statisticians need to learn. Stopping the funding for this fundamental work must be prevented.

3 Likes

Sander, that has been clear for a long time and is not what I’m referring to. Stratified K-M estimates are extremely noisy. Covariate-varying effects in semiparametric models are virtually as efficient as similar parametric model estimators. The worst efficiency I know of from a semiparametric model is in estimating median(Y), which has relative efficiency $\frac{2}{\pi}$ when X is categorical and the comparator is a Gaussian mean used as a median estimator. A semiparametric model’s estimates of the mean and of survival probabilities, on the other hand, are very efficient.
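(For reference: that $\frac{2}{\pi}$ is the classic asymptotic relative efficiency of the sample median to the sample mean under normality,

$$\operatorname{Var}(\hat m) \approx \frac{1}{4 n f(m)^2} = \frac{\pi \sigma^2}{2n}, \qquad \operatorname{Var}(\bar Y) = \frac{\sigma^2}{n}, \qquad \mathrm{ARE} = \frac{\sigma^2/n}{\pi\sigma^2/(2n)} = \frac{2}{\pi} \approx 0.64,$$

where $f$ is the normal density and $m$ the true median.)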

1 Like

This is a wonderful planned paper but please suffer me as I play the devil’s advocate and suggest that continuous outputs are well understood as a means to “bail out” weak clinical signals. (By this I mean it is understood as a technique for that purpose in general, not in technical/mathematical terms, as these approaches are certainly beyond my understanding).

However, using antibiotics as an illustrative “high clinical signal generator”, we have a microbiological cure as one side of the output dichotomy.

In contrast we have anti-amyloid antibodies as a “weak clinical signal generator”. The output is continuous.

This is not to say that weak signals are not important, but they are hard to define clinically if not statistically. However, the point is that the issue of output information loss is well known.

The larger issue is the information loss from input dichotomization. This is completely misunderstood by clinicians.

It would be great if your paper addressed both.

Frank: Clearly we are talking past each other in addition to being way off the topic of this thread. I think you are missing my point about the deficiencies of conventional semiparametric PH modeling when the target effect (causal estimand) is a clinically relevant measure with strong dependence on absolute risks, like the time-specific survival differences. That problem is an immediate consequence of the noisiness of K-M, and it is especially severe in small trials with considerable censoring (of which there are many). Your own ultra-simple example displays a $1 - 2/\pi \approx 36\%$ efficiency loss, which is considerable; the losses shown in WK 1986 reach up to 90%.

As a point of both logic and practice recommendations, the burden of proof is on you to support your general claim that “covariate-varying effects in semiparametric models are virtually as efficient as similar parametric model estimators”. Such a sweeping claim requires a comprehensive simulation study to show how it holds in an all-inclusive array of nonproportional-hazard cases, with estimands that are functions of the absolute hazards (like risk and survival differences), and with compared groups like those displayed in WK 1986 (small samples and non-negligible censoring). You have yet to provide anything approaching that kind of study.

I am pretty sure such a study won’t support your claim, because (as I pointed out earlier) realistic counterexamples to your claim follow directly from taking differences between groups shown in Table 1 of WK 1986. After all, the variances of those differences are just the sum of the variances, which shows immediately how even the relatively weak parametric modeling used by WK can vastly improve small-sample efficiency for the differences.
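In symbols, for independent arms, $\operatorname{Var}\big(\hat S_1(t) - \hat S_0(t)\big) = \operatorname{Var}\big(\hat S_1(t)\big) + \operatorname{Var}\big(\hat S_0(t)\big)$, so any variance reduction achieved within each group transfers directly to the difference estimate.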

In light of these basic facts, and in the absence of a comprehensive study, I can only view your claim as another example of a statistical urban legend which arose in the ascendancy of multiplicative models in the 1960s-70s. That legend holds within the very narrow confines of those models when ratio comparisons are the prime targets and absolute baselines are treated as “nuisance functions” or “nuisance parameters” (a term which blinds the theory and user to the importance of absolute risks).

None of this justifies use of other, more ill-behaved models (like additive hazards), but we should encourage more efficient forms of PH models that shift much of the parametric restriction (model df) from the proportionality function to the baseline hazard. This vision is I think clear in the Cox & Oakes 1984 survival-analysis text. Over 40 years later it needs to be more widely appreciated, not opposed because it goes against special cases that were mistakenly treated as establishing a general fact.

1 Like

I hope to find the kind of time you devote to this topic while I’m trying not to be provoked by your writing style. My experience is with relative measures such as hazard ratios and odds ratios, and the evidence is more than sufficient for the abundant efficiency of semiparametric estimators in that setting. As long as we keep classes of estimators (relative estimands vs absolute estimands for example) separate I think we are much in agreement. Regarding absolute estimands, the ECDF is a useful case study. It, like Kaplan-Meier, seems to be very inefficient, but once you account for all the model uncertainty in fitting parametric models, the ECDF is not all that bad. It’s roughly equivalent in precision to fitting a 3 to 4 parameter distribution.

Sorry Frank about the tone, but it’s because each of your responses seems to add more statements that I can’t make sense of. You wrote “As long as we keep classes of estimators (relative estimands vs absolute estimands for example) separate I think we are much in agreement”. That ought to be the case, but then you wrote that “once you account for all the model uncertainty in fitting parametric models, the ECDF is not all that bad” - if by that you mean fitting a parametric model flexible enough to approximate anything reasonable, it looks to me like Whittemore and Keller did that and still found huge improvements over the fully nonparametric-baseline (KM) case, which seems to contradict your statement.

WK’s simulations also look to me like they contradict the claim that the ECDF is “roughly equivalent to fitting a 3 or 4 parameter distribution”: WK’s splines have more parameters and still vastly outperform KM in small to modest samples. Even if your statement were roughly correct, it wouldn’t tell me that parametrics can’t do better than the ECDF, because not all k-parameter distributions are sensible or equivalent in terms of bias or efficiency.

I therefore have no choice but to ask for precise justifications for your statements and explanations as to why they seem to contradict the WK simulations (as well as my own experience). If I am mistaken in seeing the WK simulations as refuting your statements, I would be grateful if you explained my mistake. On the other hand, if I have made no mistake, then my only guess is your statements arose from considering cases in which N is large enough for the asymptotics to dominate performance, censoring is light or absent, and the underlying generative model minimized the inefficiencies of ECDF.

I will offer one general rule: I think the conditions needed to justify a statement need to be given clearly up front so that the statement isn’t overgeneralized and the exceptions are recognizable. I don’t think this kind of demand is any different or less sensible than what is demanded for safe generalizations in engineering or medical practice, and it is an absolute requirement in mathematical theory. That’s all I’m asking for now.

Sander, it is flaws in the Miller paper referenced by WK that got me into this area a long time ago (Miller showed that a pre-specified 1-parameter parametric survival model with the oracle property of perfectly fitting the data is much better than K-M). The Q3 result in WK is surprising to me, but I think you may get similar gains if you just connected K-M points with polygons instead of flat line segments.

I had 2 PhD students who developed & published spline proportional hazards models and I do like those models a lot.

I did not see a resolution of your statements with the WK results, so we’ll leave those statements as items in dispute as generalities (as opposed to useful special cases).

Like you I’d expect that connecting the event times by polygons would work better than KM (given that KM is vastly overparameterized in the sense of ignoring even basic uniform continuity). I would also expect further and easier improvement from using a simple smooth spline like a quadratic with a few fixed knots, because that could approximate any “reasonable expectation” (such as smoothness) and leave more df where it is most needed and neglected: in the proportionality function. It’s also easy to implement, and the usual fixed-parameter theory applies. The way I see it, this is a very low-risk, low-labor modification of conventional survival modeling, one which can considerably improve small-sample behavior while reducing dependence on assumptions about effects (such as time-independent hazard ratios) that have no basis in data. It thus addresses the point that it is wasteful of hard-won data to not exploit easy gains via upgraded analysis methods.

1 Like

I would love to see a graduate student take that on. My prediction is that a polygon KM corner-cutting method will have mean squared error that only suffers by a negligible amount when compared to a smooth method. Let’s test that. Someone could start with an easier task: pick a distribution family that goes from 1-4 parameters that have to be estimated by MLE and see how many parameters are needed to make MSE the same as KM, as a function of N. This is essentially a more meaningful redo of the Miller “What Price Kaplan-Meier” paper.
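A minimal sketch of that easier task, under my own toy assumptions (exponential event times, uniform censoring, a single-parameter exponential MLE vs. Kaplan-Meier for S(t₀)), just to show the shape of the simulation:

```python
# Toy sketch of the "easier task" above (my own set-up, not Miller's or WK's):
# exponential event times with uniform censoring; estimate S(t0) by
# (a) Kaplan-Meier and (b) a 1-parameter exponential MLE, and compare MSE by N.
import numpy as np

rng = np.random.default_rng(1)

def km_surv(time, event, t0):
    """Kaplan-Meier estimate of S(t0) from right-censored data (no ties assumed)."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    at_risk = len(time) - np.arange(len(time))      # at risk just before each time
    factors = np.where((event == 1) & (time <= t0), 1 - 1 / at_risk, 1.0)
    return factors.prod()

def exp_surv(time, event, t0):
    """Exponential MLE: rate = events / total follow-up time."""
    return np.exp(-(event.sum() / time.sum()) * t0)

lam, t0, reps = 1.0, 1.0, 5000
S_true = np.exp(-lam * t0)

for n in (20, 50, 100, 500):
    err_km, err_exp = [], []
    for _ in range(reps):
        t_event = rng.exponential(1 / lam, n)
        t_cens = rng.uniform(0, 3, n)               # roughly 30% censoring
        time = np.minimum(t_event, t_cens)
        event = (t_event <= t_cens).astype(int)
        err_km.append(km_surv(time, event, t0) - S_true)
        err_exp.append(exp_surv(time, event, t0) - S_true)
    print(f"N={n:4d}  MSE(KM)={np.mean(np.square(err_km)):.5f}"
          f"  MSE(exp MLE)={np.mean(np.square(err_exp)):.5f}")
```

One could then swap in 2-, 3-, and 4-parameter families (Weibull, generalized gamma, spline hazards, …) fitted by MLE and see, as a function of N, where their MSE curves meet KM’s.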

1 Like

Lawrence: That’s a really nice example. Our paper focuses mainly on the loss of statistical information due to dichotomizing the outcome in a clinical trial. We quantify this loss in terms of sample size. The comments on this blog have made it clear to me that our focus is quite narrow. We should at least mention some of the many other problems with dichotomization ranging from causal misunderstandings to the loss of clinical information that you point out.

I’m not so sure about your point that continuous outcomes are a way to boost a weak signal. I do agree that some trials have (continuous) outcomes with little clinical relevance, but that seems like a separate issue.

1 Like

Frank: My question would now be: Why is that comparison of any practical interest? The polygon assumes continuity, and given continuity I fully expect smoothness to hold as well; so what is the advantage of a polygon? It doesn’t satisfy even first-order smoothness, nor does it satisfy the prespecification assumption used by ordinary software to produce the interval estimates, whereas a prespecified quadratic spline does. So, if there is no meaningful difference in MSE between the two, why bother with the polygon?

On the other hand, it might be of interest to see how many parameters make the MSE the same as KM, provided (1) those parameters are in a model I’d actually use (a smooth one like a quadratic spline) and (2) the sample sizes examined include relatively small ones so the results aren’t just reflecting asymptotic theory (which as seen in the WK simulations can be very misleading in small to modest samples).

Totally agree re: N selected for simulations. I doubt that there would be much difference in MSE between spline and flexible parametric distributions. I’m less concerned about first-order smoothness. I like step functions because they handle clumping in the data / floor and ceiling effects / true discontinuities in the measurement process like we see in patient reported outcomes and outcomes representing a mixture of continuous and discrete components. As a slight aside, step functions have an efficiency of 1.0 for estimating means because they give rise to the sample mean (when the sample mean is a good thing to estimate).

Well at least we agree on N. The rest raises yet more issues…

Please explain:
“I doubt that there would be much difference in MSE between spline and flexible parametric distributions.”
??!! For me there is no distinction at all: I was taught that splines are examples of flexible parametric distributions, so they are what I think of first when discussing flexible trend curves. Also, if the knots are pre-specified, we can fit splines with ordinary software and use the usual outputs (P-values and interval estimates) without added complications. Furthermore, we can include step functions among the options because those are zero-degree splines. So with flexible parametric modeling we have the options of not imposing continuity (zero-degree/step-function splines), imposing continuity but not smoothness (first-degree/linear splines, zero-order smoothness), imposing minimal smoothness (second-degree/quadratic splines, first-order smoothness), or imposing smooth first derivatives (third-degree/cubic splines, second-order smoothness). Terminology may vary.
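For concreteness, here is a tiny numpy sketch of that hierarchy, using a truncated-power basis with pre-specified knots (my own illustration; B-spline bases with the same knots and degree span the same space):

```python
# Truncated-power spline basis with fixed knots: degree 0 gives step functions,
# degree 1 a continuous linear spline, degree 2 a quadratic spline with a
# continuous first derivative, degree 3 a cubic spline with a continuous
# second derivative. All are ordinary fixed-parameter design matrices.
import numpy as np

def truncated_power_basis(x, knots, degree):
    x = np.asarray(x, dtype=float)
    cols = [x ** d for d in range(degree + 1)]            # 1, x, ..., x^degree
    for k in knots:                                        # one column per knot
        cols.append((x > k).astype(float) if degree == 0
                    else np.clip(x - k, 0, None) ** degree)
    return np.column_stack(cols)

x = np.linspace(0, 10, 101)
for d in (0, 1, 2, 3):
    B = truncated_power_basis(x, knots=[3.0, 7.0], degree=d)
    print(f"degree {d}: {B.shape[1]} columns")
```

Because the knots are fixed in advance, each of these is just a design matrix for ordinary regression software, so the usual P-values and interval estimates apply, as noted above.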

“I like step functions because they handle clumping in the data / floor and ceiling effects / true discontinuities”.
-Polygons are continuous and so would be knocked out by your preference, leaving me to wonder even more why you mentioned them.
When we expect floors, ceilings, change-points or discontinuities (e.g., from changes in data codings), they can be handled by assigning parameters to them. Otherwise, if we allow the data to pick their number and location, we should use statistics that account for this nonstandard flexibility. If however the jumps are artefacts of the estimator while the underlying target process is continuous, then the process estimator can be improved by making it continuous, thus eliminating the artefact.

How to handle clumping should also depend on context: If the clumping is a measurement artefact and the underlying target process has none, then the process estimator can be improved by eliminating the artefact. If on the other hand the clumping is part of the target process (e.g., as in repeated outbreaks) then, as with discontinuities, I would model the clumping process directly, as done in spatial epidemic modeling (which usually assumes underlying continuity).

Bottom line is that I cannot see any reason for particularly liking step functions - they are just another model option, one which I think is vastly overused, and often could be easily replaced to achieve worthwhile gains in realism, efficiency, and accuracy.

Thank you. Yes, this entire area is so important for clinicians to understand.

Clinicians are in completely different silos so providing the simplest explanation is pivotal (as Erin points out).

Many are easily confused as was evident in the case of bandemia analysis by AUC I presented.

Clinicians ultimately have to make a substantially dichotomous decision to treat or not to treat. Many use a technique I call “Bayesian gestalt”. In other words, they consider the instant patient-related priors and the extant (published) outcome data to make a decision.

Dichotomization of the output reduces the ability to engage in this level of bedside decision making.

I am very happy to see you are doing this work and hope you will publish a paper which is readily consumable by clinicians.

Note: it was not my intent to say the continuous output “boosts” a weak signal but rather that a continuous output is necessary to reliably see a weak signal.

Although, of course, gross, negligent or poorly conceived or derived dichotomization can hide even a strong signal.

I look forward to your work.

1 Like

This is a really excellent post!

Just two points:

  1. Sometimes researchers will be working with outcomes that do not follow a normal distribution (e.g., the results of a questionnaire, which are bounded by some lower and upper limit). Do we know if the loss of efficiency is any more or less pronounced with these, comparing a binary test to either a t-test applied to the non-normal data or to a more appropriate model? I realize this point might be a separate paper all on its own though.

  2. Recently I’d come across an interesting paper where an unimpressive mean difference was accompanied by an impressive difference in the proportion of responders (the outcome was how well patients scored on a questionnaire).

The authors do acknowledge that comparing continuous means does improve statistical power compared to dichotomized outcomes. However, they also note that, were one to judge the treatment effect by the more statistically precise measure (difference in mean scores), one would overlook an important difference which is only apparent with the less statistically precise measure (proportion of responders). I think it would be good to address concerns like these where there’s a disconnect between the implications of the 2 measures.

My guess is that an ordinal model would be able to provide you with the best of both worlds: minimal loss of statistical efficiency (if any; sometimes considerable gains could be made) + the ability to alternatively communicate the result in a manner which some would find easier to interpret.
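To make that guess concrete, here is a small simulation sketch (my own toy data-generating model, not the paper’s): it compares the power of a t-test on a bounded 0–20 questionnaire score, a chi-square test after dichotomizing at an arbitrary “responder” cutoff, and the Mann-Whitney test, which for two groups corresponds to the score test of a proportional-odds ordinal model:

```python
# Toy power comparison (hypothetical data-generating model):
# (a) t-test on raw 0-20 scores, (b) chi-square test after dichotomizing at an
# arbitrary "responder" cutoff, (c) Mann-Whitney (two-group proportional-odds
# score test). Scores are drawn from binomial(20, p) to mimic a bounded scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps, alpha = 60, 2000, 0.05
hits = {"t-test": 0, "dichotomized": 0, "Mann-Whitney": 0}

for _ in range(reps):
    ctrl = rng.binomial(20, 0.45, n)          # control-arm questionnaire scores
    trt = rng.binomial(20, 0.50, n)           # treatment arm, modest shift

    hits["t-test"] += stats.ttest_ind(trt, ctrl).pvalue < alpha

    # Dichotomize at an arbitrary "responder" cutoff of >= 12
    resp_t, resp_c = int((trt >= 12).sum()), int((ctrl >= 12).sum())
    table = [[resp_t, n - resp_t], [resp_c, n - resp_c]]
    _, p, _, _ = stats.chi2_contingency(table)
    hits["dichotomized"] += p < alpha

    hits["Mann-Whitney"] += stats.mannwhitneyu(trt, ctrl).pvalue < alpha

for name, count in hits.items():
    print(f"{name:>13}: simulated power ~ {count / reps:.2f}")
```

Under set-ups like this the ordinal/rank approach typically tracks the t-test closely, while the dichotomized comparison loses the most power; a full proportional-odds fit would additionally let you report both a mean-score difference and an exceedance (“responder”) probability from the same model.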

1 Like