Dichotomization

Sander, that has been clear for a long time and is not what I’m referring to. Stratified K-M estimates are extremely noisy. Covariate-varying effects in semiparametric models are virtually as efficient as the corresponding parametric model estimators. The worst efficiency I know of from a semiparametric model is in estimating median(Y), which has relative efficiency 2/π (about 0.64) when X is categorical and the comparator is the Gaussian mean used as a median estimator. A semiparametric model’s estimates of the mean and of survival probabilities, on the other hand, are very efficient.
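For concreteness, here is a quick R check of where that 2/π figure comes from (my own illustration; standard normal data and an arbitrary sample size): the sampling variance of the sample mean versus that of the sample median.

set.seed(1)
n <- 200
# Repeatedly sample from a Gaussian and record both estimators of the center
est <- replicate(5000, {
  y <- rnorm(n)
  c(mean = mean(y), median = median(y))
})
# Relative efficiency of the median = var(mean) / var(median); should be near 2/pi
var(est["mean", ]) / var(est["median", ])
2 / pi   # about 0.64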

1 Like

This is a wonderful planned paper but please suffer me as I play the devil’s advocate and suggest that continuous outputs are well understood as a means to “bail out” weak clinical signals. (By this I mean it is understood as a technique for that purpose in general, not in technical/mathematical terms, as these approaches are certainly beyond my understanding).

However, using antibiotics as an illustrative “high clinical signal generator”, we have a microbiological cure as one side of the output dichotomy.

In contrast we have anti amyloid antibodies as a “weak clinical signal generator”. The output is continuous.

This is not to say that weak signals are not important, but they are hard to define clinically if not statistically. The point, however, is that the issue of output information loss is well known.

The larger issue is the information loss from input dichotomization. This is completely misunderstood by clinicians.

It would be great if your paper addressed both.

Frank: Clearly we are talking past each other in addition to being way off the topic of this thread. I think you are missing my point about the deficiencies of conventional semiparametric PH modeling when the target effect (causal estimand) is a clinically relevant measure with strong dependence on absolute risks, like the time-specific survival differences. That problem is an immediate consequence of the noisiness of K-M, and is a major problem in small trials with considerable censoring (of which there are many). Your own ultra-simple example displays a 1 - 2/π ≈ 36% efficiency loss, which is considerable; the losses shown in WK 1986 reach up to 90%.

As a point of both logic and practice recommendations, the burden of proof is on you to support your general claim that a “covariate-varying effects in semiparametric models are virtually as efficient as similar parametric model estimators”. Such a sweeping claim requires a comprehensive simulation study to show how it holds in an all-inclusive array of nonproportional-hazard cases, with estimands that are functions of the absolute hazards (like risk and survival differences), and with compared groups like those displayed in WK 1986 (small samples and non-negligible censoring). You have yet to provide anything approaching that kind of study.

I am pretty sure such a study won’t support your claim, because (as I pointed out earlier) realistic counterexamples to your claim follow directly from taking differences between groups shown in Table 1 of WK 1986. After all, the variances of those differences are just the sum of the variances, which shows immediately how even the relatively weak parametric modeling used by WK can vastly improve small-sample efficiency for the differences.

In light of these basic facts, and in the absence of a comprehensive study, I can only view your claim as another example of a statistical urban legend which arose in the ascendancy of multiplicative models in the 1960s-70s. That legend holds within the very narrow confines of those models when ratio comparisons are the prime targets and absolute baselines are treated as “nuisance functions” or “nuisance parameters” (a term which blinds the theory and user to the importance of absolute risks).

None of this justifies use of other, more ill-behaved models (like additive hazards), but we should encourage more efficient forms of PH models that shift much of the parametric restriction (model df) from the proportionality function to the baseline hazard. This vision is I think clear in the Cox & Oakes 1984 survival-analysis text. Over 40 years later it needs to be more widely appreciated, not opposed because it goes against special cases that were mistakenly treated as establishing a general fact.

1 Like

I hope to find the kind of time you devote to this topic while trying not to be provoked by your writing style. My experience is with relative measures such as hazard ratios and odds ratios, and the evidence is more than sufficient for the high efficiency of semiparametric estimators in that setting. As long as we keep classes of estimators (relative estimands vs absolute estimands for example) separate I think we are much in agreement. Regarding absolute estimands, the ECDF is a useful case study. Like Kaplan-Meier, it seems to be very inefficient, but once you account for all the model uncertainty in fitting parametric models, the ECDF is not all that bad. It’s roughly equivalent in precision to fitting a 3- to 4-parameter distribution.

Sorry Frank about the tone, but it’s because each of your responses seems to add more statements that I can’t make sense of. You wrote “As long as we keep classes of estimators (relative estimands vs absolute estimands for example) separate I think we are much in agreement”. That ought to be the case, but then you wrote that “once you account for all the model uncertainty in fitting parametric models the ECDF is not all that bad” - if by that you mean fitting a parametric model flexible enough to approximate anything reasonable, it looks to me like Whittemore and Keller did that and still found huge improvements over the fully nonparametric-baseline (KM) case, which seems to contradict your statement.

WK’s simulations also look to me like they contradict the claim that the ECDF is “roughly equivalent to fitting a 3 or 4 parameter distribution”: WK’s splines have more parameters and still vastly outperform KM in small to modest samples. Even if your statement were roughly correct, it wouldn’t tell me that parametrics can’t do better than the ECDF, because not all k-parameter distributions are sensible or equivalent in terms of bias or efficiency.

I therefore have no choice but to ask for precise justifications for your statements and explanations as to why they seem to contradict the WK simulations (as well as my own experience). If I am mistaken in seeing the WK simulations as refuting your statements, I would be grateful if you explained my mistake. On the other hand, if I have made no mistake, then my only guess is that your statements arose from considering cases in which N is large enough for the asymptotics to dominate performance, censoring is light or absent, and the underlying generative model minimized the inefficiencies of the ECDF.

I will offer one general rule: I think the conditions needed to justify a statement need to be given clearly up front so that the statement isn’t overgeneralized and the exceptions are recognizable. I don’t think this kind of demand is any different or less sensible than what is demanded for safe generalizations in engineering or medical practice, and it is an absolute requirement in mathematical theory. That’s all I’m asking for now.

Sander, it is flaws in the Miller paper referenced by WK that got me into this area a long time ago (Miller showed that a pre-specified 1-parameter parametric survival model with the oracle property of perfectly fitting the data is much better than K-M). The Q3 result in WK is surprising to me, but I think you might get similar gains if you just connected the K-M points with polygons instead of flat line segments.

I had 2 PhD students who developed & published spline proportional hazards models and I do like those models a lot.

I did not see a resolution of your statements with the WK results, so we’ll leave those statements as items in dispute as generalities (as opposed to useful special cases).

Like you I’d expect that connecting the event times by polygons would work better than KM (given that KM is vastly overparameterized in the sense of ignoring even basic uniform continuity). I would also expect further and easier improvement from using a simple smooth spline like a quadratic with a few fixed knots, because that could approximate any “reasonable expectation” (such as smoothness) and leave more df where it is most needed and neglected: in the proportionality function. It’s also easy to implement, and the usual fixed-parameter theory applies. The way I see it, this is a very low-risk, low-labor modification of conventional survival modeling, one which can considerably improve small-sample behavior while reducing dependence on assumptions about effects (such as time-independent hazard ratios) that have no basis in data. It thus addresses the point that it is wasteful of hard-won data not to exploit easy gains via upgraded analysis methods.
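To make that concrete, here is a hedged sketch of one way to fit such a model in R, using the Royston-Parmar spline implementation in the flexsurv package (my own illustration, not a specific recommendation from this discussion; the data set, knot count, and time points are arbitrary):

library(survival)   # lung data and coxph, for comparison
library(flexsurv)   # flexsurvspline: PH model with a spline for the baseline log cumulative hazard

# Proportional hazards with a smooth spline baseline (2 interior knots, an arbitrary choice)
fit_spline <- flexsurvspline(Surv(time, status == 2) ~ sex, data = lung,
                             k = 2, scale = "hazard")
# Conventional Cox model, baseline left unspecified
fit_cox <- coxph(Surv(time, status == 2) ~ sex, data = lung)

coef(fit_cox)             # log hazard ratio for sex
coef(fit_spline)["sex"]   # essentially the same log hazard ratio, but the spline fit
                          # also yields smooth absolute-risk summaries directly:
summary(fit_spline, newdata = data.frame(sex = 1:2),
        t = c(180, 365), type = "survival")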

1 Like

I would love to see a graduate student take that on. My prediction is that a polygon KM corner-cutting method will have mean squared error that suffers only by a negligible amount when compared to a smooth method. Let’s test that. Someone could start with an easier task: pick a distribution family that goes from 1 to 4 parameters that have to be estimated by MLE and see how many parameters are needed to make the MSE the same as KM, as a function of N. This is essentially a more meaningful redo of the Miller “What Price Kaplan-Meier” paper.
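Here is a rough R sketch of that easier task (my own illustration, with arbitrary assumptions: exponential event times, uniform censoring, a single target time t0 = 1, and a 2-parameter Weibull fitted by MLE as the parametric competitor):

library(survival)

sim_mse <- function(N, t0 = 1, nsim = 1000, rate = 1, cens_max = 3) {
  true_S <- exp(-rate * t0)   # true survival probability at t0
  km_err <- wb_err <- numeric(nsim)
  for (i in seq_len(nsim)) {
    t_event <- rexp(N, rate = rate)
    t_cens  <- runif(N, 0, cens_max)
    time    <- pmin(t_event, t_cens)
    status  <- as.numeric(t_event <= t_cens)
    # Kaplan-Meier estimate of S(t0)
    km   <- survfit(Surv(time, status) ~ 1)
    km_S <- summary(km, times = t0, extend = TRUE)$surv
    # Weibull MLE estimate of S(t0)
    fit  <- survreg(Surv(time, status) ~ 1, dist = "weibull")
    wb_S <- 1 - pweibull(t0, shape = 1 / fit$scale, scale = exp(coef(fit)))
    km_err[i] <- (km_S - true_S)^2
    wb_err[i] <- (wb_S - true_S)^2
  }
  c(KM = mean(km_err), Weibull = mean(wb_err))   # MSE of each estimator at t0
}

set.seed(1)
sapply(c(25, 50, 100, 200), sim_mse)   # columns correspond to N = 25, 50, 100, 200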

1 Like

Lawrence: That’s a really nice example. Our paper focuses mainly on the loss of statistical information due to dichotomizing the outcome in a clinical trial. We quantify this loss in terms of sample size. The comments on this blog have made it clear to me that our focus is quite narrow. We should at least mention some of the many other problems with dichotomization ranging from causal misunderstandings to the loss of clinical information that you point out.

I’m not so sure about your point that continuous outcomes are a way to boost a weak signal. I do agree that some trials have (continuous) outcomes with little clinical relevance, but that seems like a separate issue.

1 Like

Frank: My question would now be: Why is that comparison of any practical interest? The polygon assumes continuity, and given continuity I fully expect smoothness to hold as well; so what is the advantage of a polygon? It doesn’t satisfy even first-order smoothness, nor does it satisfy the prespecification assumption used by ordinary software to produce the interval estimates, whereas a prespecified quadratic spline does. So, if there is no meaningful difference in MSE between the two, why bother with the polygon?

On the other hand, it might be of interest to see how many parameters make the MSE the same as KM, provided (1) those parameters are in a model I’d actually use (a smooth one like a quadratic spline) and (2) the sample sizes examined include relatively small ones so the results aren’t just reflecting asymptotic theory (which as seen in the WK simulations can be very misleading in small to modest samples).

Totally agree re: the Ns selected for simulations. I doubt that there would be much difference in MSE between spline and flexible parametric distributions. I’m less concerned about first-order smoothness. I like step functions because they handle clumping in the data / floor and ceiling effects / true discontinuities in the measurement process like we see in patient-reported outcomes and outcomes representing a mixture of continuous and discrete components. As a slight aside, step functions have an efficiency of 1.0 for estimating means because they give rise to the sample mean (when the sample mean is a good thing to estimate).

Well at least we agree on N. The rest raises yet more issues…

Please explain:
“I doubt that there would be much difference in MSE between spline and flexible parametric distributions.”
??!! For me there is no distinction at all: I was taught that splines are examples of flexible parametric distributions, so they are what I think of first when discussing flexible trend curves. Also, if the knots are pre-specified, we can fit splines with ordinary software and use the usual outputs (P-values and interval estimates) without added complications. Furthermore, we can include step functions among the options because those are zero-degree splines. So with flexible parametric modeling we have the options of not imposing continuity (zero-degree/step-function splines), imposing continuity but not smoothness (first-degree/linear splines, zero-order smoothness), imposing minimal smoothness (second-degree/quadratic splines, first-order smoothness), or imposing smooth first derivatives (third-degree/cubic splines, second-order smoothness). Terminology may vary.
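To make these options concrete, here is a small R illustration (my own; the data, knot locations, and response are arbitrary) of fitting each degree with pre-specified knots via ordinary least squares, so the usual P-values and interval estimates apply:

library(splines)  # bs(): B-spline basis with fixed, pre-specified interior knots

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
knots <- c(2.5, 5, 7.5)   # pre-specified knot locations

fit_step  <- lm(y ~ cut(x, breaks = c(-Inf, knots, Inf)))  # degree 0: step function, no continuity
fit_lin   <- lm(y ~ bs(x, knots = knots, degree = 1))      # degree 1: continuous, kinks at the knots
fit_quad  <- lm(y ~ bs(x, knots = knots, degree = 2))      # degree 2: continuous first derivative
fit_cubic <- lm(y ~ bs(x, knots = knots, degree = 3))      # degree 3: continuous second derivative

# Ordinary fixed-parameter inference applies because the knots were fixed in advance, e.g.:
confint(fit_quad)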

“I like step functions because they handle clumping in the data / floor and ceiling effects / true discontinuities”.
-Polygons are continuous and so would be knocked out by your preference, leaving me to wonder even more why you mentioned them.
When we expect floors, ceilings, change-points or discontinuities (e.g., from changes in data codings), they can be handled by assigning parameters to them. Otherwise, if we allow the data to pick their number and location, we should use statistics that account for this nonstandard flexibility. If, however, the jumps are an artefact of the estimator while the underlying target process is continuous, then the process estimator can be improved by making it continuous, thus eliminating the artefact.

How to handle clumping should also depend on context: If the clumping is a measurement artefact and the underlying target process has none, then the process estimator can be improved by eliminating the artefact. If on the other hand the clumping is part of the target process (e.g., as in repeated outbreaks) then, as with discontinuities, I would model the clumping process directly, as done in spatial epidemic modeling (which usually assumes underlying continuity).

Bottom line is that I cannot see any reason for particularly liking step functions - they are just another model option, one which I think is vastly overused, and often could be easily replaced to achieve worthwhile gains in realism, efficiency, and accuracy.

Thank you. Yes, this entire area is so important for clinicians to understand.

Clinicians are in completely different silos so providing the simplest explanation is pivotal (as Erin points out).

Many are easily confused, as was evident in the bandemia-by-AUC analysis I presented.

Clinicians ultimately have to make a substantially dichotomous decision to treat or not to treat. Many use a technique I call “Bayesian gestalt”. In other words they consider the instant patient related priors and the extant (published) outcome data to make a decision.

Dichotomization of the output reduces the ability to engage in this level of bedside decision making.

I am very happy to see you are doing this work and hope you will publish a paper which is readily consumable by clinicians.

Note: it was not my intent to say the continuous output “boosts” a weak signal but rather that a continuous output is necessary to reliably see a weak signal.

Although, of course, gross, negligent or poorly conceived or derived dichotomization can hide even a strong signal.

I look forward to your work.

1 Like

This is a really excellent post!

Just two points:

  1. Sometimes researchers will be working with outcomes that do not follow a normal distribution (e.g., the results of a questionnaire, which are bounded by some lower and upper limit). Do we know if the loss of efficiency is any more or less pronounced with these, i.e., comparing a binary test to either a t-test applied to the non-normal data or to a more appropriate model? I realize this point might be a separate paper all on its own, though.

  2. Recently I’d come across an interesting paper where an unimpressive mean difference was accompanied by an impressive difference in the proportion of responders (the outcome was how well patients scored on a questionnaire).

The authors do acknowledge that comparing continuous means does improve statistical power compared to dichotomized outcomes. However, they also note that, were one to judge the treatment effect by the more statistically precise measure (difference in mean scores), one would overlook an important difference which is only apparent with the less statistically precise measure (proportion of responders). I think it would be good to address concerns like these where there’s a disconnect between the implications of the 2 measures.

My guess is that an ordinal model would be able to provide you with the best of both worlds: minimal loss of statistical efficiency (if any; sometimes considerable gains could be made) + the ability to alternatively communicate the result in a manner which some would find easier to interpret.

1 Like

Sorry I disagree with most of what you wrote. Let’s drop this as it’s not very productive, but return to it when we have some practical demonstration projects in mind.

Ahmed: Great points! To your first point: we know that the Mann-Whitney test has somewhat less power than the t-test when the data are (approximately) normal. So the difference in power between a test of proportions and the Mann-Whitney test would be smaller than the difference between a test of proportions and the t-test. However, we might be able to use some other parametric model for the continuous data, such as the log-normal distribution; that would restore the difference.
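As a rough illustration of those power comparisons (my own sketch; the effect size, dichotomization cutpoint, per-group sample size, and number of simulations are arbitrary, and the data are generated as normal):

# Power of a two-sample proportions test on a dichotomized outcome vs. the
# t-test and the Wilcoxon (Mann-Whitney) test on the continuous outcome
sim_power <- function(n = 100, delta = 0.5, cut = 0, nsim = 2000, alpha = 0.05) {
  res <- replicate(nsim, {
    x <- rnorm(n)
    y <- rnorm(n, mean = delta)
    p_t    <- t.test(x, y)$p.value
    p_w    <- wilcox.test(x, y)$p.value
    p_prop <- prop.test(c(sum(x > cut), sum(y > cut)), c(n, n))$p.value
    c(t = p_t, wilcoxon = p_w, dichotomized = p_prop) < alpha
  })
  rowMeans(res)   # estimated power of each test
}

set.seed(1)
sim_power()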

To your second point: it’s true that the difference in means can be small, while the difference in “responders” is large. This can happen when the treatment is good for some and bad for others. In such cases, it would be especially important to look at the complete (un-dichotomized) outcome.

Ordinal models are indeed very attractive. I know that Frank advocates for them!

2 Likes

Hi Ahmed

I can only see the abstract of the paper you linked to, but it seems like this might be a great example of several conceptual misunderstandings relevant to this thread.

The authors cite “change from baseline” analyses within treatment arms and suggest that patients whose scores “changed” by more than a certain amount were those who showed a clinically important “response.” I’ll try to show below how conflation of “change from baseline” with the word “response” can lead to wrong inferences by clinicians.

In a recent short correspondence published by Dr. Senn, I realized that statisticians seem to have a different interpretation of the word “response” than clinicians do, and I wonder if this is a major source of confusion for clinicians (?) Specifically, he flags a distinction between placebo effect and placebo response (https://www.nature.com/articles/s41398-025-03263-0I). He seems to be equating a change from baseline with the term “response,” while at the same time saying that “response” does not necessarily have a causal connotation. This is problematic, since, for clinicians, the term “response” ALWAYS has a causal connotation (!) If you tell us that a patient has “responded” to treatment, we will infer that his treatment caused him to improve.

Dr. Senn’s writing highlights that differing interpretations of the word “response” between statisticians and clinicians might underlie several bad statistical practices involving clinicians, e.g., “number-needed-to-treat,” “responder analysis,” “randomized non-comparative trials.” Statisticians seem to conflate the term “response” with “change from baseline,” whereas clinicians conflate the term “response” with individual-level causality or “effect” (e.g., “placebo effect”). In other words, for statisticians “change from baseline” = “response” and for clinicians “response” = causality; this is why clinicians end up erroneously conflating changes from baseline with the notion of individual-level causality.

The paper you linked to shows how interprofessional differences in interpretation of the word “response” can lead to badness. Heart failure is a condition for which signs and symptoms tend to wax and wane over time, sometimes from week to week or day to day, often in the absence of any changes to a patient’s medication regimen. Therefore, it wouldn’t surprise too many physicians if a patient’s score on a clinical questionnaire were to change from one day to the next, even if none of his medications had been changed. In other words, heart failure is an example of a condition for which treatment dechallenge/rechallenge (via an N of 1 or crossover trial) would be needed in order to validly infer causality at the level of an individual patient.

When discussing changes from baseline in any context, it would probably be best for statisticians to avoid labelling such changes “responses,” EXCEPT in the very specific clinical scenarios, listed in post #28 of this thread, for which we have a clinical basis for inferring individual causality even though we have not observed a positive dechallenge or rechallenge (e.g., a malignant tumour shrinking with treatment exposure). For diseases with waxing/waning natural histories (e.g., CHF), observing a “change from baseline” in a patient’s clinical score during a single period of exposure in an RCT will be insufficient for us to infer that the exposure caused the change. But if we persist in labeling such changes from baseline “responses,” and patients who show these changes “responders,” many clinicians will invalidly infer, based on a single period of observation, that the exposure caused the change in that particular patient (even though the requirement for positive dechallenge/rechallenge has not been fulfilled).

The longer this thread goes on, the more it seems that the terms “responder” and “response” should be abolished entirely from statistical lingo.

1 Like

I totally agree that the term ‘responder’ is not a good description as used in the paper. The course of heart failure is volatile enough that 5-point changes in the score are seen as part of the typical natural history of the disease (“response” rates in the placebo arm range from roughly 30% to 50% per Table 1). The use of “change from baseline” is of course another issue.

I do think, as @EvZ said (and as you probably agree), you can infer the existence of treatment effect heterogeneity if there is a sizeable difference in the proportion of patients who experience unusually large improvements despite a modest average difference between the groups.

In the case of the paper I cited, it appears this need not be the case. The results are completely compatible with all participants deriving the same small improvement. One of the figures in the paper contrasts a seemingly small mean difference of 2 points with a seemingly large difference of about 5% in the proportion of “responders”.

Assuming the change-from-baseline treatment effect is a small 2-point change for all participants, and assuming that the standard deviation of change-from-baseline is 15 (representing the waxing and waning of the heart failure’s natural history; similar to the value reported here),

you can expect differences similar to those reported by the authors (code to simulate this is below for those interested; the figure follows):

n <- 10000
set.seed(1)
# Treatment arm: constant 2-point improvement, SD 15 reflecting natural variability
d1 <- rnorm(n = n, mean = 2, sd = 15)
# Placebo arm: the same change scores minus the constant 2-point treatment effect
d2 <- d1 - 2
min_diff <- 5  # "responder" threshold: change of more than 5 points
prop1 <- length(d1[d1 > min_diff]) / length(d1)
prop2 <- length(d2[d2 > min_diff]) / length(d2)
prop_diff <- prop1 - prop2
prop_diff
1 / prop_diff  # implied NNT

dat <- data.frame(change_score = c(d1, d2),
                  arm = rep(c("treatment", "placebo"), each = n))

library(ggplot2)
ggplot(data = dat,
       aes(y = change_score,
           x = arm,
           color = arm)) +
  geom_point() +
  geom_hline(yintercept = min_diff) +
  annotate(x = 1.5, y = 30,
           geom = "text", fontface = "bold",
           label = paste0("Mean difference: ", mean(d1 - d2), " points\n",
                          "Proportion of responders: ", round(100 * prop1), "% vs ", round(100 * prop2), "%\n",
                          "Difference: ", round(prop_diff * 100), "%; NNT: ", round(1 / prop_diff)))

[Figure: simulated change-from-baseline scores by arm, with the 5-point “responder” threshold marked and the mean difference, responder proportions, and NNT annotated]

I’m not necessarily saying NNT is a good measure; the purpose is just to show that the sorts of binary effect sizes reported by the authors are completely compatible with a minor and identical improvement in all patients.

On average, data are not close enough to Gaussian for this to hold, but even in the ideal Gaussian case the Wilcoxon-Mann-Whitney / proportional odds model test has such a tiny loss of efficiency (asymptotic relative efficiency 3/π ≈ 0.95 versus the t-test) that it’s worth it to have an insurance policy against non-normality.

1 Like

I’m not actually sure I agree that it’s advisable to speak of the “proportion of patients” in either arm who showed any particular change in any context. These types of analyses are still focused on “change from baseline,” leading insidiously to causal inferences at the level of individual patients, many of which will be invalid … I thought that pre-specified subgroup analysis is the preferred way to assess for possible heterogeneity of treatment effects (?)