KM estimates for different stratas do not add up and it makes me crazy!

Uriah · September 24, 2025, 7:57am

I’ve been working with Kaplan–Meier (KM) survival estimates and ran into something that seems paradoxical at first:

If I split my sample into strata, compute KM survival separately in each stratum, and then try to combine those stratum-specific survivals I don’t get the same result as if I had computed the KM on the whole data.

My motivation is to extract distribution of TP-FN / FP-TN for time-to-event predictions and splitting them into bins for a given p.threshold / ppcr. It feels more natural to count real-positives / real-negatives and than add them up, but for each p.threshold the estimate of real-posiives is different real-positives / real-negatives within the bins!

Toy example

Stratum A: 2 subjects
Stratum B: 2 subjects

Events/censoring:

t=1: event in A
t=2: censor in B
t=3: event in B

4 individuals, 2 strata.

Wrong Calculation: Stratum-specific KMs

Stratum A: 1 event at t=1 with n=2 → survival drops to 1 - 1/2 = 0.5. No further events → S_A(3)=0.5.
Stratum B: censor at t=2 (no effect on survival), then 1 event at t=3 with n=1 → survival goes to 0. So S_B(3)=0.0.

Baseline-weighted average:
With baseline weights 0.5 each (2/4 per stratum):
0.5 * 0.5 + 0.5 * 0 = 0.25.

Correct Calculation: Pooled KM

Ignore strata and compute KM on all 4 subjects.

At t=1: 1 event / 4 at risk → decrement factor = 1 - 1/4 = 0.75. Survival = 0.75.
At t=3: 1 event / 2 at risk → decrement factor = 1 - 1/2 = 0.5. Survival = 0.75 * 0.5 = 0.375.

So S_pooled(3) = 0.375.

Weighted average of stratum KMs at t=3: 0.25
Pooled KM at t=3: 0.375

If I understand correctly it has something to do with the way KM borrows information about censoring: The information about censored observations only come into play when a new event is introduced.

Any thoughts?

f2harrell · September 24, 2025, 12:17pm

I don’t have any thoughts about the main question, but two observations: First the binning strategy you are ultimately using sounds problematic, as are the notions of false positives etc. Second, if the stratification factors are related to the outcome, the unstratified KM estimates are not valid and will not estimate anything you can count on. KM assumes a single homogeneous distribution.

Uriah · September 24, 2025, 12:51pm

I’m trying to visualize TP / FP / TN / FN within each bin in the histogram of the predictions, which is fairly straightforward without censored observations: You can just count them, add them up, and play interactively with the cutoff that implies decision boundary. If you want the ratio between true-positives and predicted-positives (as a necessary component of NB) you can visualize without messing the hetereoginity between the bins: for p.threshold (0.9, 1] you expect higher ratio of TP / (TP+FP) than the ration in (0.2, 0.3] while the PPV is nothing but the weighted average of ratio in each bin. Once I use KM for censored observations I must enforce homoeginity between all the bins that are enforced to be predicted-positives, so in a way it’s kind of an interpretability tradeoff.

Let’s say you have nothing but the observed outcomes and observed time-to-events, don’t you want some kind of baseline model as a starting point?

For NB the KM is identical to the intercept, i.e the expected benefit for treatl-all strategy if no-treatment-harm is present compares to the treat-none strategy.

Pavlos_Msaouel · September 24, 2025, 7:31pm

Yup. You’ve bumped into a classic: the Kaplan–Meier (KM) estimator is non-collapsible. A baseline-weighted average of stratum-specific survivals will not equal the pooled KM, except in special (rare) cases. The reason is that KM is a product of time-local risk‐set fractions, and those risk sets change differently by stratum as censoring and events accumulate. Linear averaging can’t reconstruct that product.

For clinicians we provide an oncology example of this in Section 2 and Appendix A here. In the methodology literature, great references here and here. It is also at the heart of this very vigorous DataMethods discussion.

This mathematical artifact whereby the whole is not the sum of its parts (or what we think are its parts) vexes our intuition because such strong emergent properties are typically found in magic but not observed in the scientific world. What we do know about the laws of nature to date is consistent with weakly emergent phenomena (like conductivity or temperature), but the whole is always the sum of its parts. Hence why we need to be so very extra careful when looking at forest plots etc on the hazards ratio scale trying to fish for subgroup effects.

s_doi · September 24, 2025, 8:12pm

As aptly stated by Kuha and Mills, noncollapsibility is often presented as a problem or a limitation of the model, which is again thought to compromise group comparisons with such models. We argue that it is not and does not. This is because the models across strata (say male/female) are estimating different true effects. Men, women, and both together are three different groups. What we now know is that even if a noncollapsible effect is the same among both of the genders separately, it still does not need to be the same in the combined group. This conclusion does not pose any new problems for the estimation of these effects. What it tells us is simply that even if there is no interaction, the effect that holds at each stratum separately cannot be estimated from an unstratified model but must be estimated from a model that includes the stratum variable as well.

Regarding fishing for subgroup effects - a single trial cannot guarantee that a subgroup difference is real and in most cases will simply be an artifact of the sample. Perhaps they should still be reported so that a meta-analyst can analyze if they are consistent across studies or not but certainly do not mean much in single trials.

Uriah · September 25, 2025, 5:27am

Thanks!
I’ll read and reread all the links, they are certainly helpful!

Just to make it clear, without censoring the KM/CIF/AJ estimates are in fact collapsible right?

f2harrell · September 25, 2025, 12:59pm

These measures are much more problematic than researchers realize. See for example In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion – Statistical Thinking and other blog posts there.

Pavlos_Msaouel · September 25, 2025, 3:41pm

Yup. With no censoring (and no left truncation) up to time t, KM is collapsible under baseline weights. If there’s any censoring before t (or left truncation/delayed entry), collapsibility breaks under fixed baseline weights as you observed. One more very nice reference here (pointed in the past by @Sander) with discussions here, here, and here and rejoinder focusing on principled standardization to obtain valid marginal survival comparisons.

s_doi · September 26, 2025, 10:21am

I have pushed back on this in the thread noted by @Pavlos_Msaouel in his first post above. Just to reiterate:

We will focus on their paper where there is a “treatment” and a “control,” abbreviated as 𝑅𝑥 and 𝐶, respectively. Our discussion will be in the setting of a randomized controlled trial (RCT). They denote the third binary variable groups 𝑔+, 𝑔−, and their combination, all-comers, as {𝑔+, 𝑔−}. For the example in their Table 8, we have OR𝑔+ = (1∕3) / (1∕9) = 3, OR𝑔− = (3∕1) / (1∕1) = 3, OR{𝑔+,𝑔−} = (1∕1) / (3∕7) = 21 3 , which they claim is illogical because suppose clinically meaningful efficacy is defined as OR > 2.5, for example, then 𝑅𝑥 is efficacious in both 𝑔+ and in 𝑔− relative to 𝐶, but lacks efficacy in {𝑔+, 𝑔−}. They therefore recommend efficacy of a medicine not be measured by OR when there are strongly prognostic subgroups. For the example in Table 8, with the OR for 𝑔+ and 𝑔− being common (i.e., the third variable appears to be purely prognostic), they say mixing appears to dilute the OR and they say this is no accident.

What’s the problem? Well here it is:

If your goal is policy or reimbursement, population-average (RR/RD) is natural.

If your goal is patient-level treatment decisions, the OR arguably gets you closer to the individual causal effect — because it isolates the “strength of the treatment” from the patient’s starting point.

This is exactly why I feel that the all-comers RR/RD are “wrong”: they describe no one. They are a synthetic mixture that can’t be applied back to any single patient, while the OR gives you a portable multiplicative effect. For individual-level decisions, the OR has an intuitive appeal as the closest empiric estimate of individual treatment effect. The cost is that when you aggregate, the all-comers OR doesn’t correspond to any population mixture. The field is split by authors such as these who conflate “population-interpretable” effects with “portable, individual-interpretable” effects and refuse to make this distinction.

NB: This will be my last post on this topic to avoid the thread spiraling off topic

f2harrell · September 26, 2025, 12:33pm

Population averaged measures are only applicable if the policy is applied at the population level such as closing a border during an epidemic. Even though, person-specific estimates will be pertinent so that a group decision is not based on a Simpson’s paradox-like situation.

Uriah · September 29, 2025, 8:16pm

@Pavlos_Msaouel:

I think appendix c in Correct and logical causal inference for binary and time-to-event outcomes in randomized controlled trials explains well this issue for the KM estimate.

This leads me to another thought - doesn’t KM underestimates the true incidence-rate once the last observation on the data set is being censored?

I agree, that’s exactly why I think it is so important to understand how different these estimates (PPV and NPV) act for different cutoffs due to their cumulative nature: Surely for lower probability thresholds which are more common in healthcare.

It can be shown with a histogram that displays the distribution within each bin of the prediction in a stacked manner (i.e “real-negatives” and “real-positives”). Problem is, these distributions won’t collapse into the KM estimate for the whole predicted-positives subpopulation.

Pavlos_Msaouel · September 29, 2025, 8:58pm

Depends on what we are estimating and where we stop. KM is only defined up to the last event time. If the largest observed time is a censoring time (i.e., no events occur at the max follow-up), the KM curve is typically kept flat after the last event. Thus, if we report risk at that maximum time, we will indeed underestimate the true risk at that horizon (there could be events between the last event and the last censoring, but KM cannot “see” them). So the underestimation is a by-product of extrapolating from a flat tail.

As we noted in Section 10 here, this is why it has been proposed to avoid presenting KM curves after the time point where only around 10% to 20% of patients remain at risk of the failure event.

f2harrell · September 30, 2025, 12:27pm

As an aside I think it’s slightly better practice to report confidence bands without having a first “stop reporting” recommendation. And keep in mind that K-M only applies to homogeneous samples. That rules out 0.99 of the applications of K-M.

Pavlos_Msaouel · September 30, 2025, 12:55pm

Yup. Especially the comparative half-width confidence bands you include in the rms package and taught us to use are particularly useful as shown in Figure 1 here.

Uriah · October 13, 2025, 12:06pm

Kind of related:

Are pseudo-observations collapsible?

@Pavlos_Msaouel / @f2harrell are you experienced with mult-state calibration plots with censored observations?

I’m not sure if there’s a fancy justification for it, but it’s kind of counter-intuitive for me to trust estimates that do not add up to a weighted average of the whole population estimate.

I like discrete versions of Calibration Plots, it is straightforward just to compare counts and averaged-probabilities. I guess it gets complicated for censored-observations…

Survival:

Multi-States:

Pavlos_Msaouel · October 13, 2025, 2:29pm

Good question. No experience at all from my end.