Clinical Trial Outcomes Interrupted by Other Outcomes

This is a discussion of how best to analyze and to present clinical outcomes when an outcome of major interest can be interrupted by another outcome making the first outcome unobservable. The most common example is when the main outcome of interest is a non-fatal clinical event and the interrupting event is death. For the moment let’s assume that event timing can be ignored, i.e., we deal with short-term follow-up.

Those who are engaged in clinical trials or who are consumers of clinical trial results were asked via Twitter to fill out this brief 4-question survey. The last question in the survey is for any comments they cared to make, but it is better to put your comments here as a reply to this post. Here is the survey’s preface:

Suppose that in a high risk patient population a new treatment is intended to reduce the risk of stroke within 30 days of a certain intervention. Treatments are labeled A and B and 100 patients are randomized to each treatment. Stroke may be interrupted by all-cause death, and death may also occur at or after a stroke. So the count of the number of strokes is the number of patients having a stroke before or at the moment of death.

Suppose the trial results in the following event frequencies, and for the sake of discussion assume that the strokes that occur, while serious, did not result in disability severe enough to be deemed worse than death.

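| Label | Outcome | A | B | Composition |
| --- | --- | --- | --- | --- |
| a | stroke, alive | 11 | 15 | |
| b | stroke, fatal | 1 | 1 | |
| c | stroke, later death | 1 | 1 | |
| d | death, no stroke | 5 | 1 | |
| e | stroke | 13 | 17 | a + b + c |
| f | death | 7 | 3 | b + c + d |
| g | stroke or death | 18 | 18 | a + b + c + d |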

The survey asks you to select one of the above outcome measures for emphasis if you were forced to emphasize only one.
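
The composite rows e, f and g follow directly from the four mutually exclusive rows a to d. A minimal sketch of that arithmetic, using the counts from the table above (the variable and function names are only illustrative):

```python
# Counts for the four mutually exclusive event categories (rows a-d),
# stored as (treatment A, treatment B), out of 100 patients per arm.
counts = {
    "a": (11, 15),  # stroke, alive
    "b": (1, 1),    # stroke, fatal
    "c": (1, 1),    # stroke, later death
    "d": (5, 1),    # death, no stroke
}

def total(rows):
    """Sum the per-arm counts over the given row labels."""
    return tuple(sum(counts[r][arm] for r in rows) for arm in (0, 1))

stroke = total("abc")            # row e
death = total("bcd")             # row f
stroke_or_death = total("abcd")  # row g

print("e stroke:", stroke)
print("f death:", death)
print("g stroke or death:", stroke_or_death)
```

The composite g is identical in the two arms (18 vs 18) even though its components differ: more deaths under A, more nonfatal strokes under B.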

Survey Results

Survey responses were tabulated on 2021-06-09. Survey respondents provided a number of valuable comments, found here. The results on the multiple choice questions are as follows.

193 survey responders

| Choice | Frequency |
| --- | --- |
| a stroke, alive | 10 |
| b stroke, fatal | 2 |
| c stroke, later death | 0 |
| d death, no stroke | 2 |
| e stroke | 44 |
| f death | 19 |
| g stroke or death | 116 |

| Choice | Frequency |
| --- | --- |
| clinical trialist | 40 |
| clinician not engaged in clinical trial design | 37 |
| statistician | 79 |
| epidemiologist | 17 |
| other | 20 |

| Choice | Frequency |
| --- | --- |
| A traditional comparison of two proportions where death and stroke are considered equally bad | 52 |
| An ordinal analysis where death is counted as worse than a nonfatal stroke, without assuming how much worse | 141 |

| | clinical trialist | clinician not engaged in clinical trial design | statistician | epidemiologist | other |
| --- | --- | --- | --- | --- | --- |
| a stroke, alive | 2 | 4 | 2 | 0 | 2 |
| b stroke, fatal | 0 | 2 | 0 | 0 | 0 |
| c stroke, later death | 0 | 0 | 0 | 0 | 0 |
| d death, no stroke | 0 | 1 | 1 | 0 | 0 |
| e stroke | 6 | 6 | 22 | 5 | 5 |
| f death | 3 | 4 | 5 | 4 | 3 |
| g stroke or death | 29 | 20 | 49 | 8 | 10 |

| | clinical trialist | clinician not engaged in clinical trial design | statistician | epidemiologist | other |
| --- | --- | --- | --- | --- | --- |
| A traditional comparison of two proportions where death and stroke are considered equally bad | 14 | 6 | 22 | 4 | 6 |
| An ordinal analysis where death is counted as worse than a nonfatal stroke, without assuming how much worse | 26 | 31 | 57 | 13 | 14 |

Assessment of Survey Responses

Even though this informal survey undoubtedly had a biased sample, the results are still informative and interesting. I am glad that (1) the most frequent choice in every category of respondent was g: stroke or death, and (2) the most frequent choice of analysis strategy in every category of respondent was an ordinal analysis where death is counted as worse than a nonfatal stroke, without assuming how much worse. My take is that death and stroke are not possible to separate, so choices a-e lead to very hard-to-interpret statistical summaries. More generally, my opinion is that any outcome that is interrupted by another outcome, especially one that is worse (such as death), makes the original outcome target hard if not impossible to interpret. And any analysis that uses outcome-conditional probabilities (e.g., P(stroke if alive)) will be a challenge to interpret and will not respect intent to treat (ITT), since it conditions on a post-randomization outcome.

That being the case, it is advantageous to count the worse outcome as worse, here by treating the outcomes as ordinal levels A (no stroke), B (stroke), C (death) where death overrides stroke. Had the study been longitudinal with daily outcome assessments, death would not override an earlier stroke but would be counted as the worst outcome on the day of the death. A stream of outcomes for one patient might be, for example, A A A B C for 5 days.
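
To make the three-level ordinal coding concrete, here is a small sketch using the survey's example counts. The levels are coded 0, 1, 2 rather than A, B, C to avoid a clash with the treatment labels. The summary computed, the probability that a randomly chosen patient on treatment B has a worse outcome than a randomly chosen patient on treatment A with ties counted as one half (a Wilcoxon-type concordance probability), is just one rank-based way to compare ordinal outcomes without assigning distances between levels. It is an illustrative choice for this sketch, not necessarily the analysis the survey option had in mind.

```python
# Ordinal levels where death overrides stroke:
# 0 = no stroke, 1 = stroke (alive), 2 = death.
# Counts per level come from the example table (100 patients per arm):
# no stroke = 100 - g, stroke alive = row a, death = row f.
counts_a = [82, 11, 7]   # treatment A
counts_b = [82, 15, 3]   # treatment B

def concordance(first, second):
    """P(second arm worse than first arm) + 0.5 * P(tie), over all patient pairs."""
    worse = ties = 0
    for i, n_first in enumerate(first):
        for j, n_second in enumerate(second):
            if j > i:
                worse += n_first * n_second
            elif j == i:
                ties += n_first * n_second
    return (worse + 0.5 * ties) / (sum(first) * sum(second))

c_index = concordance(counts_a, counts_b)
print(round(c_index, 4))  # 0.4964; 0.5 would mean no ordinal difference
```

The binary composite sees no difference at all (18 vs 18), whereas the ordinal summary tilts very slightly in favor of B because B has fewer deaths. With these counts the tilt is tiny, which is itself informative.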

Even though death and stroke are impossible to separate, an ordinal or a polytomous (aka multinomial or unordered categorical) analysis would allow one to estimate unconditional probabilities respectful of ITT. For example one can estimate P(death), P(death or stroke), P(stroke and alive). Note that the last probability is smaller than P(stroke given alive).
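
As a rough illustration with the example counts, the unconditional quantities use everyone randomized as the denominator, while P(stroke given alive) uses only 30-day survivors. This sketch uses raw proportions rather than a fitted ordinal or polytomous model, and the dictionary keys are just labels chosen here:

```python
N_PER_ARM = 100

# Row a (stroke, alive) and row f (death) from the example table.
events = {
    "A": {"stroke_alive": 11, "death": 7},
    "B": {"stroke_alive": 15, "death": 3},
}

def summaries(stroke_alive, death, n):
    """Unconditional (ITT-style) proportions plus the survivor-conditional one."""
    return {
        "P(death)": death / n,
        "P(death or stroke)": (death + stroke_alive) / n,
        "P(stroke and alive)": stroke_alive / n,
        # Conditions on a post-randomization outcome (being alive).
        "P(stroke | alive)": stroke_alive / (n - death),
    }

results = {arm: summaries(e["stroke_alive"], e["death"], N_PER_ARM)
           for arm, e in events.items()}

for arm, s in results.items():
    print(arm, {name: round(p, 3) for name, p in s.items()})
```

The conditional probability has a denominator that itself depends on treatment (93 survivors under A, 97 under B), which is why it does not respect ITT.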

A Better Way

This all leads to a proposed better way to analyze the data and interpret the result. Make a daily assessment of an ordinal scale with, say, 10 levels of stroke severity, and death. The ordinal outcome on a given day would be 1-9 for a non-debilitating stroke, 10 for death, and 11 for a stroke requiring total medical care. An ordinal Markov state transition model can be used for the longitudinal analysis. The results can be stated for example in terms of the probability of being in outcome level y or worse on a given day, or the mean number of days in a range of outcomes. One can estimate the mean number of days with Y > 6 (which might mean moderately severe or severe stroke or death) or the mean number of days with Y > 6 and Y < 10 (moderately severe stroke but alive). Ordinal longitudinal models are detailed here.

As argued later in this post, state transition models yield quantities that are easier to interpret than competing risk analysis. They also elegantly handle recurrent events such as hospitalization.
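
A toy sketch of how transition probabilities turn into the summaries mentioned above. It assumes a much simplified setting: three states instead of the finer scale described earlier, a first-order Markov process, and made-up daily transition probabilities that do not change over time or with covariates. A real analysis would estimate these from an ordinal longitudinal model.

```python
STATES = ("no stroke", "stroke", "death")

# Hypothetical daily transition probabilities (illustration only):
# row = state yesterday, column = state today.
P = [
    [0.985, 0.010, 0.005],
    [0.050, 0.930, 0.020],
    [0.000, 0.000, 1.000],  # death is an absorbing state
]

def step(dist, matrix):
    """Advance the state distribution by one day."""
    k = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(k)) for j in range(k)]

def occupancy(initial, matrix, days):
    """State occupancy probabilities for days 1..days."""
    sops, dist = [], list(initial)
    for _ in range(days):
        dist = step(dist, matrix)
        sops.append(dist)
    return sops

# Everyone starts in "no stroke" at baseline (day 0).
sops = occupancy([1.0, 0.0, 0.0], P, days=30)

# Summing daily occupancy probabilities gives expected days in a set of states.
mean_days_stroke_or_death = sum(d[1] + d[2] for d in sops)
mean_days_stroke_alive = sum(d[1] for d in sops)

print("P(stroke or death) on day 30:", round(sops[-1][1] + sops[-1][2], 3))
print("mean days with stroke or death:", round(mean_days_stroke_or_death, 2))
print("mean days with stroke, alive:", round(mean_days_stroke_alive, 2))
```

Running the same calculation once per treatment arm, with arm-specific transition probabilities, would give treatment differences in daily occupancy probabilities or in mean days in a state.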


It would be interesting to ask the same question (re “an outcome of major interest can be interrupted by another outcome”) in 5 years. Much in stats is habit or “tradition”, and we are being moved towards estimands. I notice Mouna Akacha won the pharma/RSS award for her work on this. My answer was g and the traditional analysis.


I’m seeing both good and bad in the move towards ‘estimands’. The bad: some are thinking that an estimand is useful just because it is explicitly defined.

I just saw a preview of the psi conference this month:

PROGRAMME HIGHLIGHT Estimands, estimands, estimands…

Still trying to get to grips with ICH E9 (R1)? Believe you have a great understanding but want to learn more & see what other people are doing? Somewhere in between? On Wednesday we have a full stream focusing on estimands with three consecutive sessions:

  1. Celebrating 1-year of ICH E9(R1) Implementation, Wednesday 9:00-10:15
  2. “What?” Is the Question! Estimands: Case Studies & Implementation, Wednesday 11:00-12:30
  3. Estimands & Missing Data Imputation, Wednesday 14:00-15:30

looking forward to some discussion

I am a clinician with a small amount of experience in clinical trial design. I was wondering how the traditional vs ordinal method affects the sample size calculations of the study and whether the trade-off in terms of larger numbers justifies the additional information gained in the ordinal design?

Super question. Breaking ties by converting from a binary to an ordinal outcome reduces the needed sample size. See this.

Completed it and looking forward to the discussion; good opportunity to finally learn more about ordinal methods. Are there any examples of studies where the ordinal approach was used in a setting with these more or less competing outcomes?

But are there simulations showing how power is affected by floor or ceiling effects, clustering around a few points, etc.? I guess that in some cases (e.g., when an ordinal outcome is newly defined, such as the WHO COVID ordinal scale) these situations are not always easily anticipated.

Ordinal models handle floor, ceiling effects and bimodality plus anything else you can throw at them without losing power. Methods designed for smooth unimodal distributions will suffer. The key point though is how well distributed are the ordinal values, as the link above details.

In COVID-19 trials with the WHO ordinal scale, and also see this.

One of the areas where this sort of issue arises is in perinatal trials. Often interventions are given to women before birth, but some or all of the important outcomes are those of the babies in the neonatal period (and subsequently). But a proportion of babies will not survive to birth, and obviously there are no neonatal or childhood outcomes if there is death before birth.

I’ve been having discussions with clinicians involved in perinatal trials about how to deal with this (largely in the context of incorporating data on neonatal deaths (death in 28 days after birth I think) into systematic reviews). One thing that is commonly done is to use the number of live births as the denominator, which is in line with the clinically accepted definitions (and the official WHO definition).

The problem is (of course) that this makes the outcome (neonatal death) conditional on surviving to birth, which may (again of course) be affected by the intervention. The most obvious alternative is to use the number randomised to each group as the denominator – but that feels wrong to many people because there were some in each group who could never have had the outcome, and it seems to go against the accepted clinical definition.
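
A small numerical sketch of how the two denominators can disagree. All counts are invented for illustration, and the sketch assumes one baby per woman randomised:

```python
# Hypothetical trial, one baby per woman randomised (illustration only).
arms = {
    "treatment": {"randomised": 1000, "stillbirths": 50, "neonatal_deaths": 30},
    "control":   {"randomised": 1000, "stillbirths": 20, "neonatal_deaths": 30},
}

def risks(randomised, stillbirths, neonatal_deaths):
    live_births = randomised - stillbirths
    return {
        # Conventional definition: conditional on surviving to birth.
        "neonatal death per live birth": neonatal_deaths / live_births,
        # Denominator fixed at randomisation.
        "neonatal death per randomised": neonatal_deaths / randomised,
        # Death before or after birth, per randomised.
        "stillbirth or neonatal death per randomised":
            (stillbirths + neonatal_deaths) / randomised,
    }

results = {arm: risks(**data) for arm, data in arms.items()}

for arm, r in results.items():
    print(arm, {name: round(p, 4) for name, p in r.items()})
```

With these made-up numbers the per-randomised risk of neonatal death is identical in the two arms, the per-live-birth risk looks slightly worse under treatment only because its denominator shrank, and neither reveals the excess of stillbirths that the combined outcome shows.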

My question is really: what is the best thing to do here? (not limited to the two simple cases I have mentioned). I have some views but very interested in what others think.

When in doubt, think about the raw data elements. These elements are daily status assessments. This leads naturally to state transition models. Once you estimate all the state transition probabilities as a function of time and other variables you transform these estimates into state occupancy probabilities. The problem of choosing the primary estimand is still there, but everything is completely transparent and well defined. And from the model you can choose multiple estimands. For example suppose live neonatal outcomes are coded 0-100 with larger being worse and Y=101 indicating neonatal death. You can compute the probability of an outcome worse than level 50, $\Pr(Y > 50)$, which is the probability of death or a bad live outcome. Or you can compute $\Pr(Y > 50 \cap Y < 101)$, which is the probability of a bad outcome but still being alive.

By summing such probabilities over a fixed time interval such as 30d you get the expected time in selected states, which is a pretty good summary measure. It is easy to define state occupancy probabilities, and easy for readers of journal articles to interpret.

There is also a hidden advantage: missing assessment days for live babies represent interval censored values and are explicitly treated as such in the estimation process. So state transition models can handle gaps in data collection, something with which time to event analysis has difficulty. Time to event analysis can only handle censoring at the end of the observation sequence, i.e., follow-up stops and doesn’t restart.

Doesn’t that violate ITT? Satisfying ITT was the motivation for estimands, I’d think. Using the number randomised is just imputing as no event; I’d phrase it like this to make the imputation explicit.

Yes - of the “naive” options, retaining the number randomised as the denominator seems to me much the safer bet. That gives the risk of neonatal death among all those randomised, so from the point of view of when a woman received treatment, the risk of that particular outcome (regardless of any other outcomes that may occur).

Getting into who was “eligible” for an outcome seems to me potentially problematic; where participants are dead it’s clear they can’t have subsequent outcomes, but it isn’t always that clear cut.

Thanks - this seems a really sensible approach, but I don’t know if anyone has ever used it.

The state transition model I proposed does not have the problem you’ve described by the point at which state transition probabilities are converted to state occupancy probabilities (SOPs). At the first transition, from day 0 (baseline) to day 1 post randomization, the state transition probabilities are intent to treat (ITT) parameters. From then on, the transitions are conditional upon post-randomization events so are no longer ITT. But SOPs de-condition (marginalize) over these post-randomization conditions to arrive at unconditional (except for baseline covariates) probabilities. These unconditional SOPs are ITT. For example, the unconditional probability of being in states Y > 50 on day 3 is only dependent on baseline covariates, including treatment assignment.

Quantities derived from SOPs are also ITT, for example the mean number of days with Y > 50.

To make the argument more clear, consider the case where you have enough sample size that you don’t need to model the outcomes, and suppose there are no losses to follow-up before day 3 other than some deaths on day 2. These as absorbing states carry forward to day 3. Then you don’t need to bother with the intermediate state transition models, and the SOP estimate at day 3 is just the number of babies with Y > 50 on day 3 or who died before day 3, divided by the total number of babies.
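
The model-free estimate just described can be sketched in a few lines. The daily records below are invented, and the sketch assumes, as in the text, that the only missing assessments are those after a death:

```python
DEATH = 101  # neonatal death, coded as worse than any live outcome (0-100)

# Hypothetical daily outcomes for days 1-3; None marks days after death.
babies = [
    [10, 20, 30],
    [60, 55, 70],
    [40, DEATH, None],  # died on day 2
    [80, 60, 40],
    [5, 5, 5],
]

def carry_forward_death(days):
    """Treat death as absorbing: every day after death is also coded DEATH."""
    filled, dead = [], False
    for y in days:
        if dead:
            filled.append(DEATH)
        else:
            filled.append(y)
            dead = (y == DEATH)
    return filled

def sop_worse_than(records, day, cutoff):
    """Proportion of ALL babies with Y > cutoff on the given day (1-based)."""
    filled = [carry_forward_death(r) for r in records]
    return sum(r[day - 1] > cutoff for r in filled) / len(filled)

print(sop_worse_than(babies, day=3, cutoff=50))  # 2 of 5 babies -> 0.4
```

The denominator is every baby in the group, so the estimate depends on nothing that happened after randomization.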

There is an analogy to Cox proportional hazards survival modeling with time-dependent hazard ratios. If you use a time-dependent covariate to allow treatment to interact with time, i.e., you are allowing the effect of treatment to vary with time on the hazard ratio (HR) scale, the treatment HR at the first post-randomization time will be ITT. After that the varying HRs use smaller post-randomization risk sets and are not ITT. The HR under proportional hazards is an ITT parameter but under time-varying treatment effect is not ITT. But when you convert all these hazard ratios to a cumulative incidence function you de-condition on all those time-varying risk sets to arrive at an ITT estimate of the probability of dying by time t given only baseline covariates.
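
A discrete-time sketch of that de-conditioning step. Each daily hazard is a probability conditional on surviving to that day, and the cumulative incidence is one minus the product of the daily survival probabilities. The hazards and time-varying ratios are invented, and working in discrete time is a simplification of the continuous-time Cox setting:

```python
def cumulative_incidence(hazards):
    """Convert per-period conditional hazards into P(death by end of each period)."""
    surviving = 1.0
    cif = []
    for h in hazards:
        surviving *= (1.0 - h)   # still alive after this period
        cif.append(1.0 - surviving)
    return cif

# Hypothetical daily hazards of death in the control arm.
control_hazards = [0.02, 0.02, 0.02, 0.02]

# Hypothetical time-varying ratios of treated to control hazard.
hazard_ratios = [0.5, 0.7, 0.9, 1.1]
treated_hazards = [h * r for h, r in zip(control_hazards, hazard_ratios)]

control_cif = cumulative_incidence(control_hazards)
treated_cif = cumulative_incidence(treated_hazards)

for day, (c, t) in enumerate(zip(control_cif, treated_cif), start=1):
    print(day, round(c, 4), round(t, 4))
```

Each individual ratio after the first compares patients who survived to that day, but the two cumulative incidence curves compare the full randomized groups.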

In conclusion, quantities that are conditional only on baseline covariates are fine estimands. This includes cumulative incidence functions and state occupancy probabilities. No imputation is involved.

The top post has been updated to include survey results and discussion of them.

This is a valuable and very interesting discussion. Like @simongates, my focus is in perinatal/neonatal research. The issue of how certain outcomes are only possible when subjects are in certain “states” comes up all the time.

It would seem that the state transition model you proposed would be useful where the competing outcomes are both feasible at any given time after randomization. That is, if t=0 is enrollment, then at any t>0 there is some potential for transition either to stroke or to death, whichever comes first. (Maybe this would be best referred to as semi-competing outcomes because death precludes stroke but not the other way around.)

In our field, such a model may be able to handle a state transition for livebirth (vs remains in utero or stillborn) and then neonatal outcomes conditional upon livebirth (e.g., liveborn mortality or postnatal infection or similar). But there is another common case in our field in which, due to the relationship of time and developmental maturity, some outcomes cannot be ascertained until some certain survival time is achieved.

Consider, for example, that retinopathy of prematurity (ROP) is developmentally not detectable until around 31 weeks postmenstrual age but is affected by interventions near the time of birth. The baby born at 23 weeks postmenstrual age has 8 weeks of time where death is possible but ROP is not. Babies who die before 31 weeks postmenstrual age had no possibility of ROP because of the time-sensitive nature of the diagnosis. Other examples include measurements of neurologic/cognitive development (we can’t assess cognitive function particularly well before about 2 years, although it may be affected by interventions around the time of birth) and bronchopulmonary dysplasia (BPD, a measure of lung disease standardized to the near-term lung, measured at 36 weeks postmenstrual age).

See these major trials from our field:

  • Death or BPD (where BPD cannot be measured until survival to 36 weeks’ postmenstrual age and is measured only at that single time point)
  • Death or ROP (where ROP cannot be measured until survival to 31 weeks’ postmenstrual age and can occur at any subsequent time during the developmental period)
  • Survival with IQ≥85 at 6-7 y (where the intervention occurs during the first 3 days after birth but IQ can only be measured at school age; this is a special case as the primary outcome is actually the complement of “death or IQ <85”)

(You might also note that the non-death outcomes in these cases are dichotomizations of ordinal or continuous outcomes, most obviously with IQ. That’s an interesting issue, too, but, regardless, there is the fact that death is possible at any time during several years but IQ can only be measured around school age.)

The example of ROP and death above is a cautionary example about the use of composite outcomes in our field, because the intervention had opposite and meaningful impacts on ROP and on death, so that the effect on the primary outcome (ROP or death) approached null. However, no better solution than using simple analyses for composite outcomes (ROP or death) has taken hold.

I would be interested to know how others think we might improve the way that we handle this particular – and especially common – issue of competing outcomes where the probability of some states can only be greater than zero at some future survival time, requiring that there is a long period at risk for death but not the other outcome.

This is an extremely interesting scenario Matthew, and I hope that others will add some in-depth discussion of the design and its ramifications for analysis and interpretation. The following comment may be just semantics but will help me clarify the nature of the raw data. When ROP or IQ are not measurable at an early age, isn’t it the case that they may be present but are just not revealed to us? If that is the case then the raw data may be thought of as daily ordinal assessments that may be interval censored (not time-censored as we are used to, but ordinal category-censored). If Y=0-5 represents severity of ROP with Y=0 meaning ROP is completely absent, and Y=6 is neonatal death, then on early days we can observe Y=6 (in which case ROP is irrelevant) or Y<6, in which case we do not know whether Y=0, 1, 2, …, or 5. For those days the data would be analyzed as [0, 5], i.e., interval censored between 0 and 5 inclusive. Later as ROP is assessable, we would have actual 0-5 single number observations.
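
To show what category-censoring would mean for estimation, here is a sketch of the likelihood contribution of each daily record under the Y=0-6 coding above. An exactly observed level contributes its own probability; an interval-censored day contributes the sum of the probabilities over the interval. The probabilities are made up and, unrealistically, held the same for every day; a fitted model would supply day- and covariate-specific values.

```python
import math

# Hypothetical probabilities for Y = 0..5 (ROP severity) and Y = 6 (death).
probs = [0.50, 0.15, 0.10, 0.08, 0.05, 0.02, 0.10]

def contribution(p, lo, hi):
    """Likelihood contribution of observing only that lo <= Y <= hi."""
    return sum(p[lo:hi + 1])

# Each record is the interval [lo, hi] known to contain Y that day.
observations = [
    (0, 5),  # early day, baby alive, ROP not yet assessable
    (0, 5),  # another early day
    (2, 2),  # later day, ROP graded exactly as 2
    (6, 6),  # death, observed exactly
]

log_likelihood = sum(math.log(contribution(probs, lo, hi))
                     for lo, hi in observations)

print(round(contribution(probs, 0, 5), 2))  # alive, severity unknown -> 0.9
print(round(log_likelihood, 3))
```

An early day with an alive baby therefore still carries information (the baby is not dead) without pretending to know the ROP grade.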

For IQ you could have an early, inaccurate IQ test and ultimately a later accurate one. Over time you would then have interval censoring where the intervals ultimately narrow to a single point.

Does this make any sense? Does it help in thinking about the problem?

Thanks for the thoughtful response. The model is interesting and potentially workable.

Regarding your other question about this type of outcome where death is feasible at any t>0 but the other part of the composite is not feasible until some much later time:
> When ROP or IQ are not measureable at an early age, isn’t it the case that they may be present but are just not revealed to us?
Early life development presents an interesting conundrum for outcome measurement. And I say this with the caveat that I am a neonatologist (intensive care) and that my colleagues in developmental physiology or psychology can probably give a more precise answer to this. But in both of the outcomes above, the right model is probably not to think of the outcome as present earlier but undetectable – it is not obvious that a better measure would necessarily pick up the outcome earlier. Rather, (as every parent of small children observes) development unfolds over time and certain skills or physiological changes do not occur until certain biological maturity is reached. There is a complex sequence of events that has to occur in order to achieve these developmental changes. Early insults disrupt the potential for the normal development to occur.

If that seems abstract, consider a hypothetical here related to math performance. It is well known that math performance is worse in premature infants compared to their peers. If we were to ask if certain early postnatal interventions for preterm infants (e.g., oxygen exposure or certain medications) affected calculation, working memory, etc, those skills are not just not apparent until later in life, they probably aren’t developmentally feasible until later. So there is some natural lag whereby the effect of an early neonatal intervention in a high-risk population with high (competing) risk of death cannot be measured until some later period.
