The critical care community brings us a lot of interesting examples. We previously had a good thread about ANDROMEDA-SHOCK which saw much participation. In the past day or two, I’ve seen lots of Twitter discussion (and received a couple of direct requests) about the CITRIS-ALI trial, so I’d like to post a thread here that can serve as a living record of this discussion.
One disclaimer before we begin: I am not an intensivist, and therefore not fully versed in the degree of “biologic plausibility” here, so that will not be a principal focus of the discussion. Please note that I am not saying biologic plausibility for treatments is unimportant, but evaluating that is simply outside of my wheelhouse. I do see some discussion arguing that this entire trial (and entire line of research) is silly because the mechanism of action is not sufficiently understood to be tested in trials. Not my place to comment, and such discussion is likely best had elsewhere.
OK, now let’s proceed to discuss the CITRIS-ALI design, results, and controversy. There is a very detailed post at https://emcrit.org/pulmcrit/pulmcrit-citris-ali-can-a-secondary-endpoint-stage-a-coup-detat/ which I recommend reading for background on the patient inclusion criteria, treatments, as well as some context on why the trial has the endpoints that it does; I will summarize a few highlights here.
The trial enrolled patients with “ARDS due to sepsis” (the PulmCrit blog post has some good explanation of why that ended up being the trial population).
The trial was a placebo-controlled comparison testing IV vitamin C (at a very high dose) every six hours for up to four days.
The co-primary endpoints are “Change in SOFA score at 96 hours compared to baseline” and “levels of C-reactive protein and thrombomodulin.” This seems like an unusual choice of endpoint, but again the PulmCrit post has some explanation of why this occurred - the NIH evidently indicated a preference to use an endpoint other than mortality for the primary endpoint, and specified that there should be a clinical endpoint and a biomarker endpoint.
OK, that should be enough context that we can move on to the controversy surrounding the CITRIS-ALI primary results. The “primary” endpoints were negative - no difference in either the “change in SOFA score” or the biomarker endpoints. But there was a large observed mortality difference (even with the wide CIs one would expect in a trial that was designed as a sort of “Phase 2” study - the trial was not powered for a mortality difference) and even more importantly, the patients who died in the first four days were not included in the primary endpoint analyses (e.g. patients who died before 96 hours were not included in the “change in SOFA score” endpoint). Mortality at 4 days was about 23% in the placebo group and about 4% in the Vitamin C group - so a big chunk of the Placebo group is omitted from the primary endpoint - and that “big chunk” is missing for a highly informative reason.
Many commenters thus far have focused on a couple of things: mortality was not the pre-specified primary endpoint so it must only be considered hypothesis-generating; there were many secondary endpoints that were not significant, so we have to adjust for multiple comparisons when considering the mortality difference between groups; the confidence intervals are wide, and the trial was not really powered to test for mortality difference; things like that.
What this seems to overlook is that mortality is important here not just as an endpoint itself, but also because of its effect on the primary endpoint. If what I have read is correct, and I’ve seen this in several places, the patients who died before 96 hours were simply omitted from the primary SOFA endpoint analyses. Obviously that is a problem - early death is a highly informative form of missingness in this setting. If Vitamin C confers a significant benefit, but the patients on Placebo with the worst outcomes are omitted from the primary endpoint (an organ failure score at 96 hours) because they died before it could be ascertained, the comparison of SOFA scores will likely underestimate the true magnitude of the treatment effect, perhaps dramatically.
The SOFA scores went down in both the Placebo and the Vitamin C arms, but if the analysis only includes the survivors, that isn’t very surprising at all. Patients who died early (most likely) would have had worsening (or at least not-improving) organ failure. Patients who survived the first few days, somewhat predictably, had a reduction in organ failure. With differential survival between the two groups, I don’t think a comparison of the SOFA scores at 96 hours only in the survivors has much value in estimating how effective Vitamin C is for these patients.
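To make the survivor-only problem concrete, here is a toy simulation. It is only a sketch under invented assumptions, not a reanalysis of CITRIS-ALI: each simulated patient has a latent 96-hour organ-failure score drawn from a normal distribution, the treatment shifts that score down by 2 points, and anyone whose latent score exceeds a made-up threshold dies before 96 hours and drops out of the analysis. The function name `simulate` and every number in it are hypothetical.

```python
import random
from statistics import mean


def simulate(n_per_arm=50_000, true_effect=-2.0, death_threshold=16.0, seed=1):
    """Toy model of informative missingness due to early death.

    All values are invented for illustration; none come from the trial.
    """
    rng = random.Random(seed)

    # Latent 96-hour organ-failure score for every randomized patient.
    placebo = [rng.gauss(12.0, 4.0) for _ in range(n_per_arm)]
    treated = [rng.gauss(12.0 + true_effect, 4.0) for _ in range(n_per_arm)]

    # Assumption: the sickest patients die before 96 hours, so their
    # score is never observed and they are dropped from the analysis.
    placebo_survivors = [s for s in placebo if s <= death_threshold]
    treated_survivors = [s for s in treated if s <= death_threshold]

    return {
        # Effect we would see if every patient's score could be measured.
        "all_patients_diff": mean(treated) - mean(placebo),
        # Effect seen when early deaths are simply omitted.
        "survivors_only_diff": mean(treated_survivors) - mean(placebo_survivors),
        "placebo_early_mortality": 1 - len(placebo_survivors) / n_per_arm,
        "treated_early_mortality": 1 - len(treated_survivors) / n_per_arm,
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup more placebo patients die early, and the survivors-only difference comes out closer to zero than the difference across all randomized patients, which is the direction of bias described above. How large that bias is in a real trial depends entirely on assumptions that this sketch just makes up.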
What to do in this situation is rather complex. There are a few analytic options that I’m aware of, such as the naive “assign them all the worst SOFA score and analyze the difference in means as planned” or using a composite rank-based endpoint (e.g. all patients that died in the first 4 days are assigned the worst possible outcome, perhaps assigning them a score that is 1 point higher than the worst possible SOFA score, and then comparing the ranks of the treated group versus the control group). I’m sure Frank and others may have additional suggestions. Conceptually, I do like the composite rank-based endpoint approach.
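Here is a minimal sketch of the composite rank-based idea, assuming the standard 0 to 24 SOFA range and using the 96-hour score rather than the change from baseline to keep things short. Early deaths get a score of 25 so they rank worse than any survivor, and the two arms are then compared with the Mann-Whitney "probability index" (the chance a randomly chosen treated patient does better than a randomly chosen control, counting ties as half). The helper names and the tiny dataset are made up purely for illustration; they are not trial data.

```python
from itertools import product

WORST_SOFA = 24                # maximum possible SOFA score
DEATH_SCORE = WORST_SOFA + 1   # early deaths rank worse than any survivor


def composite_scores(sofa_96h, died_before_96h):
    """Replace the (unobserved) score of each early death with DEATH_SCORE."""
    return [
        DEATH_SCORE if died else sofa
        for sofa, died in zip(sofa_96h, died_before_96h)
    ]


def prob_treated_better(treated, control):
    """P(treated < control) + 0.5 * P(tie), i.e. Mann-Whitney U / (n * m).

    Lower scores are better, so values above 0.5 favor the treated arm.
    """
    wins = 0.0
    for t, c in product(treated, control):
        if t < c:
            wins += 1.0
        elif t == c:
            wins += 0.5
    return wins / (len(treated) * len(control))


# Hypothetical data: None marks a score that was never observed (early death).
treated_sofa = [5, 6, 7, 8, 9, None]
treated_died = [False, False, False, False, False, True]
control_sofa = [5, 6, 7, 8, None, None]
control_died = [False, False, False, False, True, True]

# Survivors-only analysis: early deaths are simply dropped.
treated_surv = [s for s, d in zip(treated_sofa, treated_died) if not d]
control_surv = [s for s, d in zip(control_sofa, control_died) if not d]
survivors_only = prob_treated_better(treated_surv, control_surv)

# Composite analysis: early deaths are kept and ranked worst.
composite = prob_treated_better(
    composite_scores(treated_sofa, treated_died),
    composite_scores(control_sofa, control_died),
)

print(f"survivors only: {survivors_only:.3f}")
print(f"composite rank: {composite:.3f}")
```

With these invented numbers the survivors-only index falls below 0.5 (treated looks worse) while the composite index rises above 0.5, simply because the extra control-arm deaths are now counted as bad outcomes instead of disappearing. A real analysis would add a proper test and confidence interval on top of this, and the choice of how to rank deaths relative to each other is its own design decision.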
A few thoughts of mine that this brings up:
one, I have to say it, the PulmCrit post describing the history and origins of the trial highlights a painful challenge of funding trials through NIH mechanisms. I’m not really sure how much of the trial design was “what the trialists really thought would be the best population and endpoint” versus their efforts to respond to specific items in the RFA, or just reading tea leaves from prior grant comments to try to propose a trial that appeased all reviewers from the study section. Was “ARDS due to sepsis” really the population they wanted to test, or was it an effort to shoehorn this into an RFA targeted at ARDS when they’re more broadly interested in sepsis patients? Same issue for the endpoints: they had to pick a “clinical” and a “biomarker” endpoint, leaving mortality on the list of secondaries because it seems impossible NOT to examine mortality…now this also has me wondering if they wanted to consider mortality as part of the primary clinical endpoint (perhaps using one of the approaches described above) but were steered away from doing so because NIH feedback discouraged using mortality in any way as part of the primary endpoint. I do not know whether that is true, but it seems somewhat plausible to me: if the NIH had explicitly stated they did not want to see mortality as a primary endpoint, perhaps the investigators felt trapped and thought they had to simply omit the early deaths from their SOFA analyses?
two, it shows the tricky game we have to play dancing around the multiple comparisons and defining “primary” and “secondary” endpoints. Many folks have pointed out that the mortality result should be taken with caution (or dismissed outright) because it was just a secondary endpoint, there are many other endpoints that weren’t significant, etc. My issues with that are basically i) not all endpoints are of equal importance and ii) as Frank often likes to point out, multiplicity adjustments tend to discourage people from taking advantage of the rich data source they have available (if you have lots of rigorous, well-collected data from a trial, you should explore it to understand as much as possible about what happened). If the authors had just decided not to look at all of the other endpoints and focused on mortality, the result would be “significant” (ugh). I’m not advocating free-for-all analyses of infinitely many endpoints where you can declare victory any time you get one with a small p-value (I’ve actually written a paper about this before) but there is something disquieting to me about this business: whether a result for an important endpoint like mortality is considered “significant” or “not significant” depends partly on how many other outcomes the authors decided to consider.
three, this is basically the equivalent of a Phase 2 trial. My final takeaway is that it certainly seems worth following up with additional trials to see whether Vitamin C has a benefit (and apparently several are already underway, although I’m not sure what the subtle differences are in the inclusion criteria, timing of treatment, and endpoints). The CITRIS-ALI results seem to be interpreted this way by most of the reasonable middle ground, but there’s one extreme saying “Who cares, secondary endpoint, multiplicity, it’s a negative trial” and another extreme that may believe it would be unethical to conduct a further RCT because they consider CITRIS-ALI sufficient evidence that they no longer feel equipoise is present.
I don’t envy anyone trying to figure out what to make of this in clinical practice, but CITRIS-ALI seems to suggest at least the possibility of significant benefit (and concomitantly, it would have been quite unlikely to observe these data if Vitamin C is harmful).
I may edit this post if I spot an error (tried to type this up in a bit of a hurry) or someone provides new information that is worth putting in the first post so everyone sees it right away. Otherwise, I look forward to some discussion.