CITRIS-ALI Results & Discussion Thread

The critical care community brings us a lot of interesting examples. We previously had a good thread about ANDROMEDA-SHOCK which saw much participation. In the past day or two, I’ve seen lots of Twitter discussion (and received a couple of direct requests) about the CITRIS-ALI trial, so I’d like to post a thread here that can serve as a living record of this discussion.

One disclaimer before we begin: I am not an intensivist, and therefore not fully versed in the degree of “biologic plausibility” here, so that will not be a principal focus of the discussion. Please note that I am not saying biologic plausibility for treatments is unimportant; evaluating it is simply outside my wheelhouse. I do see some discussion suggesting that this entire trial (and entire line of research) is silly because the mechanism of action is not sufficiently understood to be tested in trials. It’s not my place to comment, and such discussion is likely best had elsewhere.

OK, now let’s proceed to discuss the CITRIS-ALI design, results, and controversy. There is a very detailed post at https://emcrit.org/pulmcrit/pulmcrit-citris-ali-can-a-secondary-endpoint-stage-a-coup-detat/ which I recommend reading for background on the patient inclusion criteria, treatments, as well as some context on why the trial has the endpoints that it does; I will summarize a few highlights here.

  • trial enrolled patients with “ARDS due to sepsis” (PulmCrit blog post has some good explanation on why that ended up being the trial population)

  • trial was placebo-controlled comparison testing IV vitamin C (at a very high dose) every six hours for up to four days

  • the co-primary endpoints were “Change in SOFA score at 96 hours compared to baseline” and “levels of C-reactive protein and thrombomodulin.” This seems like an unusual choice of endpoint, but again the PulmCrit post has some explanation of why this occurred - the NIH evidently indicated a preference against mortality as the primary endpoint, and specified that there should be both a clinical endpoint and a biomarker endpoint.

OK, that should be enough context that we can move on to the controversy surrounding the CITRIS-ALI primary results. The “primary” endpoints were negative - no difference in either the “change in SOFA score” or the biomarker endpoints. But there was a large observed mortality difference (even with the wide CIs one would expect in a trial that was designed as a sort of “Phase 2” study - the trial was not powered for a mortality difference) and, even more importantly, the patients who died in the first four days were not included in the primary endpoint analyses (e.g. patients that died before 96 hours were not included in the “change in SOFA score” endpoint). Mortality at 4 days was about 23% in the placebo group and about 4% in the Vitamin C group - so a big chunk of the Placebo group is omitted from the primary endpoint, and that “big chunk” is missing for a highly informative reason.

Many commenters thus far have focused on a couple of things: mortality was not the pre-specified primary endpoint so it must only be considered hypothesis-generating; there were many secondary endpoints that were not significant, so we have to adjust for multiple comparisons when considering the mortality difference between groups; the confidence intervals are wide, and the trial was not really powered to test for mortality difference; things like that.

What this seems to overlook is that mortality is important here not just as an endpoint in itself, but also because of its effect on the primary endpoint. If what I have read is correct, and I’ve seen this in several places, the patients who died before 96 hours were simply omitted from the primary SOFA endpoint analyses. Obviously that is a problem - early death is a highly informative form of missingness in this setting. If Vitamin C confers a significant benefit, but the Placebo patients with the worst outcomes are omitted from the primary endpoint (an organ failure score at 96 hours) because they died before it could be ascertained, the comparison of SOFA scores will dramatically underestimate the true magnitude of the treatment effect.

The SOFA scores went down in both the Placebo and the Vitamin C arms, but if the analysis only includes the survivors, that isn’t very surprising at all. Patients that died early (most likely) had worsening (or at least not-improving) organ failure. Patients that survived the first few days, somewhat predictably, had a reduction in organ failure. With differential survival between the two groups, I don’t think a comparison of the SOFA scores at 96 hours only in the survivors has much value in estimating how effective Vitamin C is for these patients.
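To make that concrete, here’s a quick toy simulation. Every number in it is invented; I’ve only sized it roughly like the trial’s ~84 patients per arm. The treatment effect is placed entirely on early mortality, and survivors improve identically in both arms - yet the survivor-only comparison of 96-hour SOFA looks null, while scoring the early deaths as worst-possible does not:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 84  # roughly the per-arm size; all other numbers here are invented

def simulate_arm(mortality_96h, n):
    """One arm: 96-h SOFA and early-death indicator.
    Deaths before 96 h follow worsening organ failure; survivors improve."""
    baseline = rng.integers(5, 13, size=n)
    change = rng.normal(-3, 2, size=n)                 # survivors improve on average
    died = rng.random(n) < mortality_96h
    change[died] = rng.normal(+3, 2, size=died.sum())  # the dying were worsening
    sofa_96h = np.clip(baseline + change, 0, 24)
    return sofa_96h, died

placebo_sofa, placebo_died = simulate_arm(0.23, n)
vitc_sofa, vitc_died = simulate_arm(0.04, n)

# Survivor-only comparison (mimicking omission of the early deaths):
print("survivors only:",
      placebo_sofa[~placebo_died].mean().round(1), "vs",
      vitc_sofa[~vitc_died].mean().round(1))

# Same data, but early deaths assigned the worst possible score (24):
placebo_all = np.where(placebo_died, 24, placebo_sofa)
vitc_all = np.where(vitc_died, 24, vitc_sofa)
print("deaths scored worst:",
      placebo_all.mean().round(1), "vs", vitc_all.mean().round(1))
```

Exact numbers depend on the seed, but in runs of this toy the survivor-only means sit close together even though roughly one Placebo patient in five has died, while the deaths-scored-worst version recovers a clear separation.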

What to do in this situation is rather complex. There are a few analytic options that I’m aware of, such as the naive “assign all the early deaths the worst SOFA score and analyze the difference in means as planned,” or using a composite rank-based endpoint (e.g. all patients that died in the first 4 days are assigned the worst possible outcome, perhaps a score that is 1 point higher than the worst possible SOFA score, and then the ranks of the treated group are compared against the control group). I’m sure Frank and others may have additional suggestions. Conceptually, I do like the composite rank-based endpoint approach.
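To show what I mean by the rank-based option, here is a minimal sketch (with invented data, not the trial’s): early deaths get a value one point worse than the worst possible SOFA score, so every death ranks below every survivor, and the arms are then compared with a Mann-Whitney (Wilcoxon rank-sum) test:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def composite_outcome(sofa_96h, died_by_96h, worst_sofa=24):
    """Rank-based composite: early deaths are assigned worst_sofa + 1,
    one point worse than the worst attainable SOFA score, so every death
    ranks below every survivor. Higher = worse."""
    out = np.asarray(sofa_96h, dtype=float).copy()
    out[np.asarray(died_by_96h)] = worst_sofa + 1
    return out

# Invented illustration data (not the CITRIS-ALI results):
placebo = composite_outcome(rng.integers(2, 15, 84), rng.random(84) < 0.23)
vitc    = composite_outcome(rng.integers(2, 15, 84), rng.random(84) < 0.04)

# Mann-Whitney U compares the full rank distributions, deaths included.
u, p = mannwhitneyu(placebo, vitc, alternative="two-sided")
print(f"U = {u:.0f}, p = {p:.3f}")
```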

A few thoughts of mine that this brings up:

  • one, I have to say it, the PulmCrit post describing the history and origins of the trial highlights a painful challenge of funding trials through NIH mechanisms. I’m not really sure how much of the trial design was “what the trialists really thought would be the best population and endpoint” versus their efforts to respond to specific items in the RFA, or just reading tea leaves from prior grant comments to try and propose a trial that appeased all reviewers from the study section. Was “ARDS due to sepsis” really the population they wanted to test, or was it an effort to shoehorn this into an RFA targeted at ARDS when they’re more broadly interested in sepsis patients? Same issue for the endpoints: they had to pick a “clinical” and a “biomarker” endpoint, leaving mortality on the list of secondaries because it seems impossible NOT to examine mortality…this also has me wondering whether they wanted to consider mortality as part of the primary clinical endpoint (perhaps using one of the approaches described above) but were discouraged from doing so because the NIH did not want mortality used in any way as part of the primary endpoint. I do not know whether that is true, but it seems somewhat plausible to me: if the NIH had explicitly stated they did not want to see mortality as a primary endpoint, perhaps the investigators felt trapped and that they had to simply omit the early deaths from their SOFA analyses.

  • two, it shows the tricky game we have to play dancing around multiple comparisons and defining “primary” and “secondary” endpoints. Many folks have pointed out that the mortality result should be taken with caution (or dismissed outright) because it was just a secondary endpoint, there are many other endpoints that weren’t significant, etc. My issues with that are basically i) not all endpoints are of equal importance and ii) as Frank often likes to point out, multiplicity adjustments tend to discourage people from taking advantage of the rich data source they have available (if you have lots of rigorous, well-collected data from a trial, we should explore it to understand as much as possible about what happened). If the authors had just decided not to look at all of the other endpoints and focused on mortality, the result would be “significant” (ugh). I’m not advocating free-for-all analyses of infinitely many endpoints where you can declare victory any time you get one with a small p-value (I’ve actually written a paper about this before), but there is something disquieting to me about this business: whether a result for an important endpoint like mortality is considered “significant” or “not significant” depends partly on how many other outcomes the authors decided to consider.

  • three, this is basically the equivalent of a Phase 2 trial. My final takeaway is that it certainly seems worth following up with additional trials to see whether Vitamin C has a benefit (and apparently several are already underway, although I’m not sure what the subtle differences are in the inclusion criteria, timing of treatment, and endpoints). The CITRIS-ALI results seem to be interpreted this way by most of the reasonable middle ground, but there’s one extreme saying “Who cares, secondary endpoint, multiplicity, it’s a negative trial” and another extreme that may believe it would be unethical to conduct a further RCT because they consider CITRIS-ALI sufficient evidence that they no longer feel equipoise is present.

I don’t envy anyone trying to figure out what to make of this in clinical practice, but CITRIS-ALI seems to suggest at least the possibility of significant benefit (and concomitantly, it would have been quite unlikely to observe this data if Vitamin C is harmful).

I may edit this post if I spot an error (tried to type this up in a bit of a hurry) or someone provides new information that is worth putting in the first post so everyone sees it right away. Otherwise, I look forward to some discussion.

8 Likes

Thank you very much for this post.

I am an early career research clinician with limited statistical background. When I first read the report of the RCT I immediately thought of Confirmation bias (https://catalogofbias.org/biases/confirmation-bias/).

People seem too harsh in interpreting an important reduction in mortality (very low NNT), just because they refuse to believe there is “biologic plausibility” for the intervention.

Not pointing fingers, but the two most clinically relevant outcomes showed great improvement with the treatment, and they should have been the primary outcomes rather than the surrogate outcomes. I wonder what the conclusions and interpretation of these results would be if this were a trial of a new (expensive) intervention.

After reading the statistical commentary regarding the analysis of the primary outcomes, I am even more convinced of my hypothesis that disbelief in the intervention is affecting the analysis and interpretation of this RCT.

I am not advocating for or against vitamin C; this is still one RCT, and no one should jump to recommending it for clinical practice. But for me, evidence of effectiveness is worth more than biologic plausibility, and I believe these findings should be taken more “seriously” by clinicians.

Rafael Pacheco

3 Likes

Great points Andrew. Just to pick up on your second one a bit: yes, the rigid distinction into “primary” and “secondary” outcomes is pretty unhelpful. I assume it arose from concerns about multiple testing and the things you say about declaring a win if you find anything with p < 0.05. But in most cases there isn’t just one important outcome, and it doesn’t make sense to hang the whole trial result on just one.

The issue that this trial highlights has cropped up in quite a few critical care trials in the past, and I’m not sure whether any of them have used analytical methods that take the deaths into account adequately. There are several possible approaches - I also like the general composite ranking approach (there are quite a few ways it could be implemented).

2 Likes

What’s especially interesting about this is that (at least some of) the people who are trying to temper enthusiasm about the mortality difference (fine) also seem to be missing just how much that difference could influence the primary endpoint. I guess it’s sort-of-okay with me if you want to point out that mortality is “just” a secondary endpoint in this trial and should not be analyzed without accounting for the multiple comparisons, but it’s critical to note that the early mortality difference also affects the primary endpoint itself and may well explain the apparent “lack of benefit” on 96-hour SOFA score.

I would love to hear from the trialists, especially given some of the other content in the PulmCrit post about how this trial came to be. I really do wonder if the NIH’s expressed desire for a non-mortality endpoint (assuming that story is true) meant that the trialists were scared off even attempting the most rudimentary analyses, like “assign the worst SOFA score to the non-survivors” or the composite ranking approaches we are discussing here.

2 Likes

Thank you Andrew.

This discussion is important because it addresses more than the math aspect of statistics. The other aspect is the mathematical validity of the gold standard and the primary endpoints.

This trial is classic “threshold science,” which uses a one-size-fits-all guessed set of thresholds for the gold standard, but it is somewhat unique because it also uses a 1990s guessed set of thresholds (SOFA) as a primary outcome. A threshold science sandwich, if you will. The movement of endpoints from objective dichotomous states to less mathematically definable surrogates is increasingly popular. For this reason statisticians must prepare themselves.

With reference to mortality, a SOFA score of 2 is not equal to a SOFA score of 2. That may sound strange, but a little investigation will reveal its truth. We could generate mortality functions for each threshold cutoff of each of the 5 signals, and then for each combination of threshold cutoffs, but that would be chasing thresholds all the way down. It should be clear that this will produce marked anomalies and render studies nonreproducible.

I hate to say I told you so. No, maybe I’m OK with that, given the pushback. This has been my point in this datamethods discussion. Statisticians must investigate, study, and understand (mathematically) the gold standard, and determine the mathematical validity and reproducibility of the gold standards and, now it seems, the outcomes. It’s not enough to dutifully apply the math, because that simply mathematically embellishes guesses and will not render reproducible results.

I suspected that you would have an opinion on the use of SOFA as a primary endpoint, Lawrence.

I fully admit that I am a bit perplexed by some of the objections I read about the use of an objective endpoint like all-cause mortality (linked in the PulmCrit post was another post stating the author’s opinion on why mortality is not a good endpoint: https://emcrit.org/pulmcrit/mortality/). That particular PulmCrit post contains a lot of truth, but also some stuff that puzzles me a little bit as an objection to mortality as an endpoint. IMO, the best “objection” to mortality as an endpoint is when mortality is a much rarer event, or one that takes a relatively long time to occur, rendering the trial impractical unless they can enroll thousands of patients and follow them for 5 years. In the intensive care setting, neither of those is really true. I’m rather less convinced by “it’s hard to make a difference in mortality” because, well, that seems kind of like the point. As a patient, I probably don’t care that much if my SOFA score improved; I care whether I walked out of the hospital alive.

In the post above, I also note the vagaries of the NIH process, and it is possible that the trialists used this endpoint because they were told to use a “clinical endpoint” other than mortality, and this was the most palatable to the field / study section.

It’s perfectly fair to point out the limitations of SOFA. Given that this is the trial data in front of us, I am attempting to start a discussion about the results that we do have, and what they mean for now and for future trials. The mortality data are important, and they’re especially important when people point to the “negative” primary endpoint (flawed though SOFA might be), given the bias that differential mortality would introduce to that endpoint.

1 Like

Yes Andrew.
You are likely correct that they appear to have decided that mortality is a bridge too far.

My point was simply to say that statisticians can no longer cede the determination of the validity and reproducibility of the gold standard and outcome to the PI.

The statement that SOFA is “flawed as it may be” is standard political speak from threshold science. I’ve heard similar euphemistic phrases for 30 years, so I suspect that is where you heard it.

Let me give you an example.

Suppose you’re in the country and arrive at a wooden bridge with your family in a large SUV, and a woman stops your car and says the bridge may be flawed. You don’t say “flawed as it may be, I’m going to use it.” No, you investigate the bridge.

That’s all I am asking of statisticians.

Investigate the gold standard and outcome bridges before you cross them with your beautiful and brilliant statistical SUVs.

This is an interesting trial result and good fodder for discussion. I might be off base, but it strikes me that the key feature of this trial that creates headaches for interpretation (as was true for the ANDROMEDA-SHOCK trial), isn’t so much the definition of primary vs secondary endpoints (and the disagreement between them), but rather the small number of outcomes of interest that were observed (stemming from small trial size). Even if mortality had been defined as the primary endpoint of interest in this trial, I’m still not so sure that the findings would have been easy to interpret (see other thread on “Stability of results from small RCTs”).

My (very) cursory glance at the background to this trial suggests that any pre-trial attempt to estimate a plausible effect of Vitamin C on mortality might have been on pretty shaky grounds. If we couldn’t reliably define a “plausible” effect on mortality before running the trial, then wouldn’t it be difficult to know how to power it adequately to look at a between-arm mortality difference in any sort of reliable way?

A paper in this link from the “Stability of results from small RCTs” thread looked at how under-powered studies can provide misleading results:

http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf

“…if sample size is too small, in relation to the true effect size, then what appears to be a win (statistical significance) may really be a loss (in the form of a claim that does not replicate…

The interpretation of a statistically significant result can change drastically depending on the plausible size of the underlying effect…It is not sufficiently well understood that “significant” findings from studies that are underpowered (with respect to the true effect size) are likely to produce wrong answers, both in terms of the direction and magnitude of the effect…

…There is a common misconception that if you happen to obtain statistical significance with low power, then you have achieved a particularly impressive feat, obtaining scientific success under difficult conditions…In fact, statistically significant results in a noisy setting are highly likely to be in the wrong direction and invariably overestimate the absolute values of any actual effect sizes, often by a substantial factor…”

These inferential problems seem similar to those that can occur if clinical trials are stopped early for ostensible benefit:

https://ahajournals.org/doi/10.1161/CIRCHEARTFAILURE.111.965707

“How many events and what strength of evidence are enough to achieve adequate statistical certainty that stopping a trial early for benefit is appropriate? Numerous examples can be offered in which smaller studies have suggested large mortality benefits with significant probability values, but when larger studies were conducted, the mortality effect was either neutral or in some cases harmful….”

To my novice eye, it seems that inferential problems with ICU trials are similar to those seen when clinical trials are stopped early. But since outcomes of interest in ICU trials tend to occur over a short period of time, and extending trial duration to accrue additional outcomes of interest is probably not an option in this setting, the only possible ways to overcome the “scarce data” problem would be to: 1) enrol much larger numbers of patients (I suspect this is difficult in ICU settings); OR 2) use an approach (e.g., ?Bayesian) that more explicitly defines a “plausible” result prior to conducting the study, such that an unexpectedly large effect will only be considered sufficiently compelling to change practice in situations where an optimistic prior had been agreed; OR 3) lower the p-value required to declare a “win” if a trial is small (?)
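As a toy illustration of the quoted point (all parameters below are invented): simulate many trials of roughly this size with a true 5-percentage-point mortality reduction, and look at the estimates only in the runs that reach significance:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

n = 84              # per arm, roughly CITRIS-ALI-sized
p_control = 0.30    # invented control-arm mortality
true_diff = 0.05    # invented true absolute risk reduction
sims = 20_000
zcrit = norm.ppf(0.975)

est = np.empty(sims)
sig = np.empty(sims, dtype=bool)
for i in range(sims):
    c = rng.binomial(n, p_control) / n
    t = rng.binomial(n, p_control - true_diff) / n
    se = np.sqrt(c * (1 - c) / n + t * (1 - t) / n)
    est[i] = c - t
    sig[i] = se > 0 and abs(est[i] / se) > zcrit

print("power:", sig.mean().round(2))
print("mean estimate, all runs:          ", est.mean().round(3))
print("mean |estimate|, significant runs:", np.abs(est[sig]).mean().round(3))
print("significant runs, wrong direction:", (est[sig] < 0).mean().round(2))
```

In a setup like this the power comes out around 10-15%, and the significant runs overestimate the true 0.05 difference roughly two- to three-fold - the “Type M” exaggeration the excerpt above warns about.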

I’m going to help a little, because it has been clear that many have struggled with this. After all, this is SOFA, a storied organ failure score which correlates with mortality.

Let’s unblackbox a little of that mystique. Organ failure always precedes mortality, so correlation is not surprising, and so we ask: “what is organ failure?”

Well, one example in SOFA is a low platelet count. Now the megakaryocytic mass might be perceived as an organ, so let’s explore the behavior of the signal used by SOFA (the venous platelet count) from this organ in sepsis.

Firstly, the platelets may rise, fall, or stay the same until death, depending on the sepsis phenotype. But how does SOFA quantify this signal? By density threshold: 150 = 1, 100 = 2… you get the picture.

So when platelets fall, they fall with a negative slope - very negative in some, not so negative in others. Yet what was the phenotypic range from which this slope emerged? Why is that relevant? Well, with a high onset value we may have a negative slope and yet never reach a low threshold. Enough said.

OK, let’s do the Glasgow Coma Scale (GCS). Is this brain failure in the sedated? Perhaps not.

I could go on, but all of that is not necessary, as those mathematically inclined could simply calculate the number of potential threshold combinations which would render the EXACT same SOFA value of 8, and get the point.

Everything is flawed. I tell threshold scientists that acknowledgement that their guessed gold standards are flawed is a political statement which, like the SOFA score itself, is too vague for scientific discourse.

One note here, though this is admittedly somewhat speculative on my part: I don’t think the trial was designed with the intention of detecting a mortality difference. I believe this was somewhat like a “Phase 2” trial in drug development, with intent to lay groundwork for a larger confirmatory (Phase 3, pivotal, whatever word you prefer) trial, and the hope/expectation was that there would be some preliminary indication that Vitamin C was effective, enough to warrant a larger confirmatory study.

I just want to clarify that the “underpowered” nature of the trial for mortality outcomes was not some accident or error on the part of the study team, but was by design (much like “pilot studies” are not intended for assessment of outcomes, but to provide useful information to inform subsequent work). I assume that mortality was only on the list of secondary endpoints because it seems foolish not to at least report mortality.

The big mortality difference is a surprise, and leaves us in an awkward spot, because now the camps are probably divided into “Vitamin C works, we can’t randomize more patients” or “Meh, secondary endpoint, underpowered study, weak evidence for Vitamin C” when the correct answer from this is probably “get more confirmatory data, but it’s well worth studying further” (again, I’ve heard that there are other trials active in this space, not sure how similar/different they are to CITRIS-ALI in terms of patient population, endpoints, etc). It would not at all be surprising if the benefit of Vitamin C in a subsequent study is smaller than observed here, but still enough to be useful. So if a future trial is done and people see a smaller benefit than CITRIS-ALI, they should keep that in mind. It doesn’t HAVE to be this big of a benefit to be useful, and it doesn’t mean that CITRIS-ALI was “wrong” about the mortality benefit. As you’ve said, point estimates from small trials are inherently noisy. They can over- or under-estimate the true effect size. CITRIS-ALI doesn’t necessarily tell us what the mortality benefit IS, but it tells us that there’s the possibility of a very large one, all the way down to very little or none at all, and that it may be worth further study.

Hope that all makes sense!

Andrew, my name is Markos Kashiouris, intensivist at VCU and passionate about biostatistics (MPH from the Bloomberg School of Public Health).

Your comments are like water in the desert of deaf ears…

We are trapped by this, and very few seem to understand it. There are some subtle points that have not been mentioned:

  1. As you mention above, a lot of the data are not missing at random; the treatment influenced the missingness.

  2. HDIVC (high-dose intravenous vitamin C), in addition to a dramatic mortality reduction (at two days, 4 dead vs 19), had a dramatic influence on who got out of the ICU (graduation rate). Therefore, in the placebo arm a lot of sick patients died (thus the mean SOFA scores and all the biomarker levels were lower). What is not very clear is that there is a double hit in the HDIVC arm: in the trial design, SOFA and biomarkers were ONLY obtained in the ICU and not on the ward. People who got rapidly better went to the floor. So the HDIVC arm lost its “healthier” patients and retained the “extremely sick” who theoretically would have died. I believe this is called Neyman bias. (A toy simulation of this double selection appears after the smoking example below.)

It is very similar to a study of neonates which found that smoking mothers (treatment) had babies with higher birth weight than the non-smoking mothers. Doctors were confused and were thinking “should we recommend mothers to smoke?”

What really happened is that smoking dramatically increased very early spontaneous abortions (deaths) of sickly children who would have survived had the mothers not smoked.
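Here is a toy version of the double selection (all numbers invented; the death and discharge fractions loosely echo those mentioned in this thread): the placebo arm loses its sickest patients to early death, the HDIVC arm loses its healthiest to early ICU discharge, and the in-ICU comparison shrinks a real difference from both sides:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 84  # per arm; every number below is invented for illustration

# Latent severity at 96 h; lower = healthier
placebo = rng.normal(10, 4, n)
vitc = rng.normal(10, 4, n) - 2          # suppose treatment truly helps

# Placebo arm: the sickest ~23% die before 96 h and are dropped
placebo_in_icu = placebo[placebo < np.quantile(placebo, 0.77)]

# HDIVC arm: few (~4%) die, but the healthiest ~11% leave the ICU early,
# so their SOFA/biomarkers are never measured
vitc_alive = vitc[vitc < np.quantile(vitc, 0.96)]
vitc_in_icu = vitc_alive[vitc_alive > np.quantile(vitc_alive, 0.11)]

print("true means:     ", placebo.mean().round(1), vitc.mean().round(1))
print("measured-in-ICU:", placebo_in_icu.mean().round(1), vitc_in_icu.mean().round(1))
```

The in-ICU means end up much closer together than the true means: the placebo arm is truncated from the sick end, the HDIVC arm from the healthy end.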

Anyway, thank you for your bright comments, and I am very happy that there are open-minded people out there…

Markos Kashiouris
mkashiouris@vcu.edu
804-305-7187

3 Likes

Thanks. Indeed, it all makes sense. I realize the researchers weren’t trying to power the study for mortality, but as you say, the problem is what to do with a signal of mortality benefit when we see it. Even if mortality was a secondary endpoint and we didn’t really intend to look too closely at it, writing off Vitamin C after this result might not be a good idea (?) It’s hard to unsee this kind of finding, even if it’s not exactly clear how to interpret it in a trial this size. As you say, a bigger trial could be helpful (though I don’t know how big it would have to be…) Another MD online asked whether readers would want vitamin C if they developed sepsis, based on the results of this trial. I can see the appeal of jumping in and starting to give Vitamin C routinely in this setting, given the seriousness of sepsis and the few apparent downsides - but this is exactly how practices become entrenched in medicine and sometimes carried on for many years, only to be ultimately reversed after someone finally does the bigger trial.

Do we know if the null effect on change in SOFA score and the biomarkers would have drowned out the mortality difference in a global rank composite? It seems likely, although I like the composite idea for handling missing biomarker data due to the early deaths. I’m not sure, though, what the regulators think about mixing disparate outcomes like this.

I would love to dig into this. (I could use James Heathers’ SPRITE tool to generate some possible distributions of the SOFA data based on the mean & SD, then add the number of deaths in each arm. Someone else pointed out a group in the opposite direction: a small number of patients were discharged from the ICU before 96 hours - 9 in the vitamin C group, 1 in the placebo group - so I’d put deaths at the bottom as the worst possible outcome, then the patients with SOFA scores, then “discharged alive before 96 hours” as the best outcome; a rough sketch is below.) I won’t be able to do it right away, things to do today, but I may be able to do it next week.
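For anyone who wants to play along before I get to it, here is a rough sketch of the plan. The random-search routine is a crude stand-in for SPRITE (the real tool is more careful), and the SOFA means/SDs are invented placeholders; only the death counts (19 vs 4) and early-discharge counts (1 vs 9) come from the numbers discussed in this thread:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)

def crude_sprite(n, target_mean, target_sd, lo=0, hi=24, iters=20000):
    """Very crude stand-in for SPRITE: random search for an integer
    sample on [lo, hi] approximately matching a reported mean and SD."""
    x = rng.integers(lo, hi + 1, size=n)
    for _ in range(iters):
        y = x.copy()
        y[rng.integers(n)] = rng.integers(lo, hi + 1)
        if (abs(y.mean() - target_mean) + abs(y.std(ddof=1) - target_sd)
                <= abs(x.mean() - target_mean) + abs(x.std(ddof=1) - target_sd)):
            x = y
    return x

def three_tier(sofa, n_died, n_discharged):
    """Deaths rank worst (scored 25), early ICU discharge ranks best (-1)."""
    return np.concatenate([np.full(n_died, 25), sofa, np.full(n_discharged, -1)])

# Means/SDs below are invented placeholders, NOT the trial's reported values:
placebo = three_tier(crude_sprite(64, 6.5, 3.0), n_died=19, n_discharged=1)
vitc    = three_tier(crude_sprite(71, 6.0, 3.0), n_died=4,  n_discharged=9)

u, p = mannwhitneyu(placebo, vitc, alternative="two-sided")
print(f"U = {u:.0f}, p = {p:.3f}")
```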

2 Likes

Would it make sense to also use biomarker data to distinguish between those in the “discharged alive before 96 hrs” group? Fewer tied ranks.

1 Like

I’m not 100% certain, but from what I have read about the trial, they would not have the data for these patients (the SOFA scores and biomarkers were only measured for patients that were still in the ICU - if patients were discharged from the ICU before 96 hours, those data were not collected).

1 Like

Another option for dealing with situations where an outcome of interest is censored by death is to estimate something called the “Survivor Average Causal Effect” (https://onlinelibrary.wiley.com/doi/full/10.1002/sim.6181). This principal-stratification-based estimand is the effect of treatment in those patients who would have survived in either arm. Of course, we do not know exactly who these patients are, but we can still estimate the treatment effect among them (under some assumptions), just as we can estimate the complier average causal effect in an IV analysis without being able to identify the compliers.
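As a flavor of how this works in the simplest case, here is a sketch (invented data) of trimming bounds for the always-survivor stratum under a monotonicity assumption (treatment never causes death). This is a simplification in the spirit of that literature, not an implementation of the methods in the linked paper:

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented data: 96-h SOFA among survivors in each arm, plus survival rates
surv_c, surv_t = 0.77, 0.96            # survival by 96 h (placebo, vitamin C)
sofa_c = rng.integers(2, 15, 65)       # survivors' SOFA, placebo arm
sofa_t = rng.integers(1, 13, 81)       # survivors' SOFA, vitamin C arm

# Under monotonicity, placebo survivors are exactly the always-survivors;
# treated survivors mix always-survivors with patients saved by treatment.
q = surv_c / surv_t                    # share of treated survivors who are always-survivors

s = np.sort(sofa_t)
k = int(round(q * len(s)))
lower = s[:k].mean()                   # keep the best (lowest-SOFA) q fraction
upper = s[-k:].mean()                  # keep the worst (highest-SOFA) q fraction

print(f"mean SOFA, always-survivors, placebo arm: {sofa_c.mean():.1f}")
print(f"bounds on mean SOFA, always-survivors, treated arm: [{lower:.1f}, {upper:.1f}]")
print(f"SACE (treated - control) in [{lower - sofa_c.mean():.1f}, {upper - sofa_c.mean():.1f}]")
```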

1 Like

Excellent, Andrew. SOFA was guessed in the 1990s. Threshold science (the use of guessed threshold sets as gold standards for disease) had its onset in the 80s.

SOFA is the core of Sepsis-3, the new gold standard for sepsis, based on consensus in 2015 - so threshold science still exists, but only in a few holdover disciplines.

The guessed thresholds themselves are capricious, with the common classic guessed numbers (100, 2, 150, etc.) based more on the metric system than on human pathophysiology.

SOFA is comprised of:

–5 signals
–4 guessed threshold levels for each signal
–A value of 1-4 is assigned for each ascending threshold level of each signal.
–SOFA score is the sum of the values of all signals.

Note: estimate the coma score (GCS) if the patient is on sedation and assign the value accordingly.

https://www.mdcalc.com/sequential-organ-failure-assessment-sofa-score

Sorry Andrew: I was incorrect about SOFA.

SOFA has 6 signals with 5 levels. Not 5 signals with 4 levels as I indicated above.
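With the corrected structure (6 component scores, each 0-4, i.e. five levels), anyone can count how many distinct component combinations collapse into the same SOFA total:

```python
from collections import Counter
from itertools import product

# 6 SOFA components, each scored 0-4 (per the corrected description above)
totals = Counter(sum(combo) for combo in product(range(5), repeat=6))

print(totals[8], "distinct component combinations give a SOFA total of 8")
print(sum(totals.values()), "combinations in all (5**6 = 15625)")
```

By this count, 951 different component combinations share a SOFA total of 8 - the point about sets of eclectic things in compact form.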

MDCalc offers the following disclaimer, which you might find interesting for your datamethods discussion.

“the SOFA Score is not meant to indicate the success or failure of interventions or to influence medical management.”

Maybe MDCalc should have said:

…‘EXCEPT IN RCTs,’ the SOFA score is not meant to indicate the success or failure of interventions or to influence medical management.

BTW, I’m always available to openly debate…

Let me explain this a little better, because this is very important if you plan to do stats and discuss APACHE-type summation scores like SOFA, MEWS, EWS, MOD, or NEWS. APACHE originated in 1981, and you should read the first article to understand the rationale behind these scores. APACHE, which has had many sequels itself (now APACHE IV), was the prototypic score from which the rest of the summation scores were spawned.

We will use SOFA as an example because I have already described it in general terms.

Now the literature quotes mean mortalities for each ascending SOFA score (see the linked MDCalc site above for these levels and references), but each mean SOFA score is actually comprised of a set of discrete combinations of 6 thresholds, each combination having its own mean mortality. The actual threshold combinations which comprise each set will vary from trial to trial. The number of combinations which comprise the set defining the score (and therefore the number of mortalities) is a function of the number of signals and the number of threshold levels.

This is why SOFA is not recommended for clinical use. There are far too many layers of averages. Its perceived value is its linkage to mortality (again, as in the linked MDCalc page, which has the original references), but that mortality is much further removed from the instant patient under care than any mean mortality of the cohort would be. Those who developed APACHE knew this, but humans tend to create mental objects. After a while, those objects are perceived as a discrete thing, and soon the applied math rolls, and everyone talks about the set of eclectic things (e.g. APACHE or SOFA) like it is one thing.

Knowing this, we don’t need to ask ourselves why, if SOFA is used because of its linkage to mortality, the improvement in the mortality signal was so strong while the SOFA signal was unchanged. Well, I leave the determination of the probability of that occurrence to you fine people who are discussing SOFA and the stats here.

Now one has a choice. She can talk about and apply math to a “SOFA score” as if it is a simple mortality function (as many clinicians and researchers seem to think it is), or she can investigate it and help the clinicians understand stats as it relates to all these APACHE-type summation scores. That would be a breakthrough paper. A major advance for critical care science.