Hybrid AI & Delphi-Based Derivation of RCT and Bedside Measurements

Have you heard of hybridizing Delphi and AI? Probably not; it's new. Yet the datasets used to train AI are derived from samples that may reflect significant bias. What happens when this occurs? This is one of the most pressing questions at the intersection of AI and traditional statistical analysis.

The present state of RCTs for critical care syndromes is in flux. For the past 35 years, the RCT measurements defining the syndrome under test were simply “expert guessed” or derived by the 1980s Delphi consensus method. This approach has failed.

This linked study offers a new “hybrid” approach, combining machine learning and the Delphi method, for deriving threshold RCT measurements and clinical scores.

While it may seem reasonable to discount the entire result, given the failure of one-size-fits-all threshold measurements to measure a syndrome comprised of many hundreds of infection types across the world, the authors try hard to render data-driven results and should be commended for trying to step away from (or improve on) the old method of guessing syndrome RCT measurements.

Here, though, we see how bias associated with combining expert opinion (the Delphi method) with AI/ML can produce interesting results.

For example, the platelet threshold of 100 is identical to the platelet threshold of 100 from SOFA (guessed in 1996).

Further, we see they apparently use SOFA and other legacy guessed scores as features, along with the raw data, for ML training.

The conclusion that the threshold for platelets is the exact same number (100) as in the guessed legacy scores may indicate that bias crept into the ML.

Indeed, 100 and 10 were favored numbers in the old 20th-century legacy thresholds, reflecting “number bias.”

This is one danger of AI: it may be trained on biased datasets, or the datasets chosen for training may themselves be defined by bias.

The most striking occurrence here is the absence of a score for myeloid failure. This may be due to bias against the legacy myeloid thresholds of SIRS (WBC of 12 or bands of 10%), which were guessed and certainly set too sensitive. Measurements from the myeloid system have been used as part of the standard for almost 35 years, and now the whole system is abandoned.

The myeloid system had already been deleted from the adult sepsis measurement when Sepsis-3 (which was based on SOFA) replaced SIRS in 2016. The authors state they had a goal of harmonizing with the adult measure. This is one possible example of legacy measurement bias.

I bring this here because one of the risks of AI is that the massive datasets make the thresholds appear valid. Certainly platelets, lactate, and the other signals are death signals in infection, so correlation with death is expected and does not validate a particular threshold for clinical use as a measure of a syndrome comprised of a SET of diseases.

Here they also offer a clinical decision tree for a patient based on the derived thresholds.

From a worldwide AI- and Delphi-processed dataset to the bedside of a 6-year-old in Kenya and a 14-year-old in St. Louis, this is a scary proposition for anyone who understands the limitations of syndromes and AI.

This is being discussed on X, but we need this group's help here, or join in on X. These are the world's children in their most vulnerable state; it is a time to engage the world's great statistical minds.

We are at the cusp of the introduction of AI, and here it starts with the derivation of measurements for bedside clinical and RCT use. This has to be right. I’m hoping for many comments.

This paper is a major accomplishment, reflecting a profound amount of organization and work. Yet is it effective to pool data from across the world, apply AI, and derive a threshold value for clinical use in the instant patient? Is that an effective use of AI? It's a question that must be answered soon, given the promulgation of this standard, which may be clipped to pediatric ward walls within days.

It's worthwhile to think of the heterogeneity of those pooled cases. Can we generalize such broadly acquired results to the local, instant patient under care, and can we apply them as measurements for RCTs?


1 Like

Hi Lawrence

“Even in settings where not all variables are available, the Phoenix Sepsis Score is designed to accurately identify children with sepsis.”

I’m not sure I understand how the derivation of these scores differs from the derivation of scores used in adult critical care. You seem to be saying that these authors’ methods were more “data-driven”(?) But even if the methods these authors used were more data-driven, the goals of sepsis clinical scoring systems still feel a bit muddled and maybe overly broad (?)

I can totally appreciate why it’s important to be able to quickly identify kids who are really sick. In these situations, labs need to be drawn more quickly, antimicrobials started promptly, closer monitoring arranged, and the need for life support anticipated. These goals are clearly predictive in nature and maybe this is where a well-designed scoring system could shine. The biggest risk posed by a suboptimal predictive tool would probably be underestimation of illness severity, leading to diagnostic and therapeutic delay or inappropriate hospital discharge.

It’s quite another challenge altogether to try to identify specific therapeutic interventions (other than early detection) to improve outcomes. And if the history of adult sepsis trials is any guide, an RCT that draws from a target population as nonspecific/noisy as “really sick children” (as identified using a clinical prediction score) probably isn’t going to lead to any therapeutic breakthroughs…

1 Like

This is why use of thresholds of nonspecific lab values as surrogate measurements of a syndrome comprised of a large heterogeneous set of diseases has the potential to delay care. These are severe thresholds, e.g. platelets of 100, lactate of 5. They are more specific but they are breached late.

The baseline values, the slopes, and the time of presentation in relation to disease onset are all variable. A value that has not reached threshold may do so in 6 hours, when it is untested or too late to intervene with the best chance of a favorable outcome. We are provided no prospective outcome data using this decision tree, and it is quite possible that outcomes would be worse than expert assessment without scoring.

Note the decision points in this diagram.



Such scores are ubiquitous and are useful for assessing the general severity of aggregate patient populations, but not for individual decision making.

However, this is the beginning of the evolution of AI into decision making and RCT measurements. In this paper from 2019 I describe the transition from threshold decision making to AI. Yet here, with a new sepsis score, we have thresholds defined by AI.


1 Like

So I guess the question is how to find out, going forward, whether application of the above algorithm is causing net benefit or net harm. Would it be necessary, in a trial, to randomize kids entering a hospital to either “standard MD assessment” (no formal scoring systems used) or “protocolized assessment” arms of an RCT, then to assess survival in each trial arm? But in that case, how would physicians even decide which kids should be offered randomization/inclusion in the trial when they come in the door of the hospital (?) The first step in enrolling patients in a trial is to identify those who have the disease/condition in question. But if disease definition is itself a source of controversy, then how does trial eligibility get defined? And even if it were possible to clearly define eligibility criteria for such a trial, how large a trial would be needed, considering mortality rates among children identified to have sepsis, to show a benefit from a protocolized approach to patient management?

Erin, IMHO, testing these thresholds using RCT would be problematic. Lactate of 5 in the presence of acute infection is a very late signal for death. What experienced clinician is going to wait (during a trial) until that threshold is breached?

This is an opportunity to see how ML/AI makes mistakes when seeking high specificity. The AI does not know it is trading delay (higher death) for specificity. Contrast this with the quote below, from a more traditional (non-AI) clinical study examining lactate as a signal for mortality in the ER.

“Our results showed that for all patients presenting to the ED, a rising lactate value is associated with a higher mortality. This pattern was similar regardless of patients’ age, presence of infection or blood pressure at presentation.”…

“Thus, a serum lactate level greater than or equal to 2.5 mmol/L was 76% (95% CI 65% to 88%) sensitive and 71% (95% CI 68% to 73%) specific for early death. Lactate levels greater than or equal to 4.0 mmol/L were 55% (95% CI 41% to 68%) sensitive and 91% (95% CI 90% to 93%) specific for early death. A multiple logistic regression model controlling for age showed lactate levels greater than or equal to 4.0 mmol/mm3 to have a 9.5 (95% CI 5.5 to 17) odds of death within 3 days, whereas a bandemia greater than 5% had an OR of 3.7 (95% CI 2.0 to 6.6).”
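To make the tradeoff in the quoted figures concrete, here is a minimal sketch of the sensitivity/specificity arithmetic. The counts are hypothetical, chosen only so the point estimates match the quoted ones; the denominators (100 early deaths, 1,000 survivors) are my assumptions, not the study's data:

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical cohort: 100 early deaths, 1000 survivors.
# Lower threshold (lactate >= 2.5): catches more deaths, flags more survivors.
sens_low, spec_low = sens_spec(tp=76, fn=24, tn=710, fp=290)
# Higher threshold (lactate >= 4.0): misses more deaths, flags fewer survivors.
sens_high, spec_high = sens_spec(tp=55, fn=45, tn=910, fp=90)

print(f"lactate >= 2.5: sensitivity {sens_low:.2f}, specificity {spec_low:.2f}")
print(f"lactate >= 4.0: sensitivity {sens_high:.2f}, specificity {spec_high:.2f}")
```

Raising the cutoff buys specificity at the direct cost of sensitivity, which in this clinical setting means missed or delayed detection.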

This is a problem with using mortality as an endpoint and a threshold of nonspecific death signals like lactate as a diagnostic measurement. The ML immediately tracks on death signals because death (not sepsis) is the endpoint, but the ML does not know the difference between types of endpoints. For example, the ML does not know the difference between a pregnancy test threshold detecting the true (pregnant) state and a sepsis test threshold detecting a surrogate true (future death) state.

Here we can see that a “sepsis score” which contains a death signal (like lactate) is not validated by showing the “sepsis test” is directly correlated with mortality, because the sepsis score contains a mortality signal itself.
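This circularity can be demonstrated with a toy simulation. Everything below is my construction (the risk model and the lactate cutoff of 5 are arbitrary): if mortality is generated directly from lactate, any score that embeds a lactate threshold will "validate" against mortality by construction.

```python
import random

random.seed(0)

# Simulate a cohort in which mortality risk is driven directly by lactate,
# i.e., lactate is a pure death signal.
n = 5000
lactate = [random.uniform(0.5, 15.0) for _ in range(n)]
died = [1 if random.random() < min(0.9, 0.06 * l) else 0 for l in lactate]

# A "sepsis score" that embeds the same death signal: positive if lactate >= 5.
score = [int(l >= 5.0) for l in lactate]

def mortality(outcomes):
    """Fraction of the group that died."""
    return sum(outcomes) / len(outcomes)

pos = [d for d, s in zip(died, score) if s == 1]
neg = [d for d, s in zip(died, score) if s == 0]

# The score "predicts" death, but only because it contains the very signal
# that generated death; the gap says nothing about sepsis itself.
print(f"mortality when score positive: {mortality(pos):.2f}")
print(f"mortality when score negative: {mortality(neg):.2f}")
```

The mortality gap between score-positive and score-negative groups is guaranteed here, even though nothing resembling sepsis was ever modeled.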

A major problem comes when these late (and therefore more specific) thresholds are incorporated into decision trees. Late diagnosis is then embedded in the clinical care.

It would be helpful to have input on how one best optimizes a threshold for a diagnostic test which incorporates a nonspecific death signal like lactate or the platelet count.

However, it's clear that if a nonspecific death signal is used for diagnosis of a “syndrome,” one has to have one threshold for timely detection in a decision tree and another, more specific one for diagnosis as a measurement for RCTs.

The AI/ML doesn't know this, and there is trust in AI/ML. That's why it's urgent that statisticians ask questions about threshold measurements for decision trees and RCTs, especially when the same threshold is used for both.

When one is using AI/ML to develop a diagnostic test for a syndrome, the death signals outweigh the more specific signals (like the myeloid system, e.g., bandemia, in sepsis), but that does not mean the less weighty signals can be abandoned.



1 Like


Looking at the algorithm in the paper itself, I realize that I misinterpreted it. There’s a caption underneath it (not shown in your post above) which corresponds to the superscript “a” after “Screen for sepsis”:

a Institutionally available procedures to identify deteriorating patients with infection should be followed for screening. There is a need for data-driven tools to screen children at risk of development of sepsis, which must be rigorously evaluated in different populations and contexts. The Phoenix Sepsis Score is not intended for early screening or recognition of possible sepsis and management before organ dysfunction is overt.

So it doesn’t look like the scoring system is meant to replace a physician’s intuition or clinical impression when it comes to identifying kids who look “really sick.” Rather, once sepsis is clinically suspected (presumably using signs/symptoms, initial testing, and clinical gestalt), the clinician is then supposed to use the scoring system to assess for the presence/severity of organ dysfunction. And only if a certain minimum Phoenix score is obtained would the clinical record identify the patient as “having” sepsis. The authors note that consensus around which kids have sepsis is important because these judgements will inform disease surveillance/outcome monitoring and permit anticipation of need for patient transfer. Since a given score doesn’t really seem to dictate any particular clinical intervention (or result in the withholding of any particular intervention), I don’t see how it could harm patients.

Having said this though, I do wonder about the “Quality improvement” and “Enrolment in clinical trials” goals of the scoring system. As we’ve discussed in this forum many times, signals of therapeutic efficacy tend to be dwarfed by statistical “noise” when syndromes are used as inclusion criteria…

1 Like

Thanks Erin

As a rule, if the success of an intervention is dependent on its timing and the intervention is triggered by a threshold, then such a threshold can do harm if triggered late. When a threshold is promulgated, physicians and nurses perceive it as evidence-based, and many will act as if it is a decision point.

Here is the decision tree.


So, yes, the first portion of the decision tree includes screening for and treating infection on clinical grounds, but the designation of “sepsis” (and the associated treatment) occurs at extreme death signal levels (lactate >=11, or lactate >=5 plus platelets <=100). These values are certainly specific, but such a patient is on a continuum and near death.

My point is that as the patient is dying, the death signal (e.g., lactate) rises progressively, with a faster or slower slope. The ML picked 11 (near death) to render the diagnosis. Why?

I can see how this might just be a tool to detect the dying for an intervention trial but if it is used as triage it can cause delay.

In our past work we taught the “patterns of unexpected hospital death” to explain to nurses how thresholds can lead to a false sense of security and delay. It is a myth that thresholds render a result that is “better late than never,” because failure to breach a threshold is often used to wrongfully decide against escalation of care (a false sense of security).

However, the bigger issue for this forum is that these ML-derived thresholds define a “Synthetic Syndrome” capturing a SET of diverse, deadly, progressive infections at a very late clinical stage designated as “sepsis.” Earlier thresholds for this purpose have not worked, but these thresholds will produce a much sicker population for study which is easily identified.

Therefore, for the problem of the low signal-to-noise ratio due to the broader and earlier thresholds of SIRS, this might help. However, as many have discussed in this forum, it's unlikely that that was the primary reason sepsis RCTs have failed for the last 30+ years.

I get all that, but my point is that the algorithm isn't saying to the physician "you can just relax until lactate goes higher than 11." My impression is that the purpose of assigning the score is primarily to assist with epidemiological/public health monitoring rather than to dictate any particular clinical intervention. I completely agree that it could be very problematic if physicians were to misuse the scoring tool and start using it instead to overrule their clinical instinct that a child is really sick; this is a real risk that needs to be carefully considered and actively mitigated. But that isn't the stated purpose of the score.

Agree, it would be good for that.

Compelled by lactate, I deviated into the clinical discussion. That’s not my purpose here.

My goal in this forum is to discuss the RCT components which are assumed valid by a Cochrane review, such as the origins of these standard RCT measurements, the one-size-fits-all parsimony, and the dichotomization of rapidly sloping acute signals.

In the end, I hoped to shine a light on 30 years of non-reproducibility of RCTs applied to fluidic sets comprised of a mix of different diseases with different ATEs (a mix which changes with each new RCT).

This 2024 pediatric sepsis threshold set shows that my efforts have not been sufficiently successful.


This article makes my point about the threshold of lactate in the newly promulgated Phoenix Sepsis Score (PSS).

In the PSS, a lactate of 10.9 gives only 1 point (no sepsis) but is off this scale…

It's also a beautiful picture of the age relationship of lactate mortality.

There are multiple reasons why a young patient survives a higher lactate.

However, a lactate of 11 to diagnose sepsis in the world's children? That, I believe, could only come from a machine crunching numbers with no idea what it was doing.

A child can survive a higher lactate, so a machine using mortality as a surrogate endpoint will select a higher lactate as the “true state.” This is the twisted way that using AI/ML can cause harm.

Expert critical care physicians have long known the “healthy paradox”: the vigorous young may trigger thresholds late because they have more compensatory reserve.

The AI does not know this.
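The selection dynamic described above can be sketched with synthetic data. Everything below is my construction (a hypothetical cohort and a hypothetical specificity target), not the paper's method: when a cutoff search is scored against mortality labels and the objective rewards specificity, and survivors with strong compensatory reserve reach high lactates, the selected cutoff drifts toward near-terminal values.

```python
def pick_threshold(lactates, died, min_specificity=0.98):
    """Return the lowest cutoff whose specificity meets the target,
    with the sensitivity and specificity achieved at that cutoff."""
    survivors = [l for l, d in zip(lactates, died) if d == 0]
    deaths = [l for l, d in zip(lactates, died) if d == 1]
    for cut in [t / 2 for t in range(2, 31)]:  # candidate cutoffs 1.0 .. 15.0
        spec = sum(l < cut for l in survivors) / len(survivors)
        if spec >= min_specificity:
            sens = sum(l >= cut for l in deaths) / len(deaths)
            return cut, sens, spec
    return None

# Hypothetical pediatric cohort: vigorous compensation means many survivors
# transiently reach lactates of 8-9 before recovering.
survivor_lactates = [1 + 8 * (i / 99) for i in range(100)]   # 1.0 .. 9.0
death_lactates = [6 + 9 * (i / 19) for i in range(20)]       # 6.0 .. 15.0
lactates = survivor_lactates + death_lactates
died = [0] * 100 + [1] * 20

cut, sens, spec = pick_threshold(lactates, died)
print(f"selected cutoff: lactate >= {cut} (sens {sens:.2f}, spec {spec:.2f})")
```

In this toy cohort the search lands on a cutoff of 9, well past the point where intervention has the best chance, and the sensitivity it sacrifices is invisible to the objective.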

In my view, this new standard set of thresholds derived from ML should be a wake-up call about reliance on “black box” clinical decision making.


Lactate is perceived as a signal to incorporate into a threshold SET under Pathological Consensus because the production of excess lactate by the human body occurs due to reduced blood flow, low tissue oxygen levels, inadequate O2 delivery for the O2 demand, and high adrenergic states. All these events are typical of many death processes.

Lactic acid has a lower pKa than bicarbonate, so it gives its hydrogen ion to the bicarbonate. As lactate rises, bicarbonate falls and a fall in pH (metabolic acidosis) evolves. This triggers an increase in heart rate and respiratory rate and eventual collapse.

By the time the lactate is 11, acidosis is often quite advanced.

When respiratory compensation (hyperventilation) is exhausted, the heart rate falls and the patient dies. This is the classic type I signal pattern of death.

Here you can see why the young survive higher lactate levels. Respiratory compensation (hyperventilation), which mitigates the fall in blood pH, is pivotal for prolonged survival (providing enough time for the patient to recover).
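The buffering argument can be made concrete with the Henderson-Hasselbalch equation, pH = 6.1 + log10([HCO3-] / (0.03 × pCO2)). The numbers below are textbook approximations (baseline HCO3 of 24 mEq/L, pCO2 of 40 mmHg), assuming lactate titrates bicarbonate roughly 1:1 and using Winter's formula for the compensated pCO2; they are illustrative, not patient data:

```python
import math

def ph(hco3_meq_l, pco2_mmhg):
    """Henderson-Hasselbalch: pH = 6.1 + log10([HCO3-] / (0.03 * pCO2))."""
    return 6.1 + math.log10(hco3_meq_l / (0.03 * pco2_mmhg))

# Baseline: HCO3 24 mEq/L, pCO2 40 mmHg -> pH ~ 7.40.
print(f"baseline:                      {ph(24, 40):.2f}")

# A lactate of 11 titrates roughly 11 mEq/L of bicarbonate (HCO3 ~ 13).
# Without respiratory compensation (pCO2 unchanged at 40):
print(f"lactate 11, no compensation:   {ph(13, 40):.2f}")

# With hyperventilation (Winter's formula: pCO2 ~ 1.5 * HCO3 + 8 = 27.5):
print(f"lactate 11, with compensation: {ph(13, 1.5 * 13 + 8):.2f}")
```

The compensated pH sits well above the uncompensated one, which is the arithmetic behind why intact respiratory reserve buys survival time at a given lactate.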

The new hack for sepsis criteria (measurements for RCTs) is to add a threshold lactate to a criteria SET (i.e., Sepsis-3 is SOFA plus lactate).

Then the sepsis score is validated by showing its correlation with mortality, since sepsis is often fatal if not treated in a timely fashion.

The hack comprises adding a death signal to the sepsis criteria (Threshold SET) and then validating the Threshold SET against death. This is guaranteed validation.