A major selection bias affects a widely used model for predicting hypotension

JohsEnevoldsen · August 27, 2022, 1:50pm

A colleague and I recently published a paper describing how a widely used model for predicting hypotension has been both developed and validated on data impacted by a major selection bias.

In essence, different selection criteria are used depending on the outcome, creating a biased dataset.

The model (Hypotension Prediction Index, HPI) is proprietary and build into a device used for monitoring patients during anesthesia.

I wrote a short summary of it on twitter, pasted below (https://twitter.com/JohsEnevoldsen/status/1561641153899929601)

HPI (a logistic regression) uses features derived from the arterial blood pressure waveform to assess the risk of hypotension in 5 min. It is displayed on the monitor as a number from 0-100. When the predicted risk is high (HPI>85), the device will alarm the clinician.

It was trained on features from 20-sec sections of arterial pressure waveform (predictors). Some labelled hypotension and some labelled non-hypotension (outcomes).

The methods section in the original paper presenting HPI describes the selection of outcomes and predictors (Machine-learning Algorithm to Predict Hypotension Based on High-fidelity Arterial Pressure Waveform Analysis | Anesthesiology | American Society of Anesthesiologists)

Features corresponding to hypotension were selected 5 (and 10 and 15) min before hypotensive events (mean arterial pressure (MAP) <65 mmHg for ≥1 min)

And here comes the problematic part:

A non-hypotensive event was defined as a 30-minute period where MAP was continuously >75 mmHg. Features corresponding to a non-hypotensive event was the middle 20 sec of this 30 minute period.

Therefore, to predict hypotension with 100% specificity, you just had to guess “hypotension” every time MAP was < 75 mmHg. HPI (a logistic regression model) has probably learned this.

Importantly, this selection was used for both training and test data.

We made a simple visualization of what this selection does to a ROC analysis with MAP as a predictor of hypotension (simulated data).

The ROC curve in panel b is skewed towards high specificity, similarly to the ROC curves in the original paper and most validation papers.

Unfortunately, the predictive ability of HPI was not compared to the predictive ability of MAP. Instead, HPI was compared against the change in MAP over 3 min (ΔMAP), which—to no surprise—is not a good predictor of hypotension.

But how should such a model’s predictive ability be validated?

While this bias in the training data has probably flawed the models ability greatly, it is still unclear to us what the best way to measure it’s predictive ability is.

What the clinical wants to know is:
“when the alarm sounds, what is the probability of hypotension in the next e.g. 10 min?” and
"how often is hypotension preceded by an alarm in a reasonable time interval (e.g. 2-10 min).

This is, however, not trivial to code. E.g the HPI comes with a number every 20 second. If it is above 85, an alarm sounds. Should all consecutive HPI values > 85 be counted as separate alarms, or is it only the first? What if drops below 85 and then quickly rises again? is that two alarms, and can they both correctly predict the same hypotensive event?

Comments on the paper, and on the question above, are very welcome.

R_cubed · August 27, 2022, 10:02pm

I’d have thought this type of problem could be examined via time series methods (ie. ARIMA) and an adaptation of resampling know as block bootstrap.

Some good advice can be found in this open access textbook:

JohsEnevoldsen · August 28, 2022, 7:35am

An autoregressive model of MAP would probably be a good baseline model to compare HPI against.

In this case, HPI has already been developed and implemented. It probably has a predictive ability very similar to MAP due to the selection bias presented above. This could be investigated in a ROC analysis similar to the original study (with and without the selection bias) where HPI is directly compared to MAP.

However, this type of analysis cannot answer the clinically relevant question: given an alarm, what is the probability of hypotension in next e.g 10 min? (Positive predictive value).