Hi everyone,
I’m currently working on a project that aims to build an interpretable, easy-to-use prediction model in the form of a decision tree. The hope is that healthcare workers can use the model graphically, without performing any mathematical calculations or using computer software. The model is meant to predict mortality within 7 days of admission for children admitted to hospitals in southwest Nigeria. The prediction will then be used to help inform whether or not to admit children to intensive care. The data used to build the model are clustered by hospital, and hence internal-external cross-validation is used, as recommended by TRIPOD-cluster. Furthermore, the data are very imbalanced, with only 4% of observations being cases.
There are two types of performance metrics: discrimination and calibration. Recently, I have encountered many sources recommending that calibration be prioritized when the purpose of the model is to support decision-making. I’m wondering whether that applies to all situations. My team is currently leaning toward optimizing and focusing on discrimination, since a probability model would be harder for healthcare workers to use than a clear-cut binary classification, especially when displayed graphically as a decision tree. Moreover, we are trying to produce only one final decision tree with consistent predictions across all settings/hospitals. Because of this, we are considering building the decision tree through upsampling and optimizing the F2-score. We want to use the F2-score because we care more about correctly classifying deaths, even if this leads to more false positives (reflecting the asymmetric cost of misclassification and the class imbalance). This will distort the model’s calibration, but it will produce a model that prioritizes discrimination with a clear-cut binary classification.
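For concreteness, here is a minimal sketch of the F2-score we have in mind, written in R. The outcome and prediction vectors below are made-up placeholders, not our actual data:

```r
# F-beta score from 0/1 observed outcomes and 0/1 predicted classes.
# beta = 2 weights recall (sensitivity) more heavily than precision.
f_beta <- function(obs, pred, beta = 2) {
  tp <- sum(pred == 1 & obs == 1)
  fp <- sum(pred == 1 & obs == 0)
  fn <- sum(pred == 0 & obs == 1)
  precision <- tp / (tp + fp)   # NaN if the model predicts no positives
  recall    <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

# Illustration with simulated values only
set.seed(1)
obs  <- rbinom(200, 1, 0.04)   # ~4% event rate, as in our data
pred <- rbinom(200, 1, 0.10)   # hypothetical classifier output
f_beta(obs, pred, beta = 2)
```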
All advice would be greatly appreciated.
This article describes in detail why forced-choice classification does not meet the goals of decision making and is even misleading. So risk prediction should be your goal, which will also not lead you to worry about so-called “imbalance”.
One of the ironies of simplifying risk scoring rules is that many papers published in the medical literature have produced over-simplified, discontinuous scoring systems with worse discrimination than just using a single continuous variable such as age as a predictor. You can estimate outcome risk from a single variable using a single-axis nomogram as shown here.
So I suggest that obtaining near-perfect calibration with good discrimination be the goal. Then the question is how to make the calculations as simple as possible if you don’t want to use a smartphone app. @Ewout_Steyerberg’s book Clinical Prediction Models has lots of examples of developing point scores and nomograms. A point score table is just example points from a nomogram. The R rms package’s nomogram function produces point tables when the result is printed, and draws a nomogram when its result is plotted. There are other approaches, such as a line graph with several curves on it. I’d like to see others respond with more ideas.
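A minimal sketch of that workflow with rms, using an assumed data frame d and placeholder predictor names (age, spo2, sex) and outcome (death) that are not from the original project:

```r
library(rms)

# Hypothetical data frame `d` with a binary outcome `death`; variable names are placeholders
dd <- datadist(d); options(datadist = "dd")

# Logistic model with restricted cubic splines for the continuous predictors
fit <- lrm(death ~ rcs(age, 4) + rcs(spo2, 4) + sex, data = d)

# fun = plogis maps the linear predictor to predicted risk
nom <- nomogram(fit, fun = plogis, funlabel = "Risk of death within 7 days")
print(nom)   # point score tables
plot(nom)    # graphical single-page nomogram
```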
Hi Professor Frank Harrell,
Thanks for the answer. I have a question about assessing discrimination in the context of internal-external cross-validation. The available literature on internal-external cross-validation seems to focus on using ROC-AUC to assess a model’s discrimination. However, the data I’m using are very imbalanced (4% of all observations are cases), and I wonder whether it is still a suitable metric. Other sources recommend metrics such as PR-AUC, which they claim are more suitable for imbalanced data; ROC-AUC is said to give optimistic performance estimates for imbalanced data. However, I currently have no method for calculating the standard error of the PR-AUC. Without the standard error, I can’t perform the meta-analysis procedure to pool the results across the internal-external cross-validation.
So, in short, my question is: can I still use ROC-AUC for imbalanced data? If not, how can I interpret the PR-AUC without the ability to pool the results across the internal-external cross-validation?
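For reference, this is the kind of pooling I have in mind, sketched in R with hypothetical objects: `results` is assumed to be a list with one element per held-out hospital, each holding out-of-sample predicted risks `p` and observed 0/1 outcomes `y`:

```r
library(Hmisc)    # rcorr.cens: c-index with an approximate standard deviation for Dxy
library(metafor)  # rma: random-effects meta-analysis

# Cluster-specific c-statistics and standard errors
per_cluster <- do.call(rbind, lapply(results, function(r) {
  rc <- rcorr.cens(r$p, r$y)
  data.frame(
    c  = rc["C Index"],
    se = rc["S.D."] / 2   # SE of C = SE of Dxy / 2, since C = Dxy/2 + 0.5
  )
}))

# Random-effects pooling across held-out hospitals
pooled <- rma(yi = per_cluster$c, sei = per_cluster$se, method = "REML")
summary(pooled)
```

(Whether to pool on the raw or logit-transformed c-statistic scale is a separate choice; this is only a sketch of the mechanics.)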
What steps are you taking to avoid the risk of displacing or excluding organic, local knowledge & expertise by imposition of this protocolized decision ‘tool’? Are you engaged in a genuine scientific collaboration with local clinician-experts, such that your joint work on this prediction modeling might conceivably yield fresh insights into the problems they face? How will you assess the effects of this tool’s introduction on these admission decisions?
The c-index, i.e., AUROC, has no problem at all with imbalance. But supplement this with other measures such as these and these.
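As one illustration of such supplementary measures, rms::val.prob reports the c-index alongside calibration and overall accuracy statistics from out-of-sample predictions; `p` and `y` below are placeholder names for predicted risks and observed 0/1 outcomes:

```r
library(rms)

# `p` = out-of-sample predicted risks, `y` = observed 0/1 outcomes (hypothetical objects)
val.prob(p, y)   # prints Dxy, c-index (ROC area), Brier score, calibration
                 # intercept and slope, Emax, and draws a calibration plot
```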