Presentation of the results of a prediction model

I have a prediction model that outputs the probability that the result of an invasive test will alter management, using predictors that are easily obtainable before the decision to perform the invasive test is made. Most of the predictor variables are continuous, and I have included them as four-knot restricted cubic splines. The model calibrates extremely well in bootstrapped internal validation. After external validation, I intend to provide the model as an online calculator.

I want to be able to warn users when the inputted data are at the extremes of the parameter space, i.e. when the particular combination of inputted parameters was extremely rare or never seen in the derivation set.

The easiest solution is to warn the user when individual predictors are at the extremes of their ranges in the derivation and validation datasets; this is in fact already implemented. But it does nothing for the problem of unusual combinations. One option is to provide the confidence interval of the predicted probability for the combination, which tends to be wide for unusual combinations, but I fear that clinicians will interpret the confidence intervals incorrectly and that this confusion will, on average, lead to wrong decisions. Next I thought I could write a script that checks how many individuals in the derivation set had predictor values within a certain range of the inputted values, so that the calculator could provide a statement such as: “Warning: ≤1% of individuals in the derivation set had a combination of values similar to those that were inputted”. However, I don’t like this approach, both because the chosen range would always be arbitrary and because there is really nothing to say that a rare combination results in invalid inference.
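For concreteness, the counting heuristic described above could be sketched roughly as follows. This is only an illustration: the variable layout, the 10%-of-range tolerance, and the 1% warning threshold are arbitrary choices, not part of any published method.

```python
import numpy as np

def similar_fraction(derivation, query, rel_tol=0.10):
    """Fraction of derivation-set rows whose predictors all fall within
    rel_tol * (observed column range) of the queried values.

    derivation : (n, p) array of predictor values from the derivation set
    query      : (p,) array of inputted predictor values
    rel_tol    : half-width of the similarity window, as a fraction of
                 each predictor's observed range (arbitrary choice)
    """
    derivation = np.asarray(derivation, dtype=float)
    query = np.asarray(query, dtype=float)
    col_range = derivation.max(axis=0) - derivation.min(axis=0)
    tol = rel_tol * col_range
    close = np.abs(derivation - query) <= tol   # (n, p) boolean matrix
    return close.all(axis=1).mean()             # rows close in every predictor

# Illustrative use with simulated data standing in for a derivation set
rng = np.random.default_rng(0)
deriv = rng.normal(size=(5000, 3))
frac = similar_fraction(deriv, np.array([0.0, 0.1, -0.2]))
if frac <= 0.01:
    print("Warning: <=1% of the derivation set resembled this input")
```

As the post notes, the tolerance driving this count is unavoidably arbitrary, which is exactly the stated objection to the approach.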

Is this a problem other groups have tackled and if so, what solutions were found?


I think the confidence intervals should be presented, but carefully, so as not to hurt interpretation. As to the approach of estimating how well a given subject is represented in the model development sample, the Hmisc dataRep function can help.


In our biomarker-based scores for patients with atrial fibrillation (the ABC-AF stroke and ABC-AF bleeding scores) we have limited the input based on the distribution in both the development and the validation data. Limits were not formally defined from percentiles of the distribution but rather as a compromise between looking at extreme values and judgement by clinicians. In addition, for the overall predictions we looked at the calibration curves (both internal and external) and only allowed predictions within a range where the models were reasonably well-calibrated. Results outside that range are reported as, e.g., “ABC-bleeding risk >15%” or “ABC-stroke risk <0.2%”. We have a preliminary version of the risk calculators on our homepage abc-risk-calculators where you can try different extreme combinations to see the reporting.


I’m not sure the patient is best served by suppressing predictions when X values are out of range.

When placing limits, they should be based not on quantiles but on absolute sample size.

Maybe I wasn’t clear enough, but predictions are not suppressed for extreme X values. Rather, extreme X values were deemed “implausible” (especially for some biomarkers) and truncated to the highest (or lowest) allowed value, and predictions were made from the truncated values. So, e.g., for age >95 years the age was set to 95 years before making the prediction, avoiding extrapolation beyond the support for age in the development (and validation) data.
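The curtailing step described here amounts to clamping each input to a supported range before prediction. A minimal sketch, where the variable names and limits are invented for illustration and are not the actual ABC-AF specification:

```python
# Allowed support per predictor; values below are purely illustrative.
LIMITS = {
    "age": (40, 95),        # years
    "ntprobnp": (50, 5000), # ng/L
}

def curtail(inputs):
    """Return inputs with each value truncated to its limits, plus the
    list of variables that were modified (useful for user warnings)."""
    out, warned = {}, []
    for name, value in inputs.items():
        low, high = LIMITS[name]
        clamped = min(max(value, low), high)
        if clamped != value:
            warned.append(name)
        out[name] = clamped
    return out, warned

values, warned = curtail({"age": 99, "ntprobnp": 300})
# values["age"] == 95, and "age" appears in warned
```

The prediction would then be computed from `values`, never from the raw out-of-support inputs.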

Regarding truncation of the (reporting of) final predictions, the decision was based on where we had support for reasonable calibration. Also, the truncation point for the predictions was well above any reasonable clinical decision point; e.g., the truncation for stroke was at >10% one-year risk (in OAC-treated patients), where anything above 3% is usually considered rather high.


Ah I see. That’s quite reasonable. I often recommend curtailing of covariate values.


Thank you for this response and for the reply to Dr. Harrell’s comment. Could you elaborate a bit on how you did this? I read the publications linked in the calculator but couldn’t find a description.

If I understand correctly, you allowed users to input biologically plausible values into the calculator but truncated them to an extreme percentile of the distribution of that variable in the derivation and validation cohorts? E.g., if age 99 was entered, 95 years was passed to the model and a warning was issued to the user. Did you extend this concept to unusual combinations of variables? As a clinician I would find it extremely unlikely that someone with a high hs-troponin would have a low NT-proBNP; did you somehow modify the entered values based on the percentiles of the joint distribution of the inputted variables?

Finally, I am unsure how you would allow only predictions within a range where the model was well calibrated, when the prediction is the output of the product you are providing. Did you simply not provide the answer if it fell outside the well-calibrated range? I can understand modifying the entered values based on the extremes of the variables’ distributions, but surely you could never modify them based on the output?


That’s correct.

No, we didn’t consider joint distributions but that’s a very interesting idea. The challenge would of course be to decide on the bivariate limits. One possibility could be to allow any combination of values but set up some soft limits issuing a warning such as “are you really sure about these values?”.
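One generic way to implement such a soft warning for unusual combinations (not something described in the ABC-AF work, just a common heuristic) is to flag inputs whose Mahalanobis distance from the derivation-set centre exceeds an extreme empirical quantile of the derivation set’s own distances. A sketch, with simulated data and an arbitrary 99% cutoff:

```python
import numpy as np

def mahalanobis_flag(derivation, query, threshold_quantile=0.99):
    """True if the query's squared Mahalanobis distance from the
    derivation-set mean exceeds the given empirical quantile of the
    derivation set's own distances. The threshold choice is arbitrary."""
    X = np.asarray(derivation, dtype=float)
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

    def d2(v):
        diff = np.asarray(v, dtype=float) - mu
        return float(diff @ cov_inv @ diff)

    ref = np.array([d2(row) for row in X])
    cutoff = np.quantile(ref, threshold_quantile)
    return bool(d2(query) > cutoff)

# Two positively correlated predictors: a high value of one paired with
# a low value of the other is individually plausible but jointly unusual.
rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
deriv = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
discordant = mahalanobis_flag(deriv, [2.0, -2.0])  # flagged
typical = mahalanobis_flag(deriv, [0.0, 0.0])      # not flagged
```

A flag like this could drive exactly the kind of “are you really sure about these values?” prompt suggested above, without refusing the prediction.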

That is correct: we calculate the predictions based on (possibly truncated) input values but refrain from reporting values outside the well-calibrated range. That is, if the prediction is a 12% stroke risk we report it as “>10%”. So no entered values are modified retrospectively, if that is what you meant. Again, in our application these limits are far from any relevant clinical decision point, so it feels safe to report extreme predictions simply as above (or below) the limit.
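The reporting rule described here (compute from the curtailed inputs, then censor only the displayed prediction at the calibrated range) is simple to express in code. The 0.2% and 10% bounds below are taken from the examples quoted in this thread and are otherwise illustrative:

```python
def report_risk(p, low=0.002, high=0.10):
    """Format a predicted probability for display, censoring it outside
    the range where calibration was judged adequate (bounds illustrative)."""
    if p < low:
        return f"<{low:.1%}"
    if p > high:
        return f">{high:.0%}"
    return f"{p:.1%}"

print(report_risk(0.12))    # -> ">10%"
print(report_risk(0.001))   # -> "<0.2%"
print(report_risk(0.034))   # -> "3.4%"
```

Note that the underlying prediction is still computed in full; only its presentation is truncated.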


Well, I hate to upset the apple cart, but here’s an argument against confidence intervals for predicted probabilities:


Mike this is behind a high profit margin company’s paywall. Just making trouble :slight_smile:

Ah, no wonder it is never cited, LOL! Well, it is a hypothetical conversation between a doctor and a patient. The gist is something like this:

Patient: Doc, what are my chances?
Doctor: If I had 100 men identical to you, I would expect about 70 to be alive 5 years later. I have no idea whether you will be one of the 70.
Patient: Thanks, that is pretty clear.
Doctor: By the way, the 95% confidence interval for this prediction is 62% to 78%.
Patient: What does that tell me?
Doctor: It means that I don’t know exactly how many of the 100 men will be alive 5 years later, but I expect between 62 and 78 will. My best guess is still 70, but this is a range for my guess.
Patient: Does that mean at least 62 but no more than 78 will be alive?
Doctor: No, but fewer than 62 or more than 78 would be quite unlikely.
Patient: Where does the 95% come in?
Doctor: The casual interpretation is that there is only a 5% chance that fewer than 62 or more than 78 men will be alive in 5 years. Technically, that’s not correct. Instead, all I can say is that if I repeated my work of estimating that interval an infinite number of times, 95% of my infinite attempts would successfully include the true risk that you face. Of course, I can’t make a definitive statement about the particular interval I just quoted for you.
Patient: This is not very helpful to me as an individual. Moreover, I think my true risk is 0 or 1. You just can’t predict it that well.
Doctor: Your point is well taken. Although the casual interpretation is not correct, we may as well just stick to it.
Patient: Why did you pick for me the interval outside which it is less than 5% likely that the truth for my 100 identical patients would lie? Couldn’t you have picked a different one?
Doctor: I chose 95% because the scientific community generally does. My guess is that you don’t care about that, so we can pick any interval you want. How about 90%?
Patient: Seems pointless to me, because only the casual yet incorrect interpretation begins to make sense to me. I want to know if I will be alive. You think that, out of 100 men identical to me, about 70 will be alive. You acknowledge more or fewer than 70 may be alive, which was obvious without your confidence interval. And 70 remains your best guess, regardless of the interval you provide for me. So, I feel like the 70-out-of-100 estimate sufficiently conveys the extent of uncertainty you have regarding my prognosis in this single-event context.
Doctor: I agree. Sorry for wasting your time with the confidence interval.


Perfect! Smart patient.

Funny enough, that patient told me that everything he even pretends to know, he learned from you, the GOAT!


This supports my intuition. I really, really don’t want to provide confidence/prediction intervals. I still think some kind of indicator of certainty is warranted; I have been considering this for quite some time with no obvious solution coming to mind.

The one piece of this that should give you pause is that there is uncertainty in which point estimate to give. Is it the MLE (the posterior mode if using Bayes)? The posterior median or mean? Bayes wants you to give a distribution, not a number. Shouldn’t that make us wary of frequentist point estimates?


It does give me pause, but I have no idea what to do during that pause.


Chiming in here because I have come across similar issues in my (non-medical) research. I have personally struggled with providing confidence/credible (uncertainty) intervals for predicted probabilities. In my mind, stating the probability of a binary outcome (conditional on our model) seems like it should already incorporate the uncertainty: specifically, in a Bayesian sense we integrate over the uncertainty in the model parameters and incorporate it into our posterior predicted probability. To me this is fundamentally different from calculating something like the proportion of a selected population that might have the outcome, for which it would make sense to calculate an uncertainty interval.

To further complicate things, it also seems like two supposedly equivalent modeling paradigms produce different outputs. Say we are trying to predict the outcome of a sporting match. We could certainly fit a logistic regression and calculate an uncertainty interval for the predicted probability. Or we could adopt a latent-variable approach and come up with the distribution of the points for each team, in which case we have uncertainty about the points that translates into a single probability of winning for either team. But these two approaches are usually taught as equivalent.

Does anyone have any insight to share?

There is a procedure we sometimes use to predict drug plasma concentration while accounting for the uncertainty in pharmacokinetic parameter estimates (from published population PK studies):

1. Draw a random sample of, say, n = 10,000 from the MVN distribution of the parameter estimates (using the vector of means and the variance–covariance matrix).
2. For each set of parameters, calculate the prediction, giving 10,000 predictions.
3. Report the 0.025 and 0.975 quantiles.
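That procedure can be sketched directly. Here it is applied to a one-compartment model, C(t) = (D/V)·exp(−(CL/V)·t), with made-up means and covariance; in practice these would come from the published population PK study:

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: draw parameter sets (CL in L/h, V in L) from the MVN of the
# estimates. The mean vector and covariance matrix are invented here.
mean = np.array([5.0, 50.0])
cov = np.array([[0.25, 0.5],
                [0.5, 25.0]])
params = rng.multivariate_normal(mean, cov, size=10_000)

# Step 2: one concentration prediction per parameter set.
dose, t = 100.0, 6.0  # mg and hours, illustrative
cl, v = params[:, 0], params[:, 1]
conc = (dose / v) * np.exp(-(cl / v) * t)  # 10,000 predicted concentrations

# Step 3: report the central 95% interval of the predictions.
lo, hi = np.quantile(conc, [0.025, 0.975])
```

The spread of `conc` reflects only parameter-estimation uncertainty; residual (between-dose or assay) variability would need to be added separately if wanted.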

If that were true, it would be OK to study 2 patients, find that one had the disease, and give a probability estimate of 0.5 for the next patient.


Apologies if this ends up hijacking this thread – I can start a new one if you want. This is a topic that seems to come up in my subfield quite often and I am interested in exploring the topic to bring some clarity to future discussions (not just trying to be argumentative :slight_smile: !). I think we may be talking about slightly different things.

This is the situation I am describing: we have the usual parametric probabilistic model for observed i.i.d. binary outcomes with a single covariate:

y_i \sim \text{Bernoulli}(p_i), where \log\left(\frac{p_i}{1-p_i}\right)=\beta_0+\beta_1 x_i, equipped with suitable priors on \beta_0 and \beta_1.

We do our MCMC sampling and find posterior draws \{\beta_0^{(1)}, \beta_0^{(2)},...\}, \{\beta_1^{(1)}, \beta_1^{(2)},...\}. Then for a fixed covariate value x we can use regular Monte Carlo simulation to find the posterior distribution of the Bernoulli parameter p:

p^{(j)} = \text{logit}^{-1}(\beta_0^{(j)} + \beta_1^{(j)} x)

I think here is where there may be a miscommunication, because this parameter p can be referred to as “the probability”. And certainly, we can use these transformed posterior samples to generate an uncertainty interval for p. But what I was attempting to describe previously was predicting the event, \text{Pr}(y=1 \mid x), for a single future trial where we would simulate Bernoulli outcomes:

\tilde{y}^{(j)} \sim \text{Bernoulli}(p^{(j)})

and compute the usual Monte Carlo estimate of \text{Pr}(y=1 \mid x) as \frac{1}{n} \sum_{j=1}^n \tilde{y}^{(j)}.

In my mind this is the same process as if we postulated a y_i \sim \text{Normal}(\mu, \sigma) model, found posterior samples for \mu and \sigma (equivalent to p), simulated the posterior predictive distribution \tilde{y}^{(j)}\sim \text{Normal}(\mu^{(j)}, \sigma^{(j)}), and computed the event probability \text{Pr}(\tilde{y}>c) (equivalent to \text{Pr}(y=1)). It seems odd to me in this case to come up with an uncertainty interval for this posterior predictive probability.
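A small simulation may make the distinction concrete. Below, independent normal draws stand in for MCMC output (all numbers are illustrative); the uncertainty interval for the parameter p and the single posterior predictive probability Pr(ỹ = 1 | x) are computed side by side, and the latter is just the posterior mean of p:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# Stand-ins for MCMC draws of the coefficients (illustrative values)
beta0 = rng.normal(-1.0, 0.3, size=n)
beta1 = rng.normal(0.5, 0.2, size=n)

x = 1.0
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))  # posterior draws of p

# Uncertainty interval for the *parameter* p
p_lo, p_hi = np.quantile(p, [0.025, 0.975])

# Posterior predictive probability of the single binary event:
# one simulated Bernoulli outcome per posterior draw, then average.
y_tilde = rng.binomial(1, p)
pred_prob = y_tilde.mean()  # Monte Carlo estimate of Pr(y=1 | x)

# pred_prob converges to p.mean(): the predictive answer is one number,
# while [p_lo, p_hi] describes uncertainty about the parameter p.
```

This illustrates the point in the post: the predictive probability already integrates over parameter uncertainty, whereas the interval is a statement about the parameter p itself.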


If that were true, it would be OK to study 2 patients, find that one had the disease, and give a probability estimate of 0.5 for the next patient.

Studying two patients seems admissible under what I have described above, supposing we are able to supply an extremely informative prior distribution, in which case I doubt we would give an event probability estimate of 0.5. Put more succinctly: giving a probability estimate of 0.5 is a legal statement conditional on a very bad model.

Do you find a problem with the process I have described above?