Predictive modelling for a continuous outcome

Most of the literature considers the scenario of a binary outcome, e.g. the new BMJ series on evaluation of clinical prediction models. This is likely because, as Harrell's RMS book says, "in medicine, two of the major areas of application are diagnosis and prognosis". Early in the BMJ paper (above) they note that "clinical decision making can also be informed by models that estimate continuous outcome values". The paper they reference invokes the "minimum 10 people per candidate predictor rule of thumb" and uses backward selection etc., so I hesitate to be guided by it; it feels crude and not aligned with current thinking on stepwise methods. Searching the web has produced nothing (all the search results insist I'm interested in binary outcomes) and I can't see anything quickly in the RMS book.

What are the equivalent methods to ROC curves when the outcome is continuous? Are we compelled to apply a threshold? I have seen this done, i.e. authors producing multiple ROC curves for multiple assumed thresholds. I suggested this to the clinical client, but they said the outcome should remain continuous (because this is how clinicians think about it).

If someone can point to a solid paper where a risk prediction model is built for a continuous outcome, that would be appreciated. Thanks!

Incidentally, I asked Grok (Twitter's AI bot) and it said:

  • Model Development: Choose an appropriate regression model for your continuous outcome. Some popular options include linear regression, Poisson regression, or generalized linear models (GLMs).

  • Model Fitting: Split your dataset into training and testing sets. Fit your chosen model to the training data and assess its performance using the testing data.

  • Model Evaluation: Evaluate the model’s performance using metrics like the coefficient of determination (R^2), mean squared error (MSE), or mean absolute error (MAE). These metrics will help you understand how well the model predicts the outcome.

  • Model Validation: Validate your model using techniques like cross-validation or bootstrapping to ensure its robustness and generalizability.

  • Model Interpretation: Interpret the model’s coefficients to understand the relationship between the predictors and the outcome. This will help you identify the risk factors associated with the outcome.
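For reference, here is a minimal sketch of how the metrics in that list might be computed for a continuous outcome (simulated data, base R, and a naive 70/30 split purely for illustration; the splitting itself gets questioned later in the thread):

```r
## Minimal sketch of R^2, MSE and MAE for a continuous outcome, on hypothetical data
set.seed(1)
n <- 500
d <- data.frame(age = rnorm(n, 60, 10), sex = rbinom(n, 1, 0.5))
d$y <- 120 + 0.4 * d$age + 5 * d$sex + rnorm(n, 0, 10)   # hypothetical continuous outcome

idx   <- sample(seq_len(n), size = 0.7 * n)              # 70/30 split, illustration only
train <- d[idx, ]; test <- d[-idx, ]

fit  <- lm(y ~ age + sex, data = train)
pred <- predict(fit, newdata = test)

mse <- mean((test$y - pred)^2)                           # mean squared error
mae <- mean(abs(test$y - pred))                          # mean absolute error
r2  <- 1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)  # out-of-sample R^2
c(MSE = mse, MAE = mae, R2 = r2)
```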

I asked: but isn't R^2 criticised because it's data-dependent?

You’re absolutely right, R-squared has been criticized for being data-dependent. It’s like that one friend who always tells you what you want to hear, but deep down, you know they’re just trying to make you feel good. … It only tells you how well your model fits the data you’ve used to create it, not how well it will predict new data. … R-squared is a bit like a chameleon – it can change its colors to fit the situation, but it’s not always the most reliable indicator of a model’s true predictive power. So, keep an open mind, and don’t let R-squared fool you into thinking your model is the next big thing in risk prediction modeling

Sounds like a paper I want to avoid reading.

I like R^2 and the indexes covered here. It is somewhat data-dependent but still quite useful.

Continuous Y is easier to deal with than categorical or censored Y. The paper should not have discussed cut points and certainly should respect the universal advice that cut points be used, if at all, only on outputs, never on inputs to an analysis. It is very easy to get \Pr(Y \geq y | X) once the continuous model has been fitted for Y. With ordinal models it is even easier and doesn’t assume a distribution shape.
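As a small illustration (assuming a Gaussian linear model fitted with lm(); an ordinal model would avoid the distributional assumption), obtaining \Pr(Y \geq y | X) from the fitted model might look like this in R:

```r
## Minimal sketch: once a Gaussian linear model is fitted for continuous Y,
## Pr(Y >= y0 | X) = 1 - Phi((y0 - X*beta) / sigma). Data are simulated and hypothetical.
set.seed(2)
d <- data.frame(age = rnorm(300, 60, 10), sex = rbinom(300, 1, 0.5))
d$sbp <- 100 + 0.5 * d$age + 4 * d$sex + rnorm(300, 0, 12)   # hypothetical outcome

fit  <- lm(sbp ~ age + sex, data = d)
new  <- data.frame(age = c(50, 75), sex = c(0, 1))           # two hypothetical patients
y0   <- 140                                                  # threshold applied only to the output
p_ge <- 1 - pnorm(y0, mean = predict(fit, newdata = new), sd = sigma(fit))
p_ge    # estimated Pr(SBP >= 140 | age, sex)
```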

The only fundamentally different things about continuous Y are (1) you need to have properly transformed Y if not using an ordinal model, and (2) the higher effective sample size. When Y doesn’t have serious measurement errors, the effective sample size is n for n subjects. For a binary Y ESS is 3np(1-p) where p is the fraction of subjects with Y=1.
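A quick numerical illustration of the ESS formula with arbitrary numbers:

```r
## Effective sample size (ESS): continuous Y without serious measurement error has
## ESS ~ n, while binary Y has ESS ~ 3*n*p*(1-p). Numbers below are hypothetical.
n <- 1000
p <- c(0.5, 0.2, 0.05)                    # hypothetical prevalences of Y = 1
data.frame(p = p,
           ESS_binary     = 3 * n * p * (1 - p),
           ESS_continuous = n)
```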

The RMS course notes have a new case study for continuous Y with linear models here.

2 Likes

Many thanks. Very useful, a kind of "forget everything you thought you knew about X". Specificity, sensitivity, ROC curves, NRI, c-index: all out. AIC, R^2, 'relative explained variation', etc.: in. I think the data scientists are using the former, and I have been told they are setting the standard and I should match them. Maybe not!

Matching them would be a terrible mistake 🙂

2 Likes

Don’t! If you “match them” you would be doing things like:

  • Splitting a dataset of n=100 into a training set of n1=60 and a test set of n2=40 as if it were a dogmatic requirement
  • Categorizing a perfectly good and information-rich continuous response
  • Predicting negative counts when the response is a count measure and then remapping those to 0
  • Ignoring censored data and survivorship bias
  • Conflating well-established terminology (e.g. "inference" is now used by the ML community to mean only "predict"; similar issues arise with "counterfactual" and "DAG")

These are some of the things I witness time and again in my work at the intersection of statistics and data science.

3 Likes

Sorry to jump into your conversation. My comments may be naive and long-winded, but I am trying to understand something. I have been reading the RMS book and also noticed the new BMJ series. I want to confirm whether I have understood your conversation correctly, to make sure I am not making any mistakes.

My question is about R^2. Prof. Greenland seems to be against R^2 (maybe I misunderstood something, but I have seen earlier comments of his that R^2 has limitations), whereas Prof. Harrell still seems to endorse it. So my approach is to use R^2 as described in the RMS book.

Additionally, my understanding is that when the outcome is continuous, the mean squared error (MSE) or mean absolute error (MAE) is a good way to evaluate performance. If I am wrong, please let me know.

I know that backward selection (stepwise regression) is considered bad practice. Are the "minimum 10 people per predictor" rule and the statement that "clinical decision-making can also be informed by models that estimate continuous outcome values" also bad practices? (Of course, using cutoffs is bad.) If so, could you please provide some detailed explanation? Thank you for your kind help.

You stated things well before the last paragraph. The 15:1 rule (an effective sample size of 15 per candidate parameter) is a rough rule that needs to be replaced by the in-depth sample size method of Richard Riley et al. in Statistics in Medicine. And clinical practice can easily be informed by continuous outcome variables such as blood pressure.
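As a back-of-the-envelope illustration of what the rough 15:1 heuristic implies (hypothetical numbers; the Riley et al. criteria are more refined, and I believe the pmsampsize R package implements them):

```r
## Rough 15:1 heuristic: upper bound on candidate parameters given the effective sample size.
n <- 1000
p <- 0.2                                   # hypothetical fraction with Y = 1 for the binary case
ess_continuous <- n
ess_binary     <- 3 * n * p * (1 - p)
c(max_params_continuous = floor(ess_continuous / 15),
  max_params_binary     = floor(ess_binary / 15))
```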

1 Like

Wouldn't optimism-corrected indices ameliorate that? I think bootstrapping R^2 (and other indices) gives us some sense of future performance on data from the same stream.
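For concreteness, this is roughly the optimism bootstrap I have in mind, sketched in base R with simulated data standing in for the real dataset:

```r
## Optimism-corrected R^2 via the bootstrap: apparent R^2 minus the average optimism,
## where optimism = (R^2 on the bootstrap sample) - (R^2 of that fit on the original data).
set.seed(3)
n <- 200
d <- data.frame(age = rnorm(n, 60, 10), sex = rbinom(n, 1, 0.5))
d$y <- 1 + 0.03 * d$age + 0.5 * d$sex + rnorm(n)            # hypothetical outcome

r2 <- function(fit, data) {                                  # R^2 of a fitted lm on 'data'
  pred <- predict(fit, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

fit_full    <- lm(y ~ age + sex, data = d)
r2_apparent <- r2(fit_full, d)

B <- 400
optimism <- replicate(B, {
  b    <- d[sample(nrow(d), replace = TRUE), ]               # bootstrap resample
  fitb <- lm(y ~ age + sex, data = b)
  r2(fitb, b) - r2(fitb, d)                                  # apparent minus test (original data)
})

r2_corrected <- r2_apparent - mean(optimism)
c(apparent = r2_apparent, corrected = r2_corrected)
```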

I think the issue is with the index itself, not its bias. Perhaps one can think of R^2 as an excellent way to quantify explained outcome variation for the dataset at hand and other datasets that are similar to it. Quantities such as mean absolute errors may possibly be more useful for describing likely future performance. I’d like to see more discussion about this.

Regardless of which metrics you use, once you have settled on a model (or set of models), plotting predicted vs. observed with the identity line overlaid is always an interesting visualization at testing/external validation time. It will provide you with a general idea of what the true outcome values might be, given a predicted value.
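A minimal base-R sketch of that plot, with hypothetical observed and predicted values:

```r
## Predicted vs. observed with the identity line overlaid, at validation time.
## 'pred' and 'obs' are hypothetical vectors from an external validation set.
set.seed(4)
obs  <- rnorm(150, 30, 8)
pred <- obs + rnorm(150, 0, 5)            # imperfect predictions, for illustration

plot(pred, obs, xlab = "Predicted", ylab = "Observed",
     main = "External validation: predicted vs. observed")
abline(a = 0, b = 1, lty = 2)             # identity line: perfect agreement
lines(lowess(pred, obs), col = "blue")    # smooth trend of observed given predicted
```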

1 Like

With the adequacy measure, the base model matters a lot. At first I used the null model, but then I wondered what happens if I use another. E.g., if I have models:

    1. null
    2. age + sex
    3. age + sex + age*sex
    4. age + sex + age*sex + X

and I compute the adequacy for X as (3 v 1)/(4 v 1), I get something like 95%, suggesting model 3 is doing a lot of the work. But if I compute it as (3 v 2)/(4 v 2), I get something like 10%. Assuming there isn't an error in my code (I can't use the rms package), I guess the conclusion is that sex contains a lot of the information, and the contribution of the interaction is relatively small when the base model is age + sex and the full model is age + sex + age*sex + X.
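In case it helps, this is roughly what my computation looks like, sketched in base R with simulated data (since I can't use rms):

```r
## Adequacy ratios from nested linear models, using base R log-likelihoods.
## LR chi-square of model A over base B is 2*(logLik(A) - logLik(B)). Data are simulated.
set.seed(5)
n <- 1000
d <- data.frame(age = rnorm(n, 60, 10), sex = rbinom(n, 1, 0.5), X = rnorm(n))
d$y <- 2 + 0.05 * d$age + 1.5 * d$sex + 0.02 * d$age * d$sex + 0.3 * d$X + rnorm(n)

m1 <- lm(y ~ 1,             data = d)   # null
m2 <- lm(y ~ age + sex,     data = d)
m3 <- lm(y ~ age * sex,     data = d)   # adds age:sex
m4 <- lm(y ~ age * sex + X, data = d)   # full model

lrchisq <- function(full, base) as.numeric(2 * (logLik(full) - logLik(base)))

## Adequacy of model 3 relative to the full model, against two different base models
c(base_null    = lrchisq(m3, m1) / lrchisq(m4, m1),
  base_age_sex = lrchisq(m3, m2) / lrchisq(m4, m2))
```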

It can be a little difficult explaining to clinicians why the LR test gives a highly significant p-value for the interaction while the adequacy of sex + age versus sex + age + sex*age is 95%.

“Sig p-value” is sample-size-dependent whereas relative \chi^2 is not. Interactions can be very significant (and often needed) even when they have a small partial R^2.

Adequacy of a model without X would be judged by 3 v 4, and subtracting from 1.0 would provide the relative predictive information of X.

Many thanks. And by 3 v 4 you mean (−2 log-likelihood for model 1 minus −2 log-likelihood for model 3) / (−2 log-likelihood for model 1 minus −2 log-likelihood for model 4), i.e. the ratio of the two LR χ² statistics against the null model? I really like the plot of relative explained variation here (at the bottom): https://www.fharrell.com/post/addvalue/. I may do this for the 'fraction of new information' as terms are added to the model; I think the clinical people might like that.
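If useful, a sketch of the kind of plot I have in mind, with simulated data and the LR χ² of each nested model expressed relative to the full model:

```r
## Relative explained variation (LR chi-square relative to the full model) as terms are added.
set.seed(6)
n <- 1000
d <- data.frame(age = rnorm(n, 60, 10), sex = rbinom(n, 1, 0.5), X = rnorm(n))
d$y <- 2 + 0.05 * d$age + 1.5 * d$sex + 0.02 * d$age * d$sex + 0.3 * d$X + rnorm(n)

m0   <- lm(y ~ 1, data = d)
fits <- list("age"        = lm(y ~ age,           data = d),
             "+ sex"      = lm(y ~ age + sex,     data = d),
             "+ age*sex"  = lm(y ~ age * sex,     data = d),
             "+ X (full)" = lm(y ~ age * sex + X, data = d))

lr  <- sapply(fits, function(m) as.numeric(2 * (logLik(m) - logLik(m0))))
rel <- lr / lr[length(lr)]                     # fraction of the full-model LR chi-square

dotchart(rel, xlim = c(0, 1), xlab = "Fraction of full-model LR chi-square",
         main = "Relative explained variation as terms are added")
```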

Yes that is what I was referring to.

1 Like

Hello again. I've become confused here. The first link you provided points to likelihood-based methods, and the second link (analysis of continuous Y) points to OLS and R^2 etc. Is it right to use the quantities described in the first link, i.e. adequacy and 1 − adequacy, for the example I described, i.e. continuous normal Y? I cannot see such an example in the RMS book (the Califf paper using adequacy is for time-to-event data). Although chapter 9 on maximum likelihood begins with the case of a binary outcome, in 9.3 we have the general case, and it says that "for large samples" LR statistics have a χ² distribution. Thus everything seems in order; the scale parameter is unknown for the normal case, but for large n am I OK to use the LR test?

You can use the relative log-likelihood for the Gaussian linear model case, but it is not customary. LR \chi^{2} = -n\log(1 - R^{2}). The customary approach is to use R^2 and ratios of R^2 to get relative explained variation, as these ratios are related to ratios of various sums of squares for Y explained.
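A quick numerical check of that identity for an ordinary linear model, with simulated data:

```r
## For a Gaussian linear model, the LR chi-square against the null model equals -n*log(1 - R^2).
set.seed(7)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(n)

fit0 <- lm(y ~ 1,       data = d)
fit1 <- lm(y ~ x1 + x2, data = d)

lr_chisq <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
from_r2  <- -n * log(1 - summary(fit1)$r.squared)

c(LR = lr_chisq, from_R2 = from_r2)   # the two agree
```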

1 Like

I assume Frank’s reply is tongue in cheek considering he is an author on said BMJ series!

Most papers I see assessing predictive accuracy for a continuous outcome calculate the RMSE. You can look up any given paper on imputation method accuracy and there's a good chance that's what they've used.

As for variable selection, in my experience with clinical predictive model development the process is the same for binary and continuous outcomes. It pays to begin with the exact clinical question you have about the modelling process: what is the outcome, what variables are important to that outcome based on clinical expertise, who is the patient population, how do you anticipate your predictions will affect the predicted outcome, etc. Often most of the explanatory power is carried by a small, clinically relevant set of variables that can be identified with a subject-matter expert. David Hand has a very nice paper about how a handful of variables typically do the bulk of the work: https://arxiv.org/pdf/math/0606441

For what it’s worth, I would also avoid using LLMs to guide you to an answer. They are more than likely just regurgitating whatever poor advice has been dumped into blog posts that have themselves been partly written by LLMs.

Regarding threshold selection, if you absolutely must select a threshold, this is an area my colleagues and I have been working in recently. We developed the predictNMB package in R to apply some simulation methods to the question (it's still a work in progress, but it is up on CRAN). We developed it for binary prediction, but now that I think about it, it shouldn't be too hard to extend to continuous outcomes. You can essentially calculate the costs and outcomes resulting from misclassification and then find the threshold that minimises said costs/outcomes.
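In the meantime, a rough base-R sketch of the underlying idea (not using predictNMB itself; the costs and data are hypothetical): assign costs to the two kinds of misclassification and scan candidate thresholds for the one that minimises expected cost.

```r
## Grid search for a cost-minimising decision threshold on a continuous predicted outcome.
set.seed(8)
obs  <- rnorm(1000, 30, 8)                 # true continuous outcome (e.g., from a validation set)
pred <- obs + rnorm(1000, 0, 5)            # model predictions

cost_fp <- 1                               # hypothetical cost of unnecessary treatment
cost_fn <- 5                               # hypothetical cost of a missed case
clinical_cut <- 35                         # value above which treatment is warranted (output-side cut)

thresholds <- seq(20, 50, by = 0.5)        # candidate decision thresholds on the predictions
expected_cost <- sapply(thresholds, function(t) {
  fp <- mean(pred >= t & obs <  clinical_cut)   # treated but truly below the cut
  fn <- mean(pred <  t & obs >= clinical_cut)   # untreated but truly at or above the cut
  cost_fp * fp + cost_fn * fn
})

thresholds[which.min(expected_cost)]       # threshold minimising the expected cost
```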

1 Like

What kind of R^2 do you see in medicine? It seems there is much noise, and a model may explain maybe 10% of the variation in the data. In the RMS book, examples show R^2 ranging from about 0.10 to 0.50, I think.

R^2 of 0.05 is frequently considered excellent when (1) the outcome is really hard to predict (e.g., day of death) and (2) competing models have even lower R^2.

2 Likes