Predictive modelling for continuous outcome

Most of the literature considers the scenario of a binary outcome, e.g. the new BMJ series on Evaluation of clinical prediction models, likely because, as Harrell’s RMS book says, “in medicine, two of the major areas of application are diagnosis and prognosis”. Early in the BMJ paper (above) they note that “clinical decision making can also be informed by models that estimate continuous outcome values”. The paper they reference relies on the “minimum 10 people per candidate predictor rule of thumb” and uses backward selection etc., so I hesitate to be guided by it: it feels crude and not aligned with current thinking on stepwise methods. Searching the web has produced nothing (all the search results insist I’m interested in binary outcomes) and I can’t see anything quickly in the RMS book.

What are the equivalent methods to ROC curves when the outcome is continuous? Are we compelled to apply a threshold? I have seen this, i.e. authors producing multiple ROC curves for multiple assumed thresholds. I suggested this to the clinical client but they said the outcome should remain continuous (because this is how clinicians think about it).

If someone can point to a solid paper that builds a risk prediction model for a continuous outcome, that would be appreciated. Thanks!

Incidentally, I asked Grok (Twitter’s AI bot) and it said:

  • Model Development: Choose an appropriate regression model for your continuous outcome. Some popular options include linear regression, Poisson regression, or generalized linear models (GLMs).

  • Model Fitting: Split your dataset into training and testing sets. Fit your chosen model to the training data and assess its performance using the testing data.

  • Model Evaluation: Evaluate the model’s performance using metrics like the coefficient of determination (R^2), mean squared error (MSE), or mean absolute error (MAE). These metrics will help you understand how well the model predicts the outcome.

  • Model Validation: Validate your model using techniques like cross-validation or bootstrapping to ensure its robustness and generalizability.

  • Model Interpretation: Interpret the model’s coefficients to understand the relationship between the predictors and the outcome. This will help you identify the risk factors associated with the outcome.

I asked: but isn’t R^2 criticised because it’s data dependent?

You’re absolutely right, R-squared has been criticized for being data-dependent. It’s like that one friend who always tells you what you want to hear, but deep down, you know they’re just trying to make you feel good. … It only tells you how well your model fits the data you’ve used to create it, not how well it will predict new data. … R-squared is a bit like a chameleon – it can change its colors to fit the situation, but it’s not always the most reliable indicator of a model’s true predictive power. So, keep an open mind, and don’t let R-squared fool you into thinking your model is the next big thing in risk prediction modeling

Sounds like a paper I want to avoid reading.

I like R^2 and indexes covered here. It is somewhat data dependent but still quite useful.

Continuous Y is easier to deal with than categorical or censored Y. The paper should not have discussed cut points and certainly should respect the universal advice that cut points be used, if at all, only on outputs, never on inputs to an analysis. It is very easy to get \Pr(Y \geq y | X) once the continuous model has been fitted for Y. With ordinal models it is even easier and doesn’t assume a distribution shape.
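
To make the \Pr(Y \geq y | X) point concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the data are simulated (blood pressure against age, with made-up coefficients), the model is a one-predictor least-squares fit, residuals are assumed normal with constant variance, and the helper names `fit_simple_ols` and `prob_exceeds` are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def fit_simple_ols(x, y):
    """Least-squares intercept, slope and residual SD for a single predictor."""
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma = (rss / (len(x) - 2)) ** 0.5  # residual SD on n - 2 degrees of freedom
    return intercept, slope, sigma


def prob_exceeds(threshold, x_new, intercept, slope, sigma):
    """Pr(Y >= threshold | X = x_new), assuming normal constant-variance residuals."""
    mu = intercept + slope * x_new
    return 1.0 - NormalDist(mu, sigma).cdf(threshold)


# Simulated (hypothetical) data: systolic blood pressure rising with age.
rng = random.Random(1)
age = [rng.uniform(30, 80) for _ in range(200)]
sbp = [100 + 0.6 * a + rng.gauss(0, 12) for a in age]

b0, b1, s = fit_simple_ols(age, sbp)
p_50 = prob_exceeds(140, 50, b0, b1, s)
p_70 = prob_exceeds(140, 70, b0, b1, s)
print(f"Pr(SBP >= 140 | age 50) = {p_50:.2f}")
print(f"Pr(SBP >= 140 | age 70) = {p_70:.2f}")
```

The threshold is applied only to the model output, so Y stays continuous during fitting. This plug-in calculation ignores uncertainty in the estimated coefficients, and the normality assumption is the part an ordinal model would let you drop.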

The only fundamentally different things about continuous Y are (1) you need to have properly transformed Y if not using an ordinal model, and (2) the higher effective sample size (ESS). When Y doesn’t have serious measurement errors, the ESS is n for n subjects. For a binary Y, the ESS is 3np(1-p), where p is the fraction of subjects with Y=1.
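
As a quick worked example of the 3np(1-p) arithmetic, the sketch below evaluates it for a few outcome fractions. The function name `effective_sample_size_binary` and the chosen values of n and p are assumptions for illustration only.

```python
def effective_sample_size_binary(n, p):
    """Approximate effective sample size for a binary Y: 3 * n * p * (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return 3.0 * n * p * (1.0 - p)


n = 1000
for p in (0.5, 0.2, 0.05):
    ess = effective_sample_size_binary(n, p)
    print(f"n = {n}, p = {p:.2f}: binary ESS ~ {ess:.1f}, versus {n} for continuous Y")
```

Even in the best case for a binary outcome (p = 0.5) the formula gives 0.75n, and it shrinks quickly as the outcome becomes rarer.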

RMS course notes has a new case study for continuous Y with linear models here.


Many thanks. Very useful, a kind of “forget everything you thought you knew about X”. Specificity, sensitivity, ROC curves, NRI, c-index: all out. AIC, R^2, ‘relative explained variation’ etc.: in. I think the data scientists are using the former, and I have been told they are setting the standard and I should match them. Maybe not!

Matching them would be a terrible mistake :)


Don’t! If you “match them” you would be doing things like:

  • Splitting a dataset of n=100 into a training dataset of n1=60 and a test dataset of n2=40, as if this were a dogmatic requirement
  • Categorizing a perfectly good and information rich continuous response
  • Predicting negative counts when the response is a count measure and then remapping those to 0
  • Ignoring censored data and survivorship bias
  • Conflating well-established terminology (e.g. “inference” is now used by the ML community to mean solely “predict”; similar arguments apply to “counterfactual” and “DAG”).

These are some of the things I witness time and again in my work at the intersection of statistics and data science.


Sorry for joining your conversation. My comments may be naive and long-winded, but I am trying to understand something. I have been reading the RMS book and also noticed the new BMJ series. I want to confirm whether I understand your conversation correctly to ensure that I am not making any mistakes.

My doubt is about R^2. Prof. Greenland seems to be against R^2 (maybe I misunderstood something, but I have seen earlier comments of his that R^2 has limitations). I feel that Prof. Harrell still recognizes R^2. So my approach is to use R^2 according to the RMS book.

Additionally, my understanding is that when the outcome is continuous, the mean squared error (MSE) or mean absolute error (MAE) is a good indicator of performance. If my idea is wrong, please let me know.

I know that backward selection (stepwise regression) is considered bad practice. Are “minimum 10 people per predictor” and “clinical decision-making can also be informed by models that estimate continuous outcome values” bad practices? (Of course, using a cutoff is bad.) If so, could you please provide some detailed explanations? Thank you for your kind help.

You stated things well before the last paragraph. The 15:1 rule (effective sample size of 15 per candidate parameter) is a rough rule that needs to be replaced by the in-depth sample size method of Richard Riley et al. in Statistics in Medicine. And clinical practice can easily be informed by continuous outcome variables such as blood pressure.

Wouldn’t optimism-corrected indices ameliorate that? I think by bootstrapping R^2 (and other indices), it gives us some sense of future performance on data from the same stream.
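
For readers who have not seen it written out, below is a minimal sketch of an optimism bootstrap for R^2: refit on each bootstrap resample, compare that fit's R^2 on the resample with its R^2 on the original data, and subtract the average difference from the apparent R^2. Everything specific here is an assumption for illustration: the data are simulated, the model has a single predictor (so the optimism should be small), and the helper names are hypothetical. It is not the implementation used by any particular package.

```python
import random
from statistics import fmean


def fit_line(x, y):
    """Least-squares (intercept, slope) for one predictor."""
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


def r_squared(model, x, y):
    """R^2 of a fixed, already-fitted model evaluated on the data (x, y)."""
    intercept, slope = model
    ybar = fmean(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - sse / sst


def optimism_corrected_r2(x, y, n_boot=200, seed=0):
    """Return (apparent R^2, apparent R^2 minus the mean bootstrap optimism)."""
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(fit_line(x, y), x, y)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        model_b = fit_line(xb, yb)
        # R^2 on the resample used for fitting, minus R^2 on the original data
        optimism.append(r_squared(model_b, xb, yb) - r_squared(model_b, x, y))
    return apparent, apparent - fmean(optimism)


# Simulated (hypothetical) small dataset
data_rng = random.Random(42)
x = [data_rng.uniform(0, 10) for _ in range(40)]
y = [1.0 + 0.8 * xi + data_rng.gauss(0, 3) for xi in x]

apparent, corrected = optimism_corrected_r2(x, y)
print(f"apparent R^2 = {apparent:.3f}, optimism-corrected R^2 = {corrected:.3f}")
```

In a real analysis every data-driven modelling step (transformations, variable selection, tuning) would have to be repeated inside the loop, otherwise the correction is too small.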

I think the issue is with the index itself, not its bias. Perhaps one can think of R^2 as an excellent way to quantify explained outcome variation for the dataset at hand and other datasets that are similar to it. Quantities such as mean absolute errors may possibly be more useful for describing likely future performance. I’d like to see more discussion about this.
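
One way to see the data dependence is a small simulation. This is a sketch under assumed, made-up settings: the same linear relationship and the same residual SD in both datasets, differing only in how widely the predictor is spread. The function names are hypothetical.

```python
import random
from statistics import fmean


def r2_and_mae(x, y):
    """Fit a one-predictor least-squares line; return in-sample (R^2, MAE)."""
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1.0 - sum(r * r for r in resid) / sst
    mae = fmean(abs(r) for r in resid)
    return r2, mae


def simulate(x_low, x_high, n=2000, seed=0):
    """Same true line and residual SD (5) every time; only the X range varies."""
    rng = random.Random(seed)
    x = [rng.uniform(x_low, x_high) for _ in range(n)]
    y = [2.0 + 1.0 * xi + rng.gauss(0, 5) for xi in x]
    return x, y


r2_narrow, mae_narrow = r2_and_mae(*simulate(40, 60, seed=0))
r2_wide, mae_wide = r2_and_mae(*simulate(20, 80, seed=1))
print(f"narrow X range: R^2 = {r2_narrow:.2f}, MAE = {mae_narrow:.2f}")
print(f"wide X range:   R^2 = {r2_wide:.2f}, MAE = {mae_wide:.2f}")
```

With the residual SD held fixed, R^2 should come out clearly higher in the wide-range dataset while MAE stays about the same, because R^2 is relative to the outcome variation in the sample and MAE is on the absolute scale of Y.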

Regardless of which metrics you use, once you have settled on a model (or set of models), plotting predicted vs. observed with the identity line overlaid is always an interesting visualization at testing/external validation time. It will provide you with a general idea of what the true outcome values might be, given a predicted value.
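
A minimal sketch of such a plot in Python is given below. It assumes matplotlib is installed (the function returns None otherwise), uses simulated predictions and outcomes rather than real validation data, and the function name `predicted_vs_observed_png` is hypothetical.

```python
import io
import random


def predicted_vs_observed_png(predicted, observed):
    """Scatter observed against predicted with the identity line overlaid.

    Returns the figure as PNG bytes, or None if matplotlib is not installed.
    """
    try:
        import matplotlib
        matplotlib.use("Agg")  # non-interactive backend, no display needed
        import matplotlib.pyplot as plt
    except ImportError:
        return None

    lo = min(min(predicted), min(observed))
    hi = max(max(predicted), max(observed))

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(predicted, observed, s=12, alpha=0.6)
    ax.plot([lo, hi], [lo, hi], "k--", label="identity line (observed = predicted)")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Observed")
    ax.legend()

    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()


# Simulated (hypothetical) validation data: predictions plus noise
rng = random.Random(3)
predicted = [rng.uniform(100, 160) for _ in range(150)]
observed = [p + rng.gauss(0, 10) for p in predicted]

png = predicted_vs_observed_png(predicted, observed)
print("matplotlib not available" if png is None else f"PNG of {len(png)} bytes")
```

Points scattered evenly around the dashed line suggest predictions that are right on average across the range. Systematic departure at the low or high end suggests miscalibration there.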