Risk Prediction Problem




I have a data matrix with 2,000 observations and 30 predictors, but only 160 of the 2,000 observations have an observed outcome: a 3-class categorical variable (classes A, B, and C). The goal is to rank the remaining 1,840 observations by their probability of belonging to class A and to target the top 100 for an intervention.

I’m fitting a pre-specified multinomial logistic regression model with an elastic net penalty, training it on the 160 labelled observations and then applying it to the full matrix of 2,000 observations. I realize this is a rather weak model, but I have to make do with what I’ve got.
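To make the ranking step concrete, here is a minimal sketch of how class probabilities from a multinomial logistic model can be turned into a top-k list. It assumes the coefficients have already been estimated elsewhere; the names `class_probabilities` and `top_k_by_class`, the toy coefficients, and the toy rows are all hypothetical and only illustrate the mechanics, not the actual fitted model.

```python
import math

def class_probabilities(x, coefs, intercepts):
    """Softmax class probabilities for one observation under a multinomial logit."""
    scores = [b0 + sum(b * xi for b, xi in zip(bs, x))
              for bs, b0 in zip(coefs, intercepts)]
    m = max(scores)  # subtract the maximum score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_by_class(rows, coefs, intercepts, class_index=0, k=100):
    """Indices of the k rows with the highest predicted probability of one class."""
    probs = [class_probabilities(x, coefs, intercepts)[class_index] for x in rows]
    order = sorted(range(len(rows)), key=lambda i: probs[i], reverse=True)
    return order[:k]

# Hypothetical toy inputs: 2 predictors, 3 classes in the order (A, B, C).
coefs = [[1.0, -0.5], [0.0, 0.0], [-1.0, 0.5]]  # one coefficient row per class
intercepts = [0.0, 0.0, 0.0]
rows = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Rank the toy rows by P(class A) and keep the top 2.
top = top_k_by_class(rows, coefs, intercepts, class_index=0, k=2)
print(top)
```

In the real problem, `rows` would hold the 1,840 unlabelled observations and `k` would be 100. The ranking depends only on the ordering of the estimated P(A), so it is only as trustworthy as that ordering.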

Given the small sample size, I’ve avoided external validation and instead used internal validation: stratified 10-fold cross-validation, repeated 1,000 times. The candidate predictors are mainly count variables, many with clumping at zero and many highly correlated with one another. But since this is a prediction/class probability estimation problem, based on what I’ve read I think this is fine.
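For reference, the resampling scheme described here can be sketched as follows. This is an illustrative standard-library implementation, not the code actually used; the function names are hypothetical, and the class counts in the toy labels (40 A, 70 B, 50 C) are invented, since the real class distribution is not stated.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, rng):
    """Assign each observation a fold id so every fold has similar class counts."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [None] * len(labels)
    offset = 0
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)  # randomise which members of the class land in which fold
        for j, i in enumerate(idx):
            # Deal indices round-robin; carrying the offset keeps fold sizes even.
            fold_of[i] = (offset + j) % n_folds
        offset += len(idx)
    return fold_of

def repeated_stratified_cv(labels, n_folds=10, n_repeats=1000, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for every split."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        fold_of = stratified_folds(labels, n_folds, rng)
        for f in range(n_folds):
            test = [i for i, g in enumerate(fold_of) if g == f]
            train = [i for i, g in enumerate(fold_of) if g != f]
            yield rep, f, train, test

# Hypothetical labels for the 160 observations with an observed outcome.
labels = ["A"] * 40 + ["B"] * 70 + ["C"] * 50

# Three repeats here to keep the example quick; the post uses 1,000.
splits = list(repeated_stratified_cv(labels, n_folds=10, n_repeats=3, seed=1))
print(len(splits))
```

Each split would be used to refit the whole modelling procedure on `train` and score it on `test`, with the performance measure averaged over all folds and repeats.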

Does the approach described above sound sensible? I really want to make sure this isn’t complete rubbish, as it can potentially have huge implications - we want to be treating the right ones first!

Thank you very much for your help!



The likelihood that the resulting model will be close to the unknown truth is very low with a sample size of less than, say, 50,000. Your effective sample size seems to be 160, which is insufficient for analyzing even a single pre-specified predictor when the outcome is a low-information categorical variable. This is not a fruitful exercise.

This type of question is better suited for stats.stackexchange.com.