Prediction models for class imbalances

ericjelovsek · June 7, 2019, 6:03pm

Hello everyone. I have a question regarding how to handle class imbalance in model building. We are trying to predict an outcome using a cohort of approximately 150,000 observations. Our outcome of interest is dichotomous and present in the dataset at a rate of about 4-5% (approx. 7200 observations).

From my reading, methods such as logistic regression and lasso should handle class imbalance well, especially when used with proper accuracy scoring rules. My question for the group is twofold: are other modeling methods worth exploring in this case, and is there a role for upsampling or hybrid sampling methods in these imbalanced circumstances?

f2harrell · June 7, 2019, 10:45pm

Any method requiring discarding some of your data is invalid. Any method that estimates probability of class membership will be OK. When an event is not very common, it is most appropriate to model tendencies (probabilities).
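
To make the point above concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration under stated assumptions, not anyone's actual analysis: the cohort is simulated, the single predictor and the coefficients (-3.3 and 1.0) are hypothetical values picked to give an event rate in the neighborhood of the one described, and the two-parameter Newton-Raphson fitter `fit_logistic` is a hand-written stand-in for a real logistic regression routine. It fits the same model once on all observations and once after discarding non-events to balance the classes, then compares the predicted probabilities using two proper scoring rules (Brier score and log loss).

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, iterations=25):
    # Maximum-likelihood fit of P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson.
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def mean(values):
    return sum(values) / len(values)


def brier_score(ps, ys):
    # Mean squared difference between predicted probability and outcome.
    return mean([(p - y) ** 2 for p, y in zip(ps, ys)])


def log_loss(ps, ys, eps=1e-12):
    # Negative mean log-likelihood; probabilities are clipped away from 0 and 1.
    total = 0.0
    for p, y in zip(ps, ys):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -total / len(ys)


# Simulated cohort with a rare binary outcome (all values here are hypothetical).
rng = random.Random(2019)
n = 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if rng.random() < sigmoid(-3.3 + 1.0 * x) else 0 for x in xs]
prevalence = mean(ys)

# 1) Fit on all observations, leaving the imbalance alone.
b0_full, b1_full = fit_logistic(xs, ys)
p_full = [sigmoid(b0_full + b1_full * x) for x in xs]

# 2) Discard non-events until the classes are balanced, then fit again.
events = [i for i, y in enumerate(ys) if y == 1]
non_events = [i for i, y in enumerate(ys) if y == 0]
kept = events + rng.sample(non_events, len(events))
b0_down, b1_down = fit_logistic([xs[i] for i in kept], [ys[i] for i in kept])
p_down = [sigmoid(b0_down + b1_down * x) for x in xs]

print(f"observed event rate:        {prevalence:.4f}")
print(f"mean predicted, full data:  {mean(p_full):.4f}")
print(f"mean predicted, downsample: {mean(p_down):.4f}")
print(f"Brier    full / downsample: {brier_score(p_full, ys):.4f} / {brier_score(p_down, ys):.4f}")
print(f"log loss full / downsample: {log_loss(p_full, ys):.4f} / {log_loss(p_down, ys):.4f}")
```

In this sketch the full-data fit reproduces the observed event rate on average (a property of maximum-likelihood logistic regression with an intercept), while the fit on the balanced subsample assigns far higher probabilities to everyone and scores worse on both rules when judged against the full cohort. The slope is estimated from far fewer observations in the second fit, which is the cost of discarding data.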