Question re: classification model evaluation

Hi all,

I am in the midst of peer review on a paper developing a probabilistic binary classifier for a geoscience application. Here are a few key properties of the research purpose and dataset:

  • The dataset is fairly small (~180 cases) and imbalanced towards positive outcomes (~3:1)
  • Both positive and negative predictions will be of significance to users.
  • The misclassification cost varies from use case to use case.
  • It is generally assumed that the data collection is biased towards positive outcomes but there is no quantitative estimate of the natural class frequency in the population as a whole.

I used typical predictive-modeling tools: up-sampling the minority class to balance the training data, and performance metrics based on the ROC curve and AUC, calculated on an independent test set that happened to have the same class ratio as the original training data. One reviewer commented that we should calculate the precision-recall curve and average F-M scores because they are “more sensitive to class imbalance problems” and “can identify sampling bias”.
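For concreteness, the workflow was roughly the following (a minimal sketch in Python with scikit-learn, not the actual study code; `df`, `features`, and `"label"` are placeholder names):

```python
# Minimal sketch of the workflow described above (Python/scikit-learn), not the
# actual study code. `df`, `features`, and "label" are placeholder names.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["label"], test_size=0.25, stratify=df["label"], random_state=0
)

# Up-sample the minority class in the training data only.
train = pd.concat([X_train, y_train], axis=1)
counts = train["label"].value_counts()
minority = train[train["label"] == counts.idxmin()]
majority = train[train["label"] == counts.idxmax()]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_bal = pd.concat([majority, minority_up])

model = LogisticRegression(max_iter=1000).fit(train_bal[features], train_bal["label"])

# Evaluate on the untouched test set, which keeps the original class ratio.
p_test = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, p_test)
print("test AUC:", roc_auc_score(y_test, p_test))
```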

My initial response to this was that we care equally about accurately predicting positive and negative outcomes, and the dataset is relatively balanced compared to the class skews seen in the papers that call for using PR over ROC. Thus, ROC would be a better evaluation tool. Does anyone have any input on this? I am aware that there are other alternatives to ROC for imbalanced data – we wanted a metric that was widely known and easily implementable by other geoscience practitioners.

The paper they cite claims that a difference in precision and recall within and between classes (i.e. when the data labels are switched) indicates sampling bias in the training data. This was based on a logistic regression simulation: they created two synthetic populations (one balanced, one skewed) and drew samples of varying class ratios by fixing the size of the majority class (which they labeled 0) at 5000 and drawing an appropriate number of the minority class (labeled 1). They then fit logistic regressions to these samples and calculated ROC and PR curves and performance metrics from the same sample data. Their results showed (as would be expected?) that ROC-derived metrics did not change with the class ratio of the testing/training data, but PR-derived metrics did – particularly when the minority class was labeled as the class of interest.
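To make that setup concrete, this is how I read their procedure (my own sketch under assumed distributions, not the authors' code):

```python
# My approximation of the cited simulation (not the authors' code): fix the
# majority class (label 0) at 5000, vary the minority class (label 1), fit a
# logistic regression, and score ROC- and PR-derived metrics on the same sample.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_majority = 5000

for n_minority in (5000, 1000, 500, 100):
    # One informative predictor; class 1 is shifted relative to class 0.
    x0 = rng.normal(loc=0.0, scale=1.0, size=n_majority)
    x1 = rng.normal(loc=1.0, scale=1.0, size=n_minority)
    X = np.concatenate([x0, x1]).reshape(-1, 1)
    y = np.concatenate([np.zeros(n_majority), np.ones(n_minority)])

    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    print(
        f"minority n={n_minority:5d}  "
        f"ROC AUC={roc_auc_score(y, p):.3f}  "
        f"average precision (class 1)={average_precision_score(y, p):.3f}"
    )
```

With a fixed separation between the classes, the ROC-derived metric stays roughly constant as the minority class shrinks, while the PR-derived metric for class 1 falls, which is the pattern they report.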

The explicit claim in the paper is that their results show that “the difference in the mean Prec, Rec, and F-M within the class and between the classes can be used as an indicator of sampling bias and the resulting inaccuracy of the predicted probabilities using the MLLR model”.

My gut feeling is that this claim is false and that the stated results would hold for any test data with an imbalanced class ratio, independent of the model being evaluated. Does anyone have any insight to shed on this?

You didn’t make a case for forced-choice classification vs. estimation of tendencies (probabilities). Since the latter is usually the goal, none of the methods you mentioned are applicable; you should consider the Brier score and various measures of explained variation, as discussed here.
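For example (a minimal sketch with placeholder names; the rescaling shown is one common explained-variation-style scaling of the Brier score, not necessarily the one in the linked discussion):

```python
# Minimal sketch of scoring the predicted probabilities directly. `y_test` are
# the 0/1 outcomes and `p_test` the predicted probabilities (placeholder names).
import numpy as np
from sklearn.metrics import brier_score_loss

brier = brier_score_loss(y_test, p_test)  # mean squared error of the probabilities
# Reference forecast: always predict the observed prevalence.
brier_ref = brier_score_loss(y_test, np.full(len(y_test), y_test.mean()))
scaled = 1 - brier / brier_ref            # explained-variation-style rescaling
print(f"Brier = {brier:.3f}, scaled Brier = {scaled:.3f}")
```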


This actually raises a good topic of discussion. A point of context: the intent of the model is to predict the probability of a certain geohazard (think landslide, tsunami, etc.) occurring at a site given various predictors. The predicted probabilities will be used in a decision-making context to determine whether the hazard requires mitigation or not. I agree that a proper scoring rule would have been a better choice for this work; I have to admit I was a bit misled by the current dominance of machine-learning literature in the predictive-modeling field. Do you have any input on the second part?
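To be concrete about the decision step: because the misclassification costs differ between users, one option is for each user to derive their own probability threshold from their own costs rather than fixing a single cutoff. An illustrative sketch with made-up numbers (all values hypothetical):

```python
# Illustration only: user-specific misclassification costs imply a user-specific
# probability threshold for acting on a predicted hazard. All numbers are made up.
def mitigation_threshold(cost_false_alarm: float, cost_missed_hazard: float) -> float:
    """Probability cutoff above which mitigating has lower expected cost."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_hazard)

p_hazard = 0.18  # model's predicted probability of the hazard at a site (hypothetical)
t = mitigation_threshold(cost_false_alarm=1.0, cost_missed_hazard=20.0)  # ~0.048
print("mitigate" if p_hazard >= t else "do not mitigate")
```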


The use of proper scoring rules makes “class imbalance” moot.

Could you elaborate on this a bit? My understanding is that, as long as the testing data is representative of the population, a model comparison based on a proper scoring rule will identify the model that is closer to the “truth”.
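In code, I picture that comparison as something like this (placeholder model and data names):

```python
# Sketch of the comparison described above: two candidate models scored with a
# proper scoring rule (Brier) on a representative test set. All names are placeholders.
from sklearn.metrics import brier_score_loss

p_a = model_a.predict_proba(X_test)[:, 1]
p_b = model_b.predict_proba(X_test)[:, 1]

scores = {"model_a": brier_score_loss(y_test, p_a),
          "model_b": brier_score_loss(y_test, p_b)}
best = min(scores, key=scores.get)  # lower Brier score = probabilities closer to the truth
print(scores, "->", best)
```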