Using a mixed logistic regression model to compare classifiers on a test dataset

When people compare classifiers on a common test dataset, the most common approach I see is linear regression on % accuracy.

However, this feels wasteful to me: it boils a whole dataset down to a single value per model, and ignores the fact that the test dataset could be very large and therefore provide high precision.
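To put a number on that precision, here is a quick back-of-the-envelope sketch in base R, using the 5,000-image test set from the example below and an illustrative (assumed) true accuracy of 0.85:

```r
# Precision of a single %-accuracy estimate on n = 5,000 test images,
# assuming (hypothetically) a classifier with true accuracy 0.85.
n <- 5000
p <- 0.85
se <- sqrt(p * (1 - p) / n)       # binomial standard error of the proportion
ci <- p + c(-1.96, 1.96) * se     # approximate (Wald) 95% confidence interval
round(se, 4)                      # ~0.0051
round(ci, 3)                      # ~(0.840, 0.860)
```

A single accuracy figure also discards the per-image pairing: two models scored on the same 5,000 images have correlated errors, which is exactly what an image-level term in a per-prediction model can exploit.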

To give an example, I have a common testing dataset of 5,000 radiology images, and I want to compare families of classifiers (neural networks). Within each family of neural networks (e.g. EfficientNet, ResNet) there will be models of increasing numbers of parameters (e.g. EfficientNetB0, EfficientNetB1 … EfficientNetB7).

My initial thought was that, instead of regressing accuracy with linear regression, I could fit a mixed-effects logistic regression on the per-image correctness:

library(lme4)

model <- glmer(correct_prediction ~ model_size + (1 | model_family/specific_model) + (1 | image_id), 
               data = data, family = binomial)

But I’ve not seen this approach before.

Also, since model family enters only as a random effect rather than a fixed effect, I don’t think this would let me compare individual families of networks?

Instead, could I do something like this:

model <- glmer(correct_prediction ~ model_family + model_size + (1 | model_family/specific_model) + (1 | image_id), 
               data = data, family = binomial)

Where I include model_family as a fixed effect as well as a random effect.
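With model_family as a fixed effect, the family comparison can be read directly off the coefficients as an odds ratio. A minimal self-contained sketch on simulated data (all values made up; a plain glm stands in for the glmer fit, which adds the random effects but whose fixed-effect coefficients are interpreted the same way):

```r
# Toy illustration: comparing two hypothetical families via a fixed effect.
set.seed(42)
n <- 2000
family_f <- factor(rep(c("EfficientNet", "ResNet"), each = n / 2))
size <- runif(n, 5, 70)                          # parameters in millions (made up)
logit <- -0.5 + 0.02 * size + 0.4 * (family_f == "ResNet")
correct <- rbinom(n, 1, plogis(logit))           # simulated per-image correctness

fit <- glm(correct ~ family_f + size, family = binomial)
exp(coef(fit)["family_fResNet"])                 # odds ratio: ResNet vs EfficientNet
exp(confint.default(fit)["family_fResNet", ])    # Wald 95% CI for that ratio
```

With more than two families, pairwise contrasts on this fixed effect (e.g. via the emmeans package) would give all the family-vs-family comparisons at once.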

Finally, I’m almost certain model_size will have a non-linear relationship with accuracy (moving from small to medium models will offer big boosts, but medium to large less so), so I was thinking of modelling this with a restricted cubic spline, like so:

library(rms)

# Note: datadist()/options(datadist = ...) only affect rms fitting
# functions (lrm, ols, ...); glmer ignores them, so these two lines
# can be dropped when fitting with glmer.
dat <- datadist(data)
options(datadist = 'dat')

model <- glmer(correct_prediction ~ rcs(model_size, 3) + model_family + (1 | model_family/specific_model) + (1 | image_id), 
               data = data, family = binomial)
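An alternative that avoids the rms setup entirely is a natural cubic spline basis from the base-R splines package; with 3 knots an rcs has 2 basis columns, matching ns() with df = 2. A sketch (the model_size values are rough, assumed parameter counts for EfficientNetB0–B7, in millions):

```r
library(splines)  # ships with base R

# Illustrative model sizes, millions of parameters (approximate, assumed)
model_size <- c(5.3, 7.8, 9.2, 12, 19, 30, 43, 66)

basis <- ns(model_size, df = 2)   # natural (restricted) cubic spline basis
dim(basis)                        # 8 observations x 2 spline columns

# In the glmer formula this would read:
#   correct_prediction ~ ns(model_size, df = 2) + model_family + ...
```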

Does this seem a sensible approach? As I’ve said, I haven’t really seen people use logistic regression for test scores, and yet to me it feels quite a logical (sorry) approach. And by including image_id as a random effect, it effectively gives me a paired analysis for each image.

For reasons detailed here the proportion classified correctly is not a good measure.