When people compare classifiers on a common test dataset, the most common approach I see is to reduce each model's performance to a single % accuracy and then compare those values, e.g. with linear regression.
However, this feels wasteful to me: it boils a whole dataset down to a single value per model, and ignores the fact that the test dataset could be very large and therefore provide high precision.
To give an example, I have a common test dataset of 5,000 radiology images, and I want to compare families of classifiers (neural networks). Within each family (e.g. EfficientNet, ResNet) there will be models with increasing numbers of parameters (e.g. EfficientNetB0, EfficientNetB1 … EfficientNetB7).
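For concreteness, the data for such a model would need to be in long format, one row per image × model combination. A minimal sketch (the model names, parameter counts, and outcomes are made up; the column names simply match the formulas below):

```r
# Hypothetical long-format data: one row per (image, model) pair;
# all values are illustrative, not real results.
data <- expand.grid(
  image_id       = factor(seq_len(5000)),
  specific_model = c("EfficientNetB0", "EfficientNetB4", "ResNet18", "ResNet50")
)
data$model_family <- ifelse(grepl("^EfficientNet", data$specific_model),
                            "EfficientNet", "ResNet")
# Parameter counts in millions (illustrative figures)
sizes <- c(EfficientNetB0 = 5.3, EfficientNetB4 = 19, ResNet18 = 11.7, ResNet50 = 25.6)
data$model_size <- sizes[as.character(data$specific_model)]
# 0/1 outcome: did this model classify this image correctly? (random placeholder)
data$correct_prediction <- rbinom(nrow(data), 1, 0.8)
```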
My initial thought was that, instead of regressing accuracy with linear regression, I could fit a mixed-effects logistic regression on the per-image outcomes:
model <- glmer(correct_prediction ~ model_size + (1 | model_family/specific_model) + (1 | image_id),
               data = data, family = binomial)
But I’ve not seen this approach used before. Also, because model_family enters only as a random effect rather than a fixed effect, I don’t think this model would let me compare individual families of networks?
Instead, could I do something like this:
model <- glmer(correct_prediction ~ model_family + model_size + (1 | model_family/specific_model) + (1 | image_id),
               data = data, family = binomial)
where I include model_family as a fixed effect as well as a random effect.
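With model_family as a fixed effect, family-level comparisons could then be read off the coefficients directly, or, assuming the emmeans package is available, as pairwise contrasts on the fitted glmer model:

```r
library(emmeans)  # assumed available; supports merMod/glmerMod fits

# Estimated marginal means per family plus all pairwise contrasts;
# type = "response" back-transforms from the log-odds scale
emmeans(model, pairwise ~ model_family, type = "response")
```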
Finally, I’m almost certain model_size will have a non-linear relationship with the outcome (moving from small to medium models offers big accuracy boosts, but medium to large much less), so I was thinking of modelling it with a restricted cubic spline, like so:
library(rms)   # provides rcs() for the restricted cubic spline basis
library(lme4)

dat <- datadist(data)
options(datadist = 'dat')   # only needed by rms's own fitting/plotting functions, not by glmer

model <- glmer(correct_prediction ~ rcs(model_size, 3) + model_family +
                 (1 | model_family/specific_model) + (1 | image_id),
               data = data, family = binomial)
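One way to sanity-check the shape of the fitted spline would be to plot population-level predicted probabilities against model size. A sketch (re.form = NA sets the random effects to zero, giving population-level predictions):

```r
# Predict accuracy over a grid of model sizes, ignoring the random effects
newdat <- expand.grid(
  model_size   = seq(min(data$model_size), max(data$model_size), length.out = 100),
  model_family = unique(data$model_family)
)
newdat$p_correct <- predict(model, newdata = newdat, re.form = NA, type = "response")
plot(p_correct ~ model_size, data = newdat,
     col = as.factor(newdat$model_family), pch = 16)
```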
Does this seem a sensible approach? As I’ve said, I’ve not really seen people using logistic regression for test scores, and yet to me it feels quite a logical (sorry) approach? And by adding image_id as a random effect, it’s effectively allowing me to do a paired analysis for each image.