I am a statistician and recently I started to collaborate with machine learning/artificial intelligence experts on a study whose main aim is to build a prediction model (to make accurate predictions) given a large set of candidate predictors, both continuous and categorical.
So, with regard to a categorical predictor with K categories, I used to think that even when the purpose is prediction, you should include only K-1 dummy variables in the candidate predictor set. But I have noticed that the ML/AI people tend to include all of them (so for K=3, 3 dummies).
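To make the two conventions concrete, here is a minimal sketch using pandas on a toy 3-level factor (the data are made up for illustration):

```python
import pandas as pd

# Toy categorical predictor with K = 3 levels.
color = pd.Series(["red", "green", "blue", "green", "red"], name="color")

# Statistician's convention: K-1 dummies, one reference level dropped.
dummies_k_minus_1 = pd.get_dummies(color, drop_first=True)

# ML convention: full one-hot encoding, all K indicator columns kept.
dummies_k = pd.get_dummies(color)

print(dummies_k_minus_1.shape[1])  # 2 columns
print(dummies_k.shape[1])          # 3 columns
```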
Is that correct?
Thank you in advance for any thought on that.
My guess is they are effectively using a cell means model whereas you are including an intercept in your model.
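For what it's worth, the two parameterizations span the same column space and so give identical fitted values; only the coefficient interpretations differ. A minimal numpy sketch (toy data, illustrative names):

```python
import numpy as np

# Two equivalent parameterizations of a 3-level factor:
# (a) "cell means": all K dummies, no intercept
# (b) reference coding: intercept + K-1 dummies
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
y = np.array([1.0, 2.0, 3.0, 1.2, 1.8, 3.1, 0.9, 2.1])

one_hot = np.eye(3)[labels]
X_cell_means = one_hot                                            # K columns
X_reference = np.column_stack([np.ones(len(y)), one_hot[:, 1:]])  # 1 + (K-1)

beta_cm, *_ = np.linalg.lstsq(X_cell_means, y, rcond=None)
beta_ref, *_ = np.linalg.lstsq(X_reference, y, rcond=None)

# Same fitted values from either design matrix.
fit_cm = X_cell_means @ beta_cm
fit_ref = X_reference @ beta_ref
```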
My impression is that they use one-hot encoding “by default”. When is it appropriate and when is it not?
I read that it might be acceptable when you do feature selection. Does this mean that if, for example, I use the elastic net, I can (HAVE TO?) use one-hot encoding?
Is there a reference article/book that talks about this?
Thanks a lot,
It doesn’t have to do with feature selection, but it does have to do with regularization. Most ML models are regularized in some way.
You can do this even if there is an intercept term, provided you use Bayesian or L1/L2-regularized regression. You don’t have to do this when using the elastic net; it’s the other way around: if you do this, you basically need some sort of regularization. The multicollinearity caused by the “overcomplete” one-hot representation then becomes a non-issue.
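A minimal numpy sketch of why the penalty fixes the collinearity: with an intercept plus all K dummies, the normal-equations matrix X'X is singular (the dummies sum to the intercept column), but adding an L2 penalty makes it invertible (toy data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

# Full one-hot design for a 3-level factor, plus an intercept column:
# the three dummy columns sum to the intercept, so the design is rank-deficient.
labels = rng.integers(0, 3, size=100)
one_hot = np.eye(3)[labels]
X = np.column_stack([np.ones(100), one_hot])
y = rng.normal(size=100)

XtX = X.T @ X
rank_ols = np.linalg.matrix_rank(XtX)          # 3 < 4: OLS has no unique solution

# Ridge (L2) penalty: X'X + lam*I is positive definite, hence full rank.
lam = 1.0
rank_ridge = np.linalg.matrix_rank(XtX + lam * np.eye(4))  # 4

beta_ridge = np.linalg.solve(XtX + lam * np.eye(4), X.T @ y)
```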
Oftentimes, for example in NLP, the dimension of the word space (words are categories) is so high that the model needs to be regularized regardless, so you may as well do this. In neural networks, or even GLMs (a neural network with a single output layer and no hidden layers is a GLM), trained without a penalty but with SGD (stochastic gradient descent), stopping early before reaching the minimum also amounts to some degree of regularization.
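The early-stopping point can be seen directly: plain gradient descent on the overcomplete design, started from zero and stopped after a fixed number of steps, yields finite coefficients and well-determined predictions even though the unpenalized least-squares solution is not unique. A toy sketch (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Overcomplete one-hot design with intercept (rank-deficient, as before).
labels = rng.integers(0, 3, size=200)
X = np.column_stack([np.ones(200), np.eye(3)[labels]])
y = 1.0 + np.where(labels == 0, 2.0, -1.0) + rng.normal(scale=0.1, size=200)

# Plain gradient descent on squared error, no explicit penalty.
# Starting from zero and stopping after a fixed number of steps acts as
# implicit regularization: the iterates stay near the minimum-norm solution.
beta = np.zeros(4)
lr = 0.1
for step in range(300):  # early stopping: a fixed, finite budget of steps
    grad = X.T @ (X @ beta - y) / len(y)
    beta -= lr * grad

# Predictions are well determined even though beta itself is not unique.
pred_err = np.mean((X @ beta - y) ** 2)
```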
So essentially, it’s fine to do this when the model is regularized. Bayesians sometimes do this too, so it’s not just ML.
The advantages are clearer when there is a large number of categories: you don’t need to pick a reference level, nor do you need to drop the intercept. That said, with a high-cardinality categorical variable, or just many categorical variables, people would usually go for an algorithm like CatBoost anyway.