Regularization without standardization

The typical advice when using common regularization techniques like the lasso or ridge regression is to standardize the predictors before fitting the model, since the penalties are a function of the size of the coefficients, which in turn depends on the scale of the data. Yet I prefer not to standardize predictors, because I don't want my estimates to depend on the mean and standard deviation of the predictor, which are themselves dependent on the sample I use to fit the model. If I use Bayesian methods instead, there again experts like Gelman et al. suggest standardization so that default priors can be used when there isn't much prior information available. What is a workable solution for folks like me who eschew standardizing predictors but who must also navigate the curse of dimensionality?
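
To make the scale dependence concrete, here is a minimal sketch (illustrative only; the data, penalty value, and use of scikit-learn's Lasso are my own choices, not from any referenced source) of how the same lasso penalty treats the same predictor differently depending only on its units:

```python
# A minimal sketch: with a fixed lasso penalty, merely changing a
# predictor's units changes how strongly its coefficient is penalized,
# which is why the scaling question matters at all.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500
x_cm = rng.normal(170.0, 10.0, n)            # predictor in centimetres
y = 0.02 * x_cm + rng.normal(0.0, 1.0, n)    # true slope: 0.02 per cm (= 2.0 per metre)
x_m = x_cm / 100.0                           # identical predictor, now in metres

for units, x in [("cm", x_cm), ("m", x_m)]:
    fit = Lasso(alpha=0.1).fit(x.reshape(-1, 1), y)
    print(f"x in {units}: lasso coefficient = {fit.coef_[0]:.3f}")

# With these data the per-cm coefficient survives nearly unshrunk,
# while the per-metre coefficient is shrunk all the way to zero by the
# same penalty: the effective amount of regularization depends on the units.
```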


This is a fundamental question that deserves much discussion. Normalization is a weak spot in penalized maximum likelihood estimation (ridge, lasso, elastic net, etc.) and in selecting priors for a number of differently scaled variables. In my view we all rush too quickly to scale by the standard deviation, which is (1) not driven by subject matter knowledge and (2) assumes that the variable's distribution is symmetric. One could just as well argue for scaling by Gini's mean difference or the inter-decile range. It is fairly clear that some form of scaling is needed, but what is best practice in choosing the approach to scaling?
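
As a rough sketch of what those alternatives look like (the code below is only an illustration; the data and function names are invented for the example), Gini's mean difference and the inter-decile range can be computed directly and used as the divisor in place of the SD:

```python
# Alternative scale measures that could replace the SD when rescaling
# a predictor before penalization.
import numpy as np

def gini_mean_difference(x):
    """Mean absolute difference over all pairs of observations."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    return 2.0 * np.sum((2 * i - n - 1) * x) / (n * (n - 1))

def interdecile_range(x):
    """Difference between the 90th and 10th percentiles."""
    return np.quantile(x, 0.9) - np.quantile(x, 0.1)

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # a skewed predictor

print("SD:                  ", np.std(x, ddof=1))
print("Gini mean difference:", gini_mean_difference(x))
print("Inter-decile range:  ", interdecile_range(x))

# Any of these could serve as the divisor: x_scaled = x / chosen_scale.
# For a skewed variable the three choices can differ substantially,
# so the penalty's effective strength differs with each.
```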


I found, but have not yet read, this recent paper on the topic in Bayesian Analysis: https://projecteuclid.org/download/pdfview_1/euclid.ba/1484103680

Sorry for digging up this old topic.

I recently encountered this problem in my Bayesian model, so I think this is a good place to ask (I also asked on Cross Validated (stackexchange.com): "Scaling factor for binary and quadratic terms in penalized regression").

I agree that scaling in penalised regression is a must; without it the fit fails badly. However, creating a “fair” situation among different covariates is quite tricky.

I don’t think automatic scaling by the standard deviation makes very good sense, especially when the data distribution is clearly asymmetric. One obvious case is the uncontrollable inflation of coefficients for rare events (as mentioned in the StackExchange question). Gelman, in his book (Data Analysis Using Regression and Multilevel/Hierarchical Models, 2007), scaled continuous variables down by 2 SD while keeping binary ones intact; that also involves some arbitrariness (e.g., why not center the binary variable to [-0.5, 0.5] and scale it up by 2 to give a full range of [-1, 1], matching the roughly 95% inter-percentile range of the aforementioned 2 SD-scaled continuous variables?).
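
To make the two conventions concrete, here is a small sketch (my own reading of the options above, with invented data; not code from Gelman & Hill) contrasting the 2 SD scaling of a continuous predictor with the two ways of handling a binary predictor:

```python
# Contrasting "divide continuous predictors by 2 SD, leave binaries as 0/1"
# with the alternative of recoding a binary predictor to span [-1, 1].
import numpy as np

def scale_2sd(x):
    """Center a continuous predictor and divide by two standard deviations."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2.0 * x.std(ddof=1))

def binary_keep(z):
    """Leave a 0/1 predictor as it is (the convention described above)."""
    return np.asarray(z, dtype=float)

def binary_fullrange(z):
    """The alternative raised above: center 0/1 to [-0.5, 0.5] and double it,
    so the variable spans [-1, 1], roughly the central ~95% range of a
    continuous predictor scaled by 2 SD."""
    return 2.0 * (np.asarray(z, dtype=float) - 0.5)

rng = np.random.default_rng(2)
x = rng.normal(50.0, 12.0, 1000)     # continuous predictor
z = rng.binomial(1, 0.5, 1000)       # binary predictor

xs, zk, zf = scale_2sd(x), binary_keep(z), binary_fullrange(z)
print(f"2 SD-scaled continuous: range [{xs.min():.2f}, {xs.max():.2f}], sd={xs.std(ddof=1):.2f}")
print(f"binary kept as 0/1:     range [{zk.min():.0f}, {zk.max():.0f}], sd={zk.std(ddof=1):.2f}")
print(f"binary recoded:         range [{zf.min():.0f}, {zf.max():.0f}], sd={zf.std(ddof=1):.2f}")
```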

I wonder whether there have been any developments in this area since then?

For this general question you’re likely to get more replies from stats.stackexchange.com. This issue is the bugaboo of penalized maximum likelihood estimation and Bayesian modeling. The scaling, as you said so well, is arbitrary, and once you have nonlinear terms in the model it’s even more difficult. Some progress may be made by considering other basis functions. Look at how Richard McElreath handles this in his book Statistical Rethinking and how the R brms package handles nonlinear covariate effects.

I really need to get the physical copy of Statistical Rethinking, because I often have to reference it, and it’s difficult to do in the digital copy. Time to buy a book.