The typical advice when using common regularization techniques like LASSO or ridge is to standardize the predictors before fitting the model, since the penalties are a function of the size of the coefficients, which in turn depends on the scale of the data. Yet I don’t like to standardize predictors, because I don’t want my estimates to be confounded by the mean and standard deviation of the predictor, both of which depend on the sample I use to fit the model. If I use Bayesian methods instead, experts such as Gelman et al. again suggest standardization, to make default priors possible when there isn’t much prior information available. What is a workable solution for folks like me who eschew standardizing predictors, but who also must navigate the curse of dimensionality?
This is a fundamental question that deserves much discussion. Normalization is a weak spot in penalized maximum likelihood estimation (ridge, lasso, elastic net, etc.) and in selecting priors for a number of differently-scaled variables. In my view we all rush too quickly to scale by the standard deviation, which (1) is not driven by subject matter knowledge and (2) assumes that the variable’s distribution is symmetric. One could just as well argue for scaling by Gini’s mean difference or the inter-decile range. It is fairly clear that some form of scaling needs to be used, but what is best practice in choosing the approach to scaling?
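To make the alternatives concrete, here is a minimal sketch of the three scale measures mentioned above, applied to a simulated right-skewed predictor. It is an illustration, not a recommendation. The function names (`gini_mean_difference`, `interdecile_range`, `rescale`) and the lognormal sample are my own assumptions, and `rescale` only divides by the chosen measure without centering.

```python
import random
import statistics


def gini_mean_difference(x):
    """Mean absolute difference over all pairs of distinct observations."""
    xs = sorted(x)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    # After sorting, the k-th order statistic (1-based) enters the sum of
    # pairwise differences with weight (2k - n - 1).
    total = sum((2 * k - n - 1) * v for k, v in enumerate(xs, start=1))
    return 2.0 * total / (n * (n - 1))


def interdecile_range(x):
    """Distance between the 10th and 90th percentiles."""
    deciles = statistics.quantiles(x, n=10, method="inclusive")
    return deciles[-1] - deciles[0]


def rescale(x, scale_fn):
    """Divide a predictor by a chosen scale measure (hypothetical helper, no centering)."""
    s = scale_fn(x)
    return [v / s for v in x]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed example data: a right-skewed (lognormal) predictor.
    x = [rng.lognormvariate(0.0, 1.0) for _ in range(1000)]

    measures = {
        "standard deviation": statistics.stdev,
        "Gini's mean difference": gini_mean_difference,
        "inter-decile range": interdecile_range,
    }
    for name, fn in measures.items():
        print(f"{name}: {fn(x):.3f}")
```

All three measures are scale-equivariant, so any of them removes the dependence on units, but on skewed data they can differ noticeably in how much weight the tail gets. Which one gives the most sensible penalty or prior is exactly the open question.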
I found but have not yet read this recent paper on the topic in Bayesian Analysis: https://projecteuclid.org/download/pdfview_1/euclid.ba/1484103680