Sander Greenland and colleagues do a good job of explaining how standardizing predictors leads to confounded estimates of both correlational and causal effects, because standardization depends on the eccentricities of the training data. I wonder whether data reduction methods (e.g., principal components regression) have the same issue for similar reasons, and if so, why.
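To make the sample-dependence point concrete, here is a minimal Python sketch on simulated data. The names `true_slope` and `slopes` are made up for illustration and do not come from the Greenland paper. The same per-unit effect is estimated in two samples that differ only in the spread of the predictor. The raw slope should agree across samples, while the per-SD (standardized) slope scales with whatever standard deviation the sample happened to have.

```python
import numpy as np

rng = np.random.default_rng(1)
true_slope = 2.0  # hypothetical effect per unit of x, identical in both samples


def slopes(x_sd, n=2000):
    """Return (raw slope, slope per sample SD of x) for one simulated sample."""
    x = rng.normal(0.0, x_sd, size=n)
    y = true_slope * x + rng.normal(0.0, 1.0, size=n)
    raw = np.polyfit(x, y, 1)[0]          # slope per unit of x
    z = (x - x.mean()) / x.std(ddof=1)    # standardize using this sample's SD
    per_sd = np.polyfit(z, y, 1)[0]       # slope per sample SD of x
    return raw, per_sd


raw_narrow, std_narrow = slopes(x_sd=1.0)  # predictor with little spread
raw_wide, std_wide = slopes(x_sd=3.0)      # same effect, wider spread

print(f"raw slopes:          {raw_narrow:.2f} vs {raw_wide:.2f}")
print(f"standardized slopes: {std_narrow:.2f} vs {std_wide:.2f}")
```

The standardized coefficient is roughly the raw slope times the sample SD of the predictor, so it changes with the sampling design even though the underlying effect does not.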
I hope that several others respond, as this question has not been discussed in the literature to my knowledge. My first reaction is that unsupervised learning (data reduction ignoring Y) leads to less confounding than (arbitrarily parsimonious) feature selection, but there still may be confounding. A predictor may be a confounder in a different way than how it relates to other predictors (say in a principal components setting). This speaks to having an overall strategy that makes the right compromises. For example, one might do direct covariate adjustment for the pre-specified top 5 predictors (and splines of them, when continuous) and also adjust for the first 10 principal components of 50 other predictors thought pre-study to be possible confounders.
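A rough sketch of what such a strategy could look like, in Python with NumPy only. Everything here is an assumption for illustration: the data are simulated, `exposure` is a hypothetical binary treatment, and a simple linear spline (hinge terms at tertiles) stands in for whatever spline basis one would actually pre-specify. The principal components are computed without ever looking at the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (all hypothetical): 5 pre-specified predictors,
# 50 other candidate confounders, and a binary exposure of interest.
X_main = rng.normal(size=(n, 5))
X_other = rng.normal(size=(n, 50))
exposure = rng.binomial(1, 0.5, size=n).astype(float)
y = (0.5 * exposure + X_main @ rng.normal(size=5)
     + 0.1 * X_other.sum(axis=1) + rng.normal(size=n))


def linear_spline_basis(x, knots):
    """Return x plus hinge terms max(x - k, 0), one per knot."""
    return np.column_stack([x] + [np.maximum(x - k, 0.0) for k in knots])


def first_pcs(X, k):
    """Scores on the first k principal components (centered, not rescaled)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


# Direct adjustment: spline-expand each pre-specified predictor at its tertiles.
spline_blocks = [
    linear_spline_basis(X_main[:, j], np.quantile(X_main[:, j], [1 / 3, 2 / 3]))
    for j in range(X_main.shape[1])
]

# Data reduction: first 10 PCs of the 50 other predictors, ignoring y.
pcs = first_pcs(X_other, 10)

# Intercept + exposure + 5 * 3 spline columns + 10 PCs = 27 columns.
design = np.column_stack([np.ones(n), exposure] + spline_blocks + [pcs])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
exposure_effect = coef[1]
print(f"adjusted exposure coefficient: {exposure_effect:.2f}")
```

The point of the layout is that the number of parameters (27 here) is fixed before seeing the outcome. Whether 10 components capture the confounding carried by the 50 variables is exactly the open question raised above.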
I have the same impression: adjustment for principal components may result in residual confounding. I have never used principal components, but my impression is that it is a procedure based solely (mostly?) on statistical grounds. If that is correct, confounding variables and colliders may be combined in one of the principal components. Thus, in addition to incomplete control of confounding, adjusting for principal components could result in selection bias. On the other hand, having tens of potential confounding variables, in scenarios where our interest is to estimate causal effects, seems unlikely to me. In those cases, selecting variables on substantive grounds (e.g. using the backdoor criterion) should keep the number of conditioning variables needed to estimate the causal effect manageable.
Could you link to the Greenland paper on standardization?
@f2harrell, based on what you’ve said, do you have any thoughts on supervised dimensionality reduction for a binary target? Supervised PCA and variations thereof (Bair’s, sparse, kernel supervised) can be quite powerful as pre-processing to a binary classifier. Some also use/consider Fisher’s Linear Discriminant (or Linear Discriminant Analysis as a special case) for dimension reduction as well. Thoughts?
No, Fisher’s linear discriminant analysis is not dimensionality reduction; it is fully supervised. Using it would be double dipping. And I don’t have any knowledge of semi-supervised methods but am usually suspicious of overfitting.
The supervised PCA (and kernel supervised PCA) methods I am referring to (Barshan, Ghodsi et al.) are fully supervised, not semi-supervised. And I qualified the use of FLD/LDA with “some also”, as in: I don’t use FLD/LDA for that. By double dipping, I assume you are referring to egregious and obvious mistakes (which I don’t make). It is a mistake to transform data before making separate training/validation/test sets and/or bootstraps and cross-validation folds. But I don’t do that. Supervised transformations (used properly) are the same as learning a model; they just happen to be separate from the main model rather than embedded into it. So, please let me know if there is anything else you are referring to.
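For readers following along, the distinction between fitting a supervised transformation before splitting and fitting it inside each fold can be sketched as below. This is a toy illustration in NumPy, not anyone's actual pipeline: `fit_supervised_pc` is a hypothetical, Bair-style simplification (screen features by correlation with the label, then take the first principal component of the survivors), and the data are pure noise, so any accuracy well above chance has to come from leakage.

```python
import numpy as np


def fit_supervised_pc(X, y, n_keep=20):
    """Keep the n_keep features most correlated with y; return their first PC."""
    mu = X.mean(axis=0)
    Xc = X - mu
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    keep = np.argsort(corr)[-n_keep:]
    _, _, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    return mu, keep, Vt[0]


def project(X, mu, keep, direction):
    """One-dimensional score of X on the learned direction."""
    return (X[:, keep] - mu[keep]) @ direction


def cv_accuracy(X, y, n_folds, fit_inside_fold, rng):
    """Cross-validated accuracy of a nearest-class-mean rule on the score."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    if not fit_inside_fold:
        params = fit_supervised_pc(X, y)  # leak: sees every label, test folds too
    correct = 0
    for test in folds:
        train = np.setdiff1d(idx, test)
        if fit_inside_fold:
            params = fit_supervised_pc(X[train], y[train])  # training data only
        z_train = project(X[train], *params)
        z_test = project(X[test], *params)
        m0 = z_train[y[train] == 0].mean()
        m1 = z_train[y[train] == 1].mean()
        pred = (np.abs(z_test - m1) < np.abs(z_test - m0)).astype(int)
        correct += int((pred == y[test]).sum())
    return correct / len(y)


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))   # pure noise features
y = np.repeat([0, 1], 30)         # labels unrelated to X

leaky = cv_accuracy(X, y, 5, fit_inside_fold=False, rng=rng)
honest = cv_accuracy(X, y, 5, fit_inside_fold=True, rng=rng)
print(f"transform fit on all data:  {leaky:.2f}")
print(f"transform fit inside folds: {honest:.2f}")
```

On noise like this, the estimate with the transform fit on all data would be expected to come out well above chance, while the version refit inside each fold should stay near 50%. Treating the supervised transformation as part of the model, and resampling the whole thing, is what keeps the validation honest.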
We need to be clear on the definition of ‘dimensionality reduction’ and I take it to be the same as ‘data reduction’ which is strictly unsupervised learning.
Thanks @f2harrell. I needed to understand the concerns precisely.