Penalized Likelihood vs. Variance Decomposition Approaches

Lohmann, Groenwold, and van Smeden have just published the kind of paper I have long been waiting for: Comparison of likelihood penalization and variance decomposition approaches for clinical prediction models: A simulation study. Comparative studies of commonly used model-building strategies in binary logistic regression are extremely valuable, and this research sets a great example with regard to choice of analytical approaches, researcher-unbiasedness of assessments of predictive performance, variety of data configurations examined, meta-simulation analysis, reproducible research, and open access.

In terms of the performance measures evaluated in the paper, I give the most emphasis to the root mean squared estimation error, which is the square root of the average (over observations) squared difference between predicted and true probabilities. I do not believe the c-index should have been emphasized, because it is not very sensitive for comparing models, despite its usefulness for quantifying pure predictive discrimination for a single model.
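For concreteness, here is a minimal sketch of that measure; the vectors below are made up, whereas in the paper's setting the true probabilities come from the simulation's data-generating model:

```r
# Root mean squared estimation error against the true probabilities:
# square root of the average squared difference between predicted and true risks.
rmse_prob <- function(p_true, p_hat) {
  sqrt(mean((p_hat - p_true) ^ 2))
}

# Hypothetical example: true risks and predictions from a model whose
# coefficients are shrunk toward zero (logit slope 0.8)
set.seed(1)
p_true <- plogis(rnorm(1000, sd = 1.5))
p_hat  <- plogis(qlogis(p_true) * 0.8)
rmse_prob(p_true, p_hat)
```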

I have long believed that data reduction (unsupervised learning) approaches such as incomplete principal components analysis (PCA) are good defaults because of their wide applicability across all types of outcome variables and because they also work well with longitudinal data. PCA has a major advantage over the lasso and elastic net in that it does not mislead analysts about the ability to stably select predictors, especially when collinearities are present. So the paper worries me a bit about reliance on data reduction methods.

If someone asked me which data reduction technique most deserves inclusion in a simulation study, it would be sparse PCA, because it essentially does variable clustering while doing PCA, using a lasso penalty to make the PCs sparse and more interpretable, and it handles collinearities better. Would the authors consider running more simulations to study the performance of sparse PCA?
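For illustration only (not the authors' code), a sparse PCA step might look roughly like the following, using the elasticnet package's spca(); the predictor data frame is hypothetical and the number of components and sparsity settings are arbitrary, so treat this as a sketch of the idea rather than a recommended configuration:

```r
library(elasticnet)   # Zou & Hastie's sparse PCA implementation

# Hypothetical predictor data frame expanded to a numeric design matrix
X <- scale(model.matrix(~ . - 1, data = predictors))

# Sparse PCA: a lasso penalty on the loadings makes each component depend on
# only a subset of predictors, giving variable-clustering-like behavior.
sp <- spca(X, K = 5, type = "predictor", sparse = "varnum",
           para = rep(4, 5))          # ~4 nonzero loadings per component (arbitrary)

sp$loadings                           # sparse loading matrix
scores <- X %*% sp$loadings           # sparse PC scores to feed into the regression model
```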

The paper mentions several PCA strategies, but only two were included in the simulations. I would have emphasized the strategy that applies a cutoff for the proportion of variance explained and then includes the first k components in a standard maximum likelihood estimation step. How did the performance of the reported PCA strategies differ from that of the ones not reported?
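A minimal sketch of that strategy, with hypothetical X (candidate predictor matrix), y (binary outcome), and an arbitrary 80% cutoff:

```r
library(rms)

# Incomplete PCA: keep the first k standardized PCs by a cumulative
# variance-explained cutoff, then fit them by ordinary maximum likelihood.
pc  <- prcomp(X, scale. = TRUE)
cum <- cumsum(pc$sdev^2) / sum(pc$sdev^2)   # cumulative proportion of variance explained
k   <- which(cum >= 0.80)[1]                # e.g., an 80% cutoff

dat <- data.frame(y = y, pc$x)
fit <- lrm(reformulate(colnames(pc$x)[1:k], response = "y"), data = dat)
```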

One of the two PCA strategies reported was based on the use of AIC to select the number of components. Did the selection properly force the components to be used in order of decreasing variance explained?
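What I have in mind is a strictly nested sequence, roughly like this sketch (X and y are hypothetical, as above):

```r
library(rms)

pc  <- prcomp(X, scale. = TRUE)
dat <- data.frame(y = y, pc$x)     # PC scores already ordered PC1, PC2, ...

# Force the components in by decreasing variance explained: candidate model k
# uses PC1..PCk only (no subset search over PCs); choose k minimizing AIC.
aics <- sapply(seq_len(ncol(pc$x)), function(k) {
  AIC(lrm(reformulate(colnames(pc$x)[1:k], response = "y"), data = dat))
})
k_best <- which.min(aics)
```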

I’m not clear on why the relaxed lasso was mentioned, because it does not have face validity with regard to calibration-in-the-small: the carefully controlled shrinkage in the initial ordinary lasso feature-selection step is not inherited by the second estimation step.
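For readers unfamiliar with the procedure, the two-step structure I am referring to looks roughly like the following glmnet sketch (x and y hypothetical; I may be misremembering details of the relaxed-fit interface):

```r
library(glmnet)

# Relaxed lasso: step 1 selects variables along an ordinary lasso path;
# step 2 refits the selected variables with reduced (gamma < 1) or no
# (gamma = 0) penalization, so the step-1 shrinkage is not carried forward.
cvfit <- cv.glmnet(x, y, family = "binomial", relax = TRUE)

# gamma = 0 corresponds to an unpenalized refit on the lasso-selected set
coef(cvfit, s = "lambda.min", gamma = 0)
```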

Data reduction and shrinkage are most necessary when the number of candidate predictors is large in relation to the effective sample size 3np(1-p), where p is the proportion of observations with Y=1 (see this). Can the authors provide a graph depicting the root mean squared estimation error (computed against the true probabilities) with something like effective sample size divided by the number of candidate features on the x-axis?
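A quick sketch of the quantity I have in mind, with made-up numbers:

```r
# Effective sample size for binary Y: 3 * n * p * (1 - p), where p = Pr(Y = 1).
# Hypothetical example: n = 2000 observations, 15% event rate, 40 candidate predictors.
n <- 2000; p <- 0.15; n_features <- 40
ess <- 3 * n * p * (1 - p)   # = 765
ess / n_features             # about 19 effective observations per candidate predictor
```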

Related to using AIC to select the first k PCs, it would be interesting to see how well that selection of k works in relation to the effective sample size.

Thanks so much to the authors for writing a truly great paper and for considering these requests to provide supplemental information.


Hi,
I’ve been working with a Cox PH model, comparing the results of the full model with those obtained after PCA.

With regard to your comment, I tried to reproduce the same approach with my model/data and to plot results equivalent to those shown in Fig. 8.5 of the RMS book.

In my case, I have 25 covariates (6 of them categorical).

Question: I am facing a problem with the results I get when comparing the AIC of the full model (with restricted cubic splines on the original covariates) against the PC models built from the transcan()-transformed covariates.
Generally, I understand that the more PCs are used in the model, the better (lower) the AIC, and that we should be able to improve on the AIC with fewer components than the number of original covariates.
My problem is that I do not get any model (fitted with the transformed PCs) with a better AIC than the full model with the original covariates (rcs means that the model uses restricted cubic splines).
[Figure: AIC of the PC models by number of components, compared with the full rcs model]
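For reference, my comparison looks roughly like the sketch below (variable names such as age, bp, sex, stage, time, and event are placeholders, and I am following the transcan()/PCA approach from the RMS notes):

```r
library(rms)

# Full model: restricted cubic splines on continuous covariates,
# categorical covariates entered directly (placeholder names).
full <- cph(Surv(time, event) ~ rcs(age, 4) + rcs(bp, 4) + sex + stage, data = d)

# transcan() transformations, then PCs of the transformed covariates
trans <- transcan(~ age + bp + sex + stage, data = d,
                  transformed = TRUE, imputed = FALSE, pl = FALSE)
pc   <- prcomp(trans$transformed, scale. = TRUE)
d_pc <- data.frame(d[c("time", "event")], pc$x)

# AIC of Cox models on the first k PCs, compared with the full model
aic_pc <- sapply(seq_len(ncol(pc$x)), function(k) {
  f <- as.formula(paste("Surv(time, event) ~",
                        paste(colnames(pc$x)[1:k], collapse = " + ")))
  AIC(cph(f, data = d_pc))
})
c(full = AIC(full), best_pc = min(aic_pc))
```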

Could a possible explanation be that by transforming the covariates we lose some of the information contained in the categorical ones when we convert them into dummy variables?

I also make this suggestion because PCA applied to the raw variables can reach the AIC obtained with the full model after a certain number of components.

The simplest explanation (if true) is that the most flexible model works best and you have a very large effective sample size (very large number of events).