RMS Case Study in Data Reduction

f2harrell · May 9, 2024, 12:20pm

Regression Modeling Strategies: Case Study in Data Reduction

This is the eighth of several connected topics organized around chapters in Regression Modeling Strategies. The purposes of these topics are to introduce key concepts in the chapter and to provide a place for questions, answers, and discussion around the chapter’s topics.

Overview | Course Notes

Additional links

RMS8

Marc_Vila_Forteza · May 9, 2024, 12:39pm

I have a question, which I do not know if it’s trivial or not, but I do need to clarify it and I would be very grateful if someone could help me.

I want to fit a Cox PH model with a set of covariates; some are continuous and others categorical (3 are binary, coded 0/1, and 2 have more than 2 categories).

One of the assumptions of CoxPH regressions is linearity of the covariates vs log-hazard ratio.

My questions are:

1- If the continuous covariates are linear vs log-hazard ratio (e.g. raw-linear or splined to achieve linearity), will the PCs obtained by fitting a PCA preserve the linearity vs log-hazard ratio?

2- If we include the categorical covariates in the PCA, would this alter the linearity condition of the PCs? I assume that binary covariates may preserve it (if condition 1 is TRUE), as they are linear vs the outcome (only 2 possible values), but problems could arise when including categorical covariates with more than 2 categories.

With regard to linearity in PCA, there is an interesting discussion here: https://stats.stackexchange.com/questions/290750/linearity-of-pca/646786#646786

Thank you so much for your help.

f2harrell · May 9, 2024, 6:40pm

Good questions. The first one is easier: linearity is preserved; it’s just that the slope will be attenuated due to the restrictions imposed by using a subset of the PCs rather than all possible PCs. Regarding the second question, regular PCA handles categorical variables only approximately, by expanding k levels into k-1 indicator variables. Linearity is not really an issue there; the issue is whether the categorical variables are scored in a reasonable fashion when using a method that was developed for multivariate continuous variables.
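
The first point can be checked numerically. The following is a minimal sketch on simulated data, not something taken from the thread: the variable names, the effect sizes, and the use of base R `prcomp` together with `survival::coxph` are all assumptions made for illustration. It shows that a Cox model on all PCs reproduces the linear predictor of the model on the raw covariates, while a model on a subset of PCs is a restricted version of that model.

```r
# Illustrative sketch on simulated data; every name below is hypothetical.
library(survival)

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)          # correlated with x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

# The true log hazard is linear in the covariates
lp_true <- 0.8 * x1 - 0.5 * x2 + 0.3 * x3
time    <- rexp(n, rate = exp(lp_true))
status  <- rep(1, n)               # no censoring, to keep the sketch short

# Principal component scores (n x 3 matrix) of the standardized covariates
pcs <- prcomp(X, center = TRUE, scale. = TRUE)$x

fit_x   <- coxph(Surv(time, status) ~ X)           # raw covariates
fit_all <- coxph(Surv(time, status) ~ pcs)         # all 3 PCs
fit_sub <- coxph(Surv(time, status) ~ pcs[, 1:2])  # first 2 PCs only

# All PCs span the same space as X, so the linear predictors agree
max_diff <- max(abs(fit_x$linear.predictors - fit_all$linear.predictors))

# Maximized partial log-likelihoods: the PC subset cannot exceed the full fit
ll <- c(raw     = fit_x$loglik[2],
        all_pcs = fit_all$loglik[2],
        two_pcs = fit_sub$loglik[2])
print(max_diff)
print(ll)
```

Each PC is a linear combination of the (standardized) covariates, so any model that is linear in the PCs is also linear in the covariates; dropping PCs only constrains which linear combinations the fit can use.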

Marc_Vila_Forteza · May 10, 2024, 8:53am

Thank you professor for your answers. I have some remarks and additional issues to point out.

1.- Let’s see what happens if the dataset is transformed with restricted cubic splines (to relax the linearity restrictions). If we later want to compute a PCA, should we force the algorithm to keep all the terms (ax + bx^2 + cx^3 …) of each spline together in the principal components? In other words, my concern is that some PCs may use only some terms of a splined covariate, so that its effect is split across different PCs. If this happens, we may lose linearity later when fitting the new model with the required PCs.

2.- When dealing with categorical variables, I understand that you mean to ‘weight’ each level of the categorical variable so that their impact is better accounted for when computing the PCs?

3.- One idea would be to perform PCA only on the continuous covariates and leave the categorical ones out of this analysis. Later, fit the model with the required PCs and all the categorical covariates. This increases the d.f., but it may be worth it.

4.- Another issue that could arise later is proportionality of hazards; I do have concerns about this possibility. Do you have any thoughts about this?

Thank you!
Marc
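
Points 3 and 4 above can be sketched in a few lines. This is only an illustration on simulated data, under several assumptions: the data frame `d`, its column names, and the choice of two PCs are hypothetical, and base R `prcomp` stands in for whatever PCA routine is actually used. The PCA is run on the continuous covariates only, the categorical covariates enter the Cox model directly, and `survival::cox.zph` is then used to examine proportional hazards for each term.

```r
# Illustrative sketch on simulated data; all names and settings are hypothetical.
library(survival)

set.seed(2)
n <- 400
d <- data.frame(
  x1  = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  bin = rbinom(n, 1, 0.4)
)
d$x2 <- d$x2 + 0.8 * d$x1            # induce correlation among continuous covariates

# Simulated event and censoring times
lp       <- 0.5 * d$x1 + 0.4 * (d$grp == "c") + 0.3 * d$bin
t_event  <- rexp(n, rate = exp(lp))
t_cens   <- rexp(n, rate = 0.3)
d$time   <- pmin(t_event, t_cens)
d$status <- as.integer(t_event <= t_cens)

# PCA on the continuous covariates only
pc    <- prcomp(d[, c("x1", "x2", "x3", "x4")], center = TRUE, scale. = TRUE)
d$pc1 <- pc$x[, 1]
d$pc2 <- pc$x[, 2]

# A 3-level factor is expanded into 3 - 1 = 2 indicator columns
n_ind <- ncol(model.matrix(~ grp, data = d)) - 1

# Cox model: 2 PCs + 2 indicators for grp + 1 binary covariate = 5 d.f.
fit <- coxph(Surv(time, status) ~ pc1 + pc2 + grp + bin, data = d)

# Proportional hazards checks per term, plus a global test
zph <- cox.zph(fit)
print(zph)
```

The trade-off is the one described in point 3: the categorical covariates are not forced through a method designed for continuous variables, at the cost of the extra degrees of freedom they spend in the outcome model.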

f2harrell · May 10, 2024, 12:56pm

Instead of that, investigate nonlinear principal components as done in the case study for this chapter.

Marc_Vila_Forteza · May 16, 2024, 11:08am

Hi there,
Regarding this question, I have investigated a little more:

The idea is to compute a separate PCA for each splined covariate and take PC1 from each. Afterwards, we can compute the PCA for the whole set of variables, thus preserving the integrity of the information contained in each splined covariate.

This is how I think it would be possible to proceed:

PCA of splines

prin.trans.splines <- princmp(~ rcs(SDT,6) + rcs(YEAR,3) + TYPE3 + FLUID2 +
    DOUBLE.SUCTION + TEMP + rcs(DISCH.PRESS,5) + rcs(RPM,4) + rcs(POWER,4) +
    rcs(VIBRATIONS,5) + SEAL.ARRGT + SEAL.TYPE2 + BOTTOM + FLOW.RATIO +
    NPSH.MARGIN + rcs(DIN.VISCOSITY,3) + rcs(VAPOR.P,3) + rcs(TIP.SPEED,3) +
    rcs(RATIO.DIAMETER,4) + rcs(EFFICIENCY,3) + rcs(Ns,6) + rcs(STABLE,5) +
    rcs(LUBE,6),
  k=55, sw=TRUE, data=g_trans)

PCA of PCA of the splines

prin.trans.splines <- princmp(~ princmp(~ rcs(SDT,6), data=g_trans, k=1)$scores +
    princmp(~ rcs(YEAR,3), data=g_trans, k=1)$scores +
    TYPE3 + FLUID2 + DOUBLE.SUCTION + TEMP +
    princmp(~ rcs(DISCH.PRESS,5), data=g_trans, k=1)$scores +
    rcs(RPM,4) +
    princmp(~ rcs(POWER,4), data=g_trans, k=1)$scores +
    princmp(~ rcs(VIBRATIONS,5), data=g_trans, k=1)$scores +
    SEAL.ARRGT + SEAL.TYPE2 + BOTTOM + FLOW.RATIO + NPSH.MARGIN +
    princmp(~ rcs(DIN.VISCOSITY,3), data=g_trans, k=1)$scores +
    princmp(~ rcs(VAPOR.P,3), data=g_trans, k=1)$scores +
    princmp(~ rcs(TIP.SPEED,3), data=g_trans, k=1)$scores +
    princmp(~ rcs(RATIO.DIAMETER,4), data=g_trans, k=1)$scores +
    princmp(~ rcs(EFFICIENCY,3), data=g_trans, k=1)$scores +
    princmp(~ rcs(Ns,6), data=g_trans, k=1)$scores +
    princmp(~ rcs(STABLE,5), data=g_trans, k=1)$scores +
    princmp(~ rcs(LUBE,6), data=g_trans, k=1)$scores,
  k=55, sw=TRUE, data=g_trans)

Any thoughts about this proposal?

Thank you!
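
The two-stage idea can also be written as a small self-contained sketch, which may make it easier to experiment with. This is not the code from the post: it runs on simulated data, uses base R `prcomp` in place of `princmp`, and swaps in `splines::ns` (a natural cubic spline basis) for `rcs`, so knot placement differs. The helper `spline_pc1` and all variable names are hypothetical.

```r
# Illustrative sketch on simulated data; helper and variable names are hypothetical.
library(splines)

set.seed(3)
n <- 300
d <- data.frame(u = rnorm(n), v = runif(n), w = rexp(n))

# Stage 1 helper: expand x in a natural cubic spline basis with `df` columns
# (rcs(x, k) has k - 1 columns, so df = k - 1) and return the first
# principal component of that basis.
spline_pc1 <- function(x, df) {
  basis <- matrix(as.vector(ns(x, df = df)), nrow = length(x))
  prcomp(basis, center = TRUE, scale. = TRUE)$x[, 1]
}

stage1 <- data.frame(
  u = spline_pc1(d$u, df = 4),
  v = spline_pc1(d$v, df = 3),
  w = spline_pc1(d$w, df = 3)
)

# Stage 2: ordinary PCA across the per-covariate first components
stage2 <- prcomp(stage1, center = TRUE, scale. = TRUE)
print(summary(stage2))
```

One thing this sketch makes visible is that stage 1 fixes a single shape per covariate, chosen to maximize variance within that covariate's spline basis without reference to the other covariates or the outcome, so the second stage never sees the remaining spline terms.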

f2harrell · May 16, 2024, 9:54pm

I’m not sure. I think I’d rather use a more formal nonlinear PCA approach as with Hmisc::transcan.
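
For readers who want to try the `Hmisc::transcan` route, a minimal sketch follows. It runs on simulated data with hypothetical variable names and leaves most settings at their defaults, so it should be read as a starting point under those assumptions rather than as the case study's analysis. `transcan` estimates a transformation of each variable that makes it optimally predictable from the others, and the PCA is then run on the transformed variables.

```r
# Illustrative sketch on simulated data; variable names are hypothetical.
library(Hmisc)

set.seed(4)
n <- 300
d <- data.frame(
  x1  = rnorm(n),
  x2  = rexp(n),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
d$x3 <- d$x1^2 + rnorm(n, sd = 0.5)   # nonlinearly related to x1

# Estimate transformations; transformed = TRUE stores the transformed
# variables, and pl/pr = FALSE suppress plots and progress printing.
w <- transcan(~ x1 + x2 + x3 + grp, data = d,
              transformed = TRUE, pl = FALSE, pr = FALSE)

# PCA on the transformed variables (one column per original variable,
# including a single scored column for the factor)
pc <- prcomp(w$transformed, center = TRUE, scale. = TRUE)
print(summary(pc))
```

Because each variable, including the factor, is reduced to one transformed column before the PCA, the question of spline terms being split across components does not arise in this formulation.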