# Regression Modeling Strategies: Case Study in Data Reduction

This is the eighth of several connected topics organized around chapters in Regression Modeling Strategies. The purposes of these topics are to introduce key concepts in the chapter and to provide a place for questions, answers, and discussion around the chapter’s topics.

## Overview | Course Notes

RMS8

I have a question, which I do not know if it’s trivial or not, but I do need to clarify it and I would be very grateful if someone could help me.

I want to fit a Cox PH model with a set of covariates, some continuous and others categorical (3 are binary, coded 0/1, and 2 have more than 2 categories).

One of the assumptions of Cox PH regression is that the covariates are linearly related to the log hazard.

My questions are:

1- If the continuous covariates are linear vs. the log hazard (e.g. raw-linear or splined to achieve linearity), will the PCs obtained from a PCA preserve that linearity?

2- If we include the categorical covariates in the PCA, would this alter the linearity of the PCs? I assume that binary covariates preserve it (if condition 1 holds), since a variable with only 2 possible values is necessarily linear vs. the outcome, but problems could arise when including categorical covariates with more than 2 categories.

With regard to linearity in PCA, there is an interesting discussion here: https://stats.stackexchange.com/questions/290750/linearity-of-pca/646786#646786

Thank you so much for your help.


Good questions. The first one is easier: linearity is preserved; it’s just that the slope will be attenuated due to the restrictions imposed by using a subset of the PCs rather than all possible PCs. Regarding the second question, regular PCA only approximately handles categorical variables, and is based on expanding k levels into k-1 indicator variables. Linearity is not really an issue there; the issue is whether the categorical variables are scored in a reasonable fashion when using a method that was developed for multivariate continuous variables.
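The point about a subset of PCs can be illustrated with a small simulation. This is only a sketch on made-up data (the variables `x1`, `x2`, `x3` and all effect sizes are assumptions, not from the thread), and it uses base `prcomp` plus `survival::coxph` rather than the functions used in the chapter. Because PCs are linear combinations of the covariates, a Cox model on all PCs is the same model as one on the covariates, while a model on fewer PCs forces the implied covariate coefficients to follow the retained loadings.

```r
library(survival)  # recommended package distributed with R

set.seed(1)
n  <- 500
z  <- rnorm(n)
# Three simulated covariates; x1 and x2 are strongly correlated
X  <- cbind(x1 = z + rnorm(n, sd = 0.5),
            x2 = z + rnorm(n, sd = 0.5),
            x3 = rnorm(n))
Xs <- scale(X)                       # standardize so PCA uses the correlation matrix

# True log hazard is linear in the standardized covariates (assumed effects)
lp     <- drop(Xs %*% c(0.5, 0.5, 0.3))
t_ev   <- rexp(n, rate = exp(lp))    # event times
t_cens <- rexp(n, rate = 0.2)        # censoring times
y      <- Surv(pmin(t_ev, t_cens), t_ev <= t_cens)

pca <- prcomp(Xs)                    # PC scores are in pca$x
S   <- pca$x

fit_full <- coxph(y ~ Xs)            # original covariates
fit_all  <- coxph(y ~ S)             # all 3 PCs: same model, reparameterized
fit_one  <- coxph(y ~ S[, 1])        # first PC only: a restricted model

# Coefficients on the standardized covariates implied by the one-PC model:
# they are constrained to be proportional to the first PC's loadings
b_one <- pca$rotation[, 1] * coef(fit_one)

cbind(full = coef(fit_full), one_pc = b_one)
c(full = fit_full$loglik[2], all_pcs = fit_all$loglik[2], one_pc = fit_one$loglik[2])
```

The log-likelihoods of `fit_full` and `fit_all` should agree, and the one-PC model can only fit as well or worse; comparing the two coefficient columns shows how the restriction changes the implied slopes.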


1.- Let’s see what happens if the dataset is transformed with restricted cubic splines (to relax the linearity restrictions). If we then want to compute a PCA, should we force the algorithm to keep all the terms (ax + bx^2 + cx^3…) of each spline together in the principal components? In other words, my concern is that some PCs may use only some terms of a splined covariate, so that its effect is split across different PCs. If this happens, we may lose the linearity later when fitting the new model with the required PCs.
2.- When dealing with categorical variables, I understand that you mean to ‘weight’ each level of the categorical variable so that its impact is better accounted for when computing the PCs?

3.- One idea would be to perform the PCA on the continuous covariates only and leave the categorical ones out of this analysis. Later, fit the model with the required PCs and all the categorical covariates. This increases the d.f., but it may be worth it.

Thank you!
Marc
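The third idea above (PCA on the continuous covariates only, categorical covariates entered directly) could be sketched as follows. The data are simulated and the names `x1`..`x4`, `grp`, `pc1`, `pc2` are hypothetical; the sketch uses base `prcomp` and `survival::coxph` and makes no claim that this is the preferred approach.

```r
library(survival)  # recommended package distributed with R

set.seed(2)
n <- 400
z <- rnorm(n)
d <- data.frame(
  x1  = z + rnorm(n, sd = 0.5),
  x2  = z + rnorm(n, sd = 0.5),
  x3  = -z + rnorm(n, sd = 0.5),
  x4  = rnorm(n),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE))  # 3-level factor
)

# Simulated survival outcome (assumed effects, for illustration only)
lp       <- 0.4 * d$x1 + 0.4 * d$x2 - 0.3 * d$x3 + 0.5 * (d$grp == "c")
t_ev     <- rexp(n, rate = exp(lp))
t_cens   <- rexp(n, rate = 0.2)
d$time   <- pmin(t_ev, t_cens)
d$status <- as.integer(t_ev <= t_cens)

# PCA on the continuous covariates only
pca   <- prcomp(d[, c("x1", "x2", "x3", "x4")], center = TRUE, scale. = TRUE)
d$pc1 <- pca$x[, 1]
d$pc2 <- pca$x[, 2]

# Model: two PCs plus the factor, which coxph expands into k - 1 indicators
fit <- coxph(Surv(time, status) ~ pc1 + pc2 + grp, data = d)
length(coef(fit))   # 2 PCs + 2 indicators for the 3-level factor
```

The degrees of freedom spent are the number of retained PCs plus k-1 for each categorical covariate, which is the cost mentioned in point 3.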

Instead of that, investigate nonlinear principal components as done in the case study for this chapter.

Hi there,
Regarding this question, I have investigated a little further:

The idea is to compute a separate PCA for each splined covariate and take PC1 from each. Afterwards, we can compute the PCA for the whole set of variables, thus preserving the integrity of the information contained in each splined covariate.

This is how I think it would be possible to proceed:

PCA of splines

`prin.trans.splines <- princmp(~ rcs(SDT,6) + rcs(YEAR,3) + TYPE3 + FLUID2 + DOUBLE.SUCTION + TEMP + rcs(DISCH.PRESS,5) + rcs(RPM,4) + rcs(POWER,4) + rcs(VIBRATIONS,5) + SEAL.ARRGT + SEAL.TYPE2 + BOTTOM + FLOW.RATIO + NPSH.MARGIN + rcs(DIN.VISCOSITY,3) + rcs(VAPOR.P,3) + rcs(TIP.SPEED,3) + rcs(RATIO.DIAMETER,4) + rcs(EFFICIENCY,3) + rcs(Ns,6) + rcs(STABLE,5) + rcs(LUBE,6),k=55, sw=TRUE, data=g_trans)`

PCA of PCA of the splines

`prin.trans.splines <- princmp(~ princmp(~ rcs(SDT,6), data=g_trans, k=1)$scores + princmp(~ rcs(YEAR,3), data=g_trans, k=1)$scores + TYPE3 + FLUID2 + DOUBLE.SUCTION + TEMP + princmp(~ rcs(DISCH.PRESS,5), data=g_trans, k=1)$scores + rcs(RPM,4) + princmp(~ rcs(POWER,4), data=g_trans, k=1)$scores + princmp(~ rcs(VIBRATIONS,5), data=g_trans, k=1)$scores + SEAL.ARRGT + SEAL.TYPE2 + BOTTOM + FLOW.RATIO + NPSH.MARGIN + princmp(~ rcs(DIN.VISCOSITY,3), data=g_trans, k=1)$scores + princmp(~ rcs(VAPOR.P,3), data=g_trans, k=1)$scores + princmp(~ rcs(TIP.SPEED,3), data=g_trans, k=1)$scores + princmp(~ rcs(RATIO.DIAMETER,4), data=g_trans, k=1)$scores + princmp(~ rcs(EFFICIENCY,3), data=g_trans, k=1)$scores + princmp(~ rcs(Ns,6), data=g_trans, k=1)$scores + princmp(~ rcs(STABLE,5), data=g_trans, k=1)$scores + princmp(~ rcs(LUBE,6), data=g_trans, k=1)$scores, k=55, sw=TRUE, data=g_trans)`

I’m not sure. I think I’d rather use a more formal nonlinear PCA approach as with `Hmisc::transcan`.
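As a rough illustration of that suggestion, the sketch below runs `Hmisc::transcan` on simulated data and then computes ordinary principal components of the transformed variables. The data frame `d` and its columns are made up for the example, and base `prcomp` is swapped in for the PCA step to keep the sketch minimal; it is not the code from the chapter's case study.

```r
library(Hmisc)  # provides transcan

set.seed(3)
n <- 300
z <- rnorm(n)
d <- data.frame(
  x1  = z + rnorm(n, sd = 0.5),
  x2  = exp(z / 2 + rnorm(n, sd = 0.3)),   # nonlinearly related to x1
  x3  = z^2 + rnorm(n, sd = 0.5),          # non-monotonically related to x1
  grp = cut(z + rnorm(n), c(-Inf, -0.5, 0.5, Inf),
            labels = c("low", "mid", "high"))  # 3-level factor
)

# transcan transforms each variable (splines for continuous variables,
# scores for factor levels) so that it is well predicted by the others.
# transformed = TRUE stores the transformed values in the result.
tr <- transcan(~ x1 + x2 + x3 + grp, data = d,
               transformed = TRUE, pl = FALSE, pr = FALSE)

# One transformed column per original variable, so each variable enters
# the PCA as a single unit rather than as several spline terms
pca <- prcomp(tr$transformed, center = TRUE, scale. = TRUE)
summary(pca)
```

Since each variable contributes one transformed column, the concern about spline terms being split across different PCs does not arise in this setup, and the categorical variable is scored by the algorithm instead of entering as raw indicators.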