Trying variable clustering followed by PC1 scoring

After reading about multivariable modeling and data reduction, I wanted to try using variable clustering to model my high-dimensional data. As I understand it, the idea is to use varclus() to cluster the variables while masked to Y, then extract the first principal component (PC1) of each cluster, and then use a PO model to see how those PC1s relate to Y.

My main question is whether there is a threshold for explained variance that the PC1s need to pass. For example, say I get 10 variable clusters with different numbers of variables. When I run PCA on each cluster, I get some PC1s that I’m comfortable using because they explain 60-70% of the variance, but there are others that explain only 40-50%. What do I do then? Should I use more than one principal component for those clusters? Do I try a larger or smaller number of clusters until I get PC1s that explain, say, more than 50% of the variance? Or do I simply ignore it and use the PC1 regardless of how much of the cluster’s variance it explains?
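The workflow in question can be sketched in base R on simulated data. This is only an illustration under stated assumptions: the data, the variable names, and the three-cluster cut are made up; `hclust()` on 1 minus the squared Spearman correlation stands in for `varclus()`; and `MASS::polr()` stands in for whichever PO fitting function is actually used.

```r
# Simulated stand-in data: 200 subjects, 12 variables in 3 correlated blocks
set.seed(1)
n <- 200
latent <- matrix(rnorm(n * 3), n, 3)
x <- do.call(cbind, lapply(1:3, function(j)
  sapply(1:4, function(i) latent[, j] + rnorm(n, sd = 0.7))))
colnames(x) <- paste0("v", 1:12)

# Step 1: cluster the variables without looking at Y
rho2 <- cor(x, method = "spearman")^2
tree <- hclust(as.dist(1 - rho2), method = "complete")
cl   <- cutree(tree, k = 3)          # k = 3 is an arbitrary choice here

# Step 2: one PCA per cluster; keep PC1 and its share of the cluster variance
fits     <- lapply(split(colnames(x), cl),
                   function(v) prcomp(x[, v, drop = FALSE], scale. = TRUE))
pc1      <- sapply(fits, function(f) f$x[, 1])
var_expl <- sapply(fits, function(f) f$sdev[1]^2 / sum(f$sdev^2))
colnames(pc1) <- paste0("clus", colnames(pc1))

# Step 3: relate the PC1 scores to a simulated ordinal outcome with a PO model
y <- cut(latent[, 1] + rnorm(n), breaks = c(-Inf, -0.5, 0.5, Inf),
         labels = c("mild", "moderate", "severe"), ordered_result = TRUE)
d   <- data.frame(y = y, pc1)
fit <- MASS::polr(y ~ ., data = d, Hess = TRUE)

round(var_expl, 2)   # proportion of each cluster's variance carried by its PC1
coef(fit)
```

The `var_expl` vector is the per-cluster quantity the question is about; it is computed entirely without reference to `y`.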

Sparse PCA may be considered instead as it combines clustering and PCA.

I don’t think that a threshold on explained variation is the way to go. Rather, we may do a sample size calculation to estimate the number of variables p that can reliably be played against Y, then use the first p sparse or regular PCs.
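One way to make this concrete is sketched below. It assumes a common rule of thumb of roughly 15 effective observations per candidate parameter, with the effective sample size for an ordinal outcome taken as m = n - (sum of n_i^3) / n^2, where n_i are the category counts. The severity counts and the simulated predictor matrix are hypothetical, not taken from this thread, and the divisor 15 is a convention rather than a fixed constant.

```r
# Hypothetical severity counts for about 400 patients (not from the thread)
counts <- c(mild = 180, moderate = 140, severe = 80)
n <- sum(counts)

# Effective sample size for an ordinal outcome (assumed rule of thumb)
m <- n - sum(counts^3) / n^2

# Number of components that can reasonably be related to Y, at m / 15
p <- floor(m / 15)

# Simulated predictors standing in for the real data: 400 x 60
set.seed(3)
x <- matrix(rnorm(n * 60), n, 60)

# Keep the first p ordinary PCs; the choice of p never looked at Y's
# relationship with x, only at the marginal distribution of Y
scores <- prcomp(x, scale. = TRUE)$x[, seq_len(p), drop = FALSE]
dim(scores)
```

The same `p` would apply to sparse PCs; only the line that computes `scores` would change.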

Thank you for responding, Professor. I thought either approach was alright. Is sparse PCA always the better option? I’m sorry if asking about implementation is off-topic, but I ran into a problem when trying to perform sPCA on my data. I get a warning about linear dependencies:

Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax,  :
  471  linear dependencies found

I do get some output, but the warning makes me worried that I’m using the wrong method. I’ve tried changing nvmax and kapprox, but I think the problem is in the size of the data and how correlated it all is.

spca <- princmp(lipidomics_transf[-c(1:2)], sw = TRUE, method = "sparse", cor = FALSE, nvmax = 10)

Sparse Principal Components Analysis

Stepwise Approximations to PCs With Cumulative R^2

PC 1 
TAG.53.4_FA16.0. (0.901) + TAG.50.4_FA18.1. (0.95) + PC.16.0_18.2. (0.971) + DAG.16.0_20.5. (0.98) +
TAG.52.0_FA16.0. (0.986) + LPE.18.0. (0.988) + TAG.58.7_FA20.4. (0.99) + TAG.49.1_FA14.0. (0.992) +
TAG.50.3_FA16.0. (0.993) + TAG.54.2_FA20.0. (0.994)

PC 2 
PC.18.1_18.2. (0.76) + TAG.54.4_FA18.2. (0.908) + SM.26.0. (0.94) + PE.P.18.1_18.2. (0.954) +
PS.20.0_22.5. (0.962) + TAG.54.0_FA18.0. (0.967) + PI.16.0_20.3. (0.973) + LCER.16.0. (0.977) +
TAG.53.3_FA18.2. (0.98) + PE.P.18.0_18.2. (0.982)

PC 3 
TAG.58.9_FA22.6. (0.562) + TAG.46.0_FA18.0. (0.832) + TAG.58.8_FA18.2. (0.906) + FFA.18.0. (0.935) +
SM.20.1. (0.952) + TAG.47.2_FA18.2. (0.969) + PC.20.0_20.4. (0.974) + MAG.18.1. (0.979) +
TAG.44.1_FA16.1. (0.982) + TAG.52.4_FA18.3. (0.986)

PC 4 
TAG.54.7_FA18.3. (0.402) + DCER.24.1. (0.833) + PI.18.0_22.6. (0.866) + CER.18.0. (0.898) +
FFA.14.1. (0.917) + PC.16.0_18.1. (0.932) + PS.20.0_18.2. (0.943) + HCER.16.0. (0.952) +
TAG.56.6_FA20.2. (0.958) + PC.18.1_18.1. (0.963)

PC 5 
PS.18.0_20.0. (0.605) + SM.22.0. (0.748) + PC.18.1_20.5. (0.872) + PG.18.0_20.0. (0.893) +
TAG.54.8_FA20.4. (0.927) + PI.20.0_18.1. (0.944) + PG.20.0_22.5. (0.954) + MAG.18.0. (0.964) +
PI.18.0_18.0. (0.97) + PG.18.1_20.3. (0.976)

The data are lipidomics from about 400 patients, and I’m trying to use about 900 individual lipid species as variables. The outcome is patient severity, which I’m keeping as an ordinal factor, though of course I’m not taking the outcome into account for the sPCA part.
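With more variables than patients, exact linear dependencies among the columns are unavoidable, which may be one contributor to the warning above (this is a guess about the cause, not something confirmed for `princmp()`). A small simulated check, with made-up dimensions, shows the effect: a centered matrix with n rows has rank at most n - 1, however many columns it has.

```r
# Simulated example with more columns than rows (dimensions are arbitrary)
set.seed(2)
n <- 50
p <- 120
x <- matrix(rnorm(n * p), n, p)

# Rank of the column-centered matrix, as seen by a regression with intercept
r <- qr(scale(x, center = TRUE, scale = FALSE))$rank

r        # cannot exceed n - 1
p - r    # number of columns that are exact linear combinations of the others
```

Running the same `qr()` check on the real predictor matrix would show how many of its columns are redundant in this sense, independent of how strongly the lipids are correlated.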

I could add up all the lipid species into their classes or subclasses and try sPCA on that, if you think using lipid species is too ambitious.
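A rough sketch of that aggregation, assuming the lipid class is the part of each column name before the first dot (as in `TAG.53.4_FA16.0.`). The toy values are invented. Note that this rule would fold names such as `PE.P.18.1_18.2.` into `PE`, so subclasses would need a more careful pattern, and summing probably only makes sense on the untransformed concentration scale.

```r
# Toy matrix with column names in the style of the output above
x <- matrix(c(1, 2,   0.5, 1.5,   3, 4,   0.2, 0.1,   0.3, 0.4), nrow = 2,
            dimnames = list(NULL, c("TAG.53.4_FA16.0.", "TAG.50.4_FA18.1.",
                                    "PC.16.0_18.2.", "SM.26.0.", "SM.22.0.")))

# Class label = everything before the first dot (an assumed naming convention)
cls <- sub("\\..*$", "", colnames(x))

# Sum the species within each class, one column per class
class_tot <- sapply(split(colnames(x), cls),
                    function(v) rowSums(x[, v, drop = FALSE]))
class_tot
```

The resulting matrix has one column per lipid class and could be passed to sPCA or ordinary PCA in place of the species-level data.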

I’m sorry but I have not experimented with the SPCA function’s various parameters. If you find out more I hope you’ll report back here.