Power and sample size planning with canonical correlation analysis

doctorShawniqua · July 22, 2024, 12:26am

I am planning a study in which I aim to identify relationships among multiple related outcome variables and multiple related independent variables. The observations (patients) can have one or more attributes (5 binary outcome variables in total) and will be assessed on a number of clinical characteristics (29 in total, mostly continuous variables, though there will likely be some collinearity and thus opportunity for variable reduction here). I understand that canonical correlation analysis (CCA) can be a powerful method for ascertaining these relationships, and possibly better than running 5 separate regression models since the outcome attributes are intrinsically related. But I am trying to ascertain how much data I need to gather. I have read that a good rule of thumb for CCA is at least 10 observations per variable. This seems on the low end to me (just comparing to multivariable linear regression) unless… are the outcome variables included in this estimate? I.e., if I do no data reduction, should I aim for (29 + 5) × 10 = 340 participants completing the study? If the 10-per-variable estimate is incorrect, can someone give guidance or a reference on how to go about planning for sample size and power in this context?

Thanks!

f2harrell · July 22, 2024, 11:33am

Canonical variates are made for continuous variables. I don’t know how well they perform for binary outcomes. The effective number of predictors is more than 29 + 5, but I haven’t seen a paper showing how to derive it. Typically we need more than 20 binary observations per variable, but simulation is needed to study some model performance measures. I hope someone can point us to some good literature.
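As a rough illustration of the kind of simulation mentioned above, here is a minimal sketch in Python/NumPy. It estimates how large the first sample canonical correlation tends to be when the 29 predictors and 5 binary outcomes are truly unrelated, at a few candidate sample sizes. Everything here is an assumption made for illustration: the function names, independent standard normal predictors (no collinearity), balanced and mutually independent binary outcomes, and the number of replicates. Canonical correlations are computed by the standard QR-plus-SVD route rather than by any particular CCA package.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Sample canonical correlations via QR of the centered blocks + SVD."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)  # orthonormal basis for the column space of Xc
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly above 1

def null_first_canonical_corr(n, p=29, q=5, n_sim=200, seed=0):
    """First canonical correlation when X and Y are unrelated (assumed setup)."""
    rng = np.random.default_rng(seed)
    out = np.empty(n_sim)
    for i in range(n_sim):
        X = rng.standard_normal((n, p))                       # hypothetical predictors
        Y = (rng.standard_normal((n, q)) > 0).astype(float)   # hypothetical binary outcomes
        out[i] = canonical_correlations(X, Y)[0]
    return out

# Average "pure noise" first canonical correlation at several sample sizes.
null_means = {n: null_first_canonical_corr(n).mean() for n in (100, 340, 1000)}
for n, m in null_means.items():
    print(f"n={n}: mean first canonical correlation under the null = {m:.3f}")
```

The value printed for each n is the apparent association one would expect from noise alone, so a planned sample size can be judged by how far a plausible true canonical correlation sits above it. A fuller planning simulation would replace the independent normals with data that mimic the expected collinearity and outcome prevalences.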

Some time ago I looked at a bootstrap example of the volatility of the canonical variates, and found much more overfitting than multivariable univariate regression has.
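The overfitting described here can be quantified with an optimism bootstrap. The sketch below is not the analysis referred to in the post; it is a generic Efron-style optimism estimate for the first canonical correlation, run on simulated data. The data-generating setup (150 observations, 10 predictors, 3 continuous outcomes, one weak true relationship), the function names, and the number of resamples are all hypothetical choices for illustration.

```python
import numpy as np

def first_canonical_pair(X, Y):
    """Weights (a, b) and correlation for the first canonical variate pair."""
    Qx, Rx = np.linalg.qr(X - X.mean(axis=0))
    Qy, Ry = np.linalg.qr(Y - Y.mean(axis=0))
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    a = np.linalg.solve(Rx, U[:, 0])   # map back from the Q basis to X columns
    b = np.linalg.solve(Ry, Vt[0, :])
    return a, b, s[0]

def bootstrap_optimism(X, Y, n_boot=200, seed=0):
    """Apparent first canonical correlation, its optimism, and the corrected value."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    _, _, apparent = first_canonical_pair(X, Y)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)               # resample rows with replacement
        a, b, r_boot = first_canonical_pair(X[idx], Y[idx])
        r_orig = np.corrcoef(X @ a, Y @ b)[0, 1]       # bootstrap weights on original data
        gaps[i] = r_boot - r_orig
    optimism = gaps.mean()
    return apparent, optimism, apparent - optimism

# Hypothetical data: only the first outcome depends (weakly) on the first predictor.
rng = np.random.default_rng(1)
X = rng.standard_normal((150, 10))
Y = rng.standard_normal((150, 3))
Y[:, 0] += 0.3 * X[:, 0]

apparent, optimism, corrected = bootstrap_optimism(X, Y)
print(f"apparent={apparent:.3f} optimism={optimism:.3f} corrected={corrected:.3f}")
```

The gap between the apparent and optimism-corrected values is one way to see how much of the fitted canonical correlation is likely to vanish in new data.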

davidcnorrismd · July 23, 2024, 9:40am

It seems to me that methods such as CCA and PCA belong more in the category of exploratory methods akin to data visualization, and that they cannot be viewed as methods “to find out or learn with certainty” [1]. Given this, questions of power might create distractions from the scientific opportunities.

If this is a study in your field of post-ICU cognitive decline, then are there any psychometric theories you can avail yourself of, to posit some latent factors and their nomological connections?

[1] https://www.merriam-webster.com/dictionary/ascertain

f2harrell · July 23, 2024, 12:35pm

Good points. Sometimes I think of PCA and canonical variates as proof-of-concept or “signal existence” methods. They can be useful for reducing dimensionality and can tell you something is there, but not what it is.