I’m new to datamethods but I wondered if I could check that I’ve understood a few things correctly.

I’m planning to use data from patient files over 5 years to do a survival analysis of patients with cervical cancer. I have 170 patients and expect I will have an event for about 100 of them. Stats is not my strong point so I wanted to check a few questions about the statistics I’m planning to do:

I am interested in comparing the survival curves between patients based on multiple different variables – age, parity, HIV status, whether they received chemotherapy etc. In total I have 7 variables of interest. If I run a Cox proportional hazards regression, am I correct in thinking I will need at least 80 events? (to ensure I have a minimum of 10 events per variable)
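As a quick check of the arithmetic behind the "10 events per variable" rule of thumb mentioned above, here is a minimal sketch. It assumes the rule is applied per model parameter (so a categorical variable with three levels counts as two parameters); the function names are made up for illustration. Under that assumption, 7 single-parameter variables give 70 events, and 80 events would correspond to 8 parameters.

```python
def events_needed(n_parameters, events_per_parameter=10):
    """Minimum number of events implied by an events-per-variable rule of thumb."""
    return n_parameters * events_per_parameter


def max_parameters(n_events, events_per_parameter=10):
    """Largest number of model parameters the rule of thumb would allow."""
    return n_events // events_per_parameter


# 7 variables that each contribute one parameter (hypothetical coding)
print(events_needed(7))
# 8 parameters, e.g. if one of the 7 variables has three categories
print(events_needed(8))
# With roughly 100 expected events
print(max_parameters(100))
```

This is only the rule of thumb as stated in the question, not a substitute for a proper sample size calculation.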

Can I also check I’ve understood Cox proportional hazards regression correctly – I don’t need to correct for doing multiple analyses despite having multiple variables that will be compared? (I’m struggling a bit to understand the difference between the log-rank test and Cox regression - from what I’ve read, you never really need to do a log-rank test? But I had initially wondered about doing a log-rank test for the survival curves for each variable, which is what made me think maybe multiple analyses would be a problem).

From what I’ve read, I should be checking my investigation is adequately powered. I’ve been advised to use G*Power for this, but I’m confused because I need an R² value for my variables in order to do this. How can I calculate the R² before I do the regression analysis? Is it based on previous findings by others? If I can’t find any R² values (which I can’t so far!) can I do a power analysis at the end once I’ve done the regression instead?

Throwing all your variables into one regression model does not circumvent the issue of multiple comparisons. If that’s your motivation for putting all the variables into one regression model, then you should rethink that.

From what I’ve seen, G*Power does not support Cox proportional hazards regression, so how are you using it to do a power analysis in this case?

For the multiple comparisons - am I right in thinking it is a balance between including enough variables in the model that I don’t leave out a major contributing factor, and including so many that multiple comparisons become an issue? Is there any advice about how to balance this, or is it just a case of trying to make a sensible decision?

Ah I see. The site you linked to is showing G*Power’s functionality for doing power analysis for multiple linear regression. This is not going to be the same as a power analysis for a Cox proportional hazards model. Different types of regression (linear, logistic, Poisson, Cox, etc.) all make different assumptions about the underlying data and estimate different parameters, so in general each needs its own method for power analysis. And as you’ve already seen, power analyses for any type of multiple regression model are going to be more complicated than the standard unadjusted power analysis methods that textbooks tend to focus on.
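To illustrate how a Cox-specific calculation differs from the linear regression one, here is a minimal sketch of Schoenfeld's approximation for the number of events needed to detect a given hazard ratio for a single binary covariate. Note that it works from a hazard ratio rather than an R² value. It is an unadjusted calculation (no other covariates in the model), and the hazard ratio, the proportion exposed and the function names are all assumptions chosen for illustration, not values from this study.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hazard_ratio, prop_exposed=0.5, alpha=0.05, power=0.80):
    """Approximate number of events needed (Schoenfeld's formula).

    Two-sided test of a single binary covariate in an unadjusted Cox model.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    log_hr = math.log(hazard_ratio)
    d = (z_alpha + z_beta) ** 2 / (prop_exposed * (1 - prop_exposed) * log_hr ** 2)
    return math.ceil(d)


def schoenfeld_power(n_events, hazard_ratio, prop_exposed=0.5, alpha=0.05):
    """Approximate power for a given number of events (same assumptions)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    effect = math.sqrt(n_events * prop_exposed * (1 - prop_exposed)) * abs(math.log(hazard_ratio))
    return z.cdf(effect - z_alpha)


# Hypothetical inputs: hazard ratio of 2, half of patients exposed
print(schoenfeld_events(2.0))
print(round(schoenfeld_power(100, 2.0), 2))
```

The point is that the inputs are the number of events and the effect size on the hazard ratio scale. Adjusting for other covariates, or unequal group sizes, changes the answer, so treat this as a rough starting point only.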

Not necessarily. Remember the purpose of correcting for multiple comparisons is to control the overall Type I error rate for a set (also called a family) of hypothesis tests that you want to perform. How many tests you do, and which ones should be grouped together into a family for the purpose of correcting for multiple comparisons, depends on the goals of the analysis, which should be based on the underlying scientific question(s) of interest. So someone fitting a regression model with 7 predictors might do a correction for 7 tests, they might correct for doing 2 (or 3 or 4) tests, or they might not make any corrections for multiple comparisons at all. Or they might forgo hypothesis testing altogether. It all depends on what the goal of their study is. So let’s take a step back and address this question: what is the goal of your study? What scientific question(s) are you looking to answer?
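To make the idea of a "family" of tests concrete, here is a small sketch of how the overall Type I error rate grows with the size of the family, and what a Bonferroni correction would do to the per-test threshold. It assumes independent tests, each run at the 0.05 level, which is a simplification; Bonferroni is just one common correction, used here for illustration.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Family sizes from the example above: 1 test, 3 tests, or all 7 predictors
for m in (1, 3, 7):
    print(m, round(familywise_error(m), 3), round(bonferroni_alpha(m), 4))
```

The size of the family is an input here, not something the code can decide. That choice comes from the scientific question.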