Last week I got really confused by 2 senior “statisticians” (statisticians more by “training”, i.e. from using the methods a lot, and from my perspective without the best fundamental knowledge, but anyway).
In our clinic we have a database with 10 years of follow-up of patients. As I am currently working on an interesting biomarker, my doctoral supervisor thought it would be a good idea to validate my model with the biomarker in this clinical cohort.
As we have no general access to this database, we had a meeting with those 2 senior researchers and some of the investigators. We introduced our idea of using survival analysis (multivariable Cox proportional hazards modeling) to investigate whether the biomarker is a marker for events, independently of some common factors.
Then the following happened:
Both told me very sharply that …
- … using survival analysis on a “big cohort” would increase the chance of a random finding.
- … it would be much better to use a “nested case-control” study design (with 1:2 matching), because then I would not need all patients, and this would be statistically better.
- … the problem with the database would be that, if they handed over all values of the biomarker for all patients and I used them for an analysis, they could not do any other analysis with those values anymore.
I do not understand the first, the second, or the third part of this statement.
Regarding the first, I am completely confused: why should a larger cohort increase the chance of a random finding?
Regarding the second, I don’t understand the added value of such an analysis (nested case-control) at all. The time factor, which is included in the survival analysis, would be lost, right? Especially concerning performance (added value assessed via the chi-squared statistic and the fraction of new information), this neglect of time should distort the results?
Regarding the third, I can only imagine a concern about multiple testing, which is not relevant in this setting of a concrete question and a fully pre-specified statistical analysis plan, or am I wrong here?
Over the last week I have read a lot, but I could not find any resolution to my confusion, or any answer showing where my thinking goes wrong…
I would be really happy, and it would help me a lot, if you could help me get a clearer view of these comments.
Thank you very much in advance!
the currently confused Tom
These statisticians are wrong on all counts.
The purpose of a nested case-control study would be to approximate the cohort survival analysis without requiring access to the full dataset. With random sampling from the study base, you can do this validly, but at the cost of limiting yourself to estimating the hazard ratio (or alternatively, the odds ratio for cumulative incidence). Conducting the full cohort analysis is always preferable, but you won’t lose much efficiency by using a case-control design, and you will estimate the same thing.
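To make the sampling scheme concrete, here is a minimal sketch of risk-set (incidence-density) sampling, the standard way a nested case-control study is drawn from a cohort. The cohort, follow-up times, and event rate below are all simulated toy values, not anything from the real database:

```python
import random

random.seed(1)

# Simulated toy cohort: each subject has a follow-up time and an event flag.
# (Hypothetical data for illustration only.)
cohort = [{"id": i,
           "time": random.expovariate(1.0),
           "event": random.random() < 0.2}
          for i in range(1000)]

def risk_set_sample(cohort, controls_per_case=2, seed=42):
    """Incidence-density (risk-set) sampling for a nested case-control study.

    For each case, controls are drawn from subjects still at risk at the
    case's event time (follow-up time >= the case's time). Note that a
    subject who becomes a case later may validly serve as a control now.
    """
    rng = random.Random(seed)
    sampled = []
    for case in (s for s in cohort if s["event"]):
        risk_set = [s for s in cohort
                    if s["time"] >= case["time"] and s["id"] != case["id"]]
        controls = rng.sample(risk_set, min(controls_per_case, len(risk_set)))
        sampled.append({"case": case, "controls": controls})
    return sampled

ncc = risk_set_sample(cohort)
print(len(ncc), "matched sets")
```

Because the matching is on time at risk, a conditional logistic regression on these matched sets estimates the same hazard ratio as the Cox model on the full cohort; the time factor is not lost, it is built into the sampling.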
The benefit of case-control sampling occurs in settings where the exposure variable is not yet entered into the dataset, and where collecting this information in everybody would be costly. In such settings, you can do the study at a fraction of the cost by using the case-control design instead of cohort survival analysis (without changing the research question or the interpretation of the estimates).
But in this case, if I understand correctly, data on the exposure variable has already been collected. Case-control sampling would therefore not save you any resources.
To be clear:
(1) Cohort analysis does not increase alpha relative to case-control
(2) Case-control is in fact more statistically efficient (for the purpose of estimating the hazard ratio) for a fixed n. That is, if your choice is between a case-control study with n=1000 (500 cases and 500 controls) and a cohort study with n=1000 in which only 100 people get the disease, the case-control study gives more information. However, this framing is disingenuous when the real choice is not between n=1000 in a case-control study and n=1000 in a cohort study, but between n=1000 in a case-control study and n=10,000 in a cohort study. If you can easily get access to all 10,000 people, you get more information from analyzing all the data.
(3) The implications of multiple testing in such settings are complicated, but will not be impacted by whether you use the case-control or cohort design
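To put rough numbers on the efficiency comparison in point (2), one can use Woolf's approximation for the variance of a log odds ratio (the sum of the reciprocals of the four cell counts). The scenario below is hypothetical: 50% exposure prevalence and no true association are assumed purely to make the arithmetic transparent:

```python
def var_log_or(cases_exp, cases_unexp, controls_exp, controls_unexp):
    """Woolf's approximate variance of the log odds ratio: sum of 1/cell."""
    return sum(1.0 / n for n in (cases_exp, cases_unexp,
                                 controls_exp, controls_unexp))

# Hypothetical scenario: 50% exposure prevalence, OR = 1.
cc = var_log_or(250, 250, 250, 250)            # case-control, 500 cases + 500 controls
cohort_1k = var_log_or(50, 50, 450, 450)       # cohort n=1000, 100 cases
cohort_10k = var_log_or(500, 500, 4500, 4500)  # cohort n=10000, 1000 cases

print(f"case-control n=1000: {cc:.4f}")
print(f"cohort n=1000:       {cohort_1k:.4f}")
print(f"cohort n=10000:      {cohort_10k:.4f}")
```

Smaller variance means more information: the balanced case-control design beats a same-size cohort with few cases, but the full 10,000-person cohort beats both.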
Perhaps the statisticians are worried that, if they gave you access to the entire dataset, you would be able to use it for other purposes, such as studying other outcome variables? Maybe they would like to “save” those other outcome variables for future students?
In addition to Anders’ excellent points, I fear that the statisticians have fallen into the trap of being obsessed with significance testing. As is common practice in biomarker research (which is a great pity), most researchers are OK with finding a marker that has a “significant” association with outcome after (inadequate) adjustment for other information. It is correct that very large samples can find “significance” even when the predictive signal is small. For this reason, such research should in my view concentrate on quantifying the predictive discrimination in the marker. Discrimination measures such as pseudo R^2 are not functions of sample size (assuming overfitting is not present).
If there is an issue of holding back data in order to avoid profiting from cherry-picking or overfitting, especially when more than a few candidate markers have been pre-specified, then resampling methods in which all supervised learning steps are repeated afresh for each resample are usually the best way to estimate the likely future performance of models containing the markers.
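A minimal sketch of what “repeating all supervised learning steps afresh for each resample” means, using bootstrap optimism correction on toy pure-noise data. Here a marker-selection step stands in for the full model-building process, and every number is made up for illustration:

```python
import random

random.seed(7)

N_SUBJECTS, N_MARKERS, N_BOOT = 100, 5, 50

# Pure-noise toy data: no marker truly predicts the outcome,
# so any apparent discrimination comes from the selection step.
y = [random.random() < 0.5 for _ in range(N_SUBJECTS)]
markers = [[random.gauss(0, 1) for _ in range(N_SUBJECTS)]
           for _ in range(N_MARKERS)]

def auc(y, x):
    """Concordance (AUC) of marker x for binary outcome y, via pairwise ranks."""
    pos = [xi for xi, yi in zip(x, y) if yi]
    neg = [xi for xi, yi in zip(x, y) if not yi]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_best(y, markers):
    """The 'supervised learning step': pick the marker with highest apparent AUC."""
    return max(range(len(markers)), key=lambda j: auc(y, markers[j]))

best = select_best(y, markers)
apparent = auc(y, markers[best])

# Bootstrap optimism: the selection step is redone afresh in each resample.
rng = random.Random(0)
optimisms = []
for _ in range(N_BOOT):
    idx = [rng.randrange(N_SUBJECTS) for _ in range(N_SUBJECTS)]
    yb = [y[i] for i in idx]
    mb = [[m[i] for i in idx] for m in markers]
    jb = select_best(yb, mb)
    # Optimism = performance on the resample minus performance on the original data.
    optimisms.append(auc(yb, mb[jb]) - auc(y, markers[jb]))

corrected = apparent - sum(optimisms) / N_BOOT
print(f"apparent AUC {apparent:.3f}, optimism-corrected {corrected:.3f}")
```

Because the selection is repeated inside every resample, the optimism estimate captures the inflation caused by picking the apparently best marker, which on noise data typically pulls the corrected AUC back toward 0.5.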
In addition to the excellent points by Anders and Frank:
The benefit of a nested case-control study is that it is 1) nested within a cohort (meaning that estimates from the nested study generalize to the entire cohort), and 2) expensive data such as biomarkers are needed only for the nested sample, not the entire cohort.
I will not speculate on motives, as I have not heard the two senior statisticians, but it seems evident that there is miscommunication. Or, as Anders suggests, your study overlaps with other promises. Maybe it is not a nested case-control design but an ordinary case-control design, or maybe the statisticians are worried about some other design aspect. Trying to be an advocate for two friends, however: I have several times encountered a situation where a researcher wants to apply survival analysis to data that are not suited for it, i.e. data that are conditioned on the future, such as individuals being sampled, alive, or exposed at some future time point.
I do not know what you mean by “statisticians”. However, a person who is trained in statistical methods can easily understand survival analysis and a nested case-control design, and when each is appropriate.