Predicted probabilities for the entire population with nested case-control studies

Following the last comments on a previous datamethods thread, I have a question about how to calculate the post-test probability from a binary logistic regression model fitted to a nested case-control design, so that the estimated probabilities are close to those from a model fitted to the full cohort. To make the question easier to understand, at the cost of some length, I propose a thought experiment.
Two residents try to predict death in critically ill patients admitted for infection. There are no censored data and deaths occur within 2-3 days, so the outcome can be modeled with logistic regression.
James reviews 1200 patients admitted with infection over the last 10 years, finds 120 deaths (10%), and develops a model with 6 predictors.
The other resident, Henry, does not have time to review 1200 medical records, and after thinking about it for a while concludes that it might not be necessary to compare 120 deaths against 1020 survivors. To solve his time problem, he designs a case-control study nested within the complete cohort. That is, the cases are the 120 patients who died, and the controls are a random sample of the patients who did not die, at a case-to-control ratio of 1:4. Henry develops a binary logistic regression model similar to James's.
When Henry and James compare the predictions from their respective equations, the results differ. Although they both know that the death rate is 10%, in Henry’s series deaths make up 25% because of the sampling, so the intercepts differ.
What is the best solution for analyzing Henry’s data?

If you are asking what I think you are asking, then the solution in logistic regression is to shift the intercept of the model by the log of the ratio of sampling fractions: b = log(pr(S=1|case)/pr(S=1|control)). If LP is the linear predictor estimated from the (nested) case-control study, the risk in the full cohort is given by logit^{-1}(LP - b). This is a very cool property of the odds ratio. It is an important step towards understanding why we can use logistic regression for case-control studies without explicitly acknowledging the very biased study design. This is covered in many papers, including Anderson 1972 and Prentice and Pyke 1979, among others.
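
As a minimal sketch of why only the intercept is affected, consider a single binary predictor, where the logistic fit has a closed form from the 2x2 table. The counts and the control sampling fraction below are hypothetical (not taken from the thread), and expected counts are used in place of a random draw so the result is exact.

```python
import math

def logit_fit_binary(cases_x0, controls_x0, cases_x1, controls_x1):
    """Closed-form logistic fit for one binary predictor (saturated model).

    Returns (intercept, slope): the intercept is the log odds at x=0 and
    the slope is the log odds ratio for x=1 versus x=0.
    """
    intercept = math.log(cases_x0 / controls_x0)
    slope = math.log(cases_x1 / controls_x1) - intercept
    return intercept, slope

# Hypothetical full-cohort counts (illustrative only, not from the thread).
deaths_x0, survivors_x0 = 60, 840
deaths_x1, survivors_x1 = 60, 240

# Assumed sampling fractions: every case kept, half of the controls kept.
frac_cases = 1.0
frac_controls = 0.5

full = logit_fit_binary(deaths_x0, survivors_x0, deaths_x1, survivors_x1)

# Expected counts in the nested case-control sample.
nested = logit_fit_binary(
    deaths_x0 * frac_cases, survivors_x0 * frac_controls,
    deaths_x1 * frac_cases, survivors_x1 * frac_controls,
)

# Log of the ratio of sampling fractions (cases over controls).
b = math.log(frac_cases / frac_controls)

print("full cohort (intercept, slope):  ", full)
print("nested sample (intercept, slope):", nested)
print("nested intercept minus b:        ", nested[0] - b)
```

In this sketch the slope is the same in both fits, and subtracting b from the nested intercept recovers the full-cohort intercept. It relies on sampling depending only on case status, not on x.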

Thank you!!!
Indeed, that’s what I was asking. So I’m going to check whether I’ve understood it with an experiment. In the hypothetical nested case-control study, the probability of the event is 0.255. I know (from prior data) that in the general population it is actually 0.134. So I calculate log(0.134/0.255) and add the result to the intercept of the logistic equation obtained from the nested case-control study, to modify the baseline log odds. For example, if the intercept from the nested case-control study is -1.3180, then to calculate the probability in the entire population I would use -1.3180 + log(0.134/0.255) as the intercept of the logistic formula, leaving the rest of the terms of the linear predictor unchanged. Is that right?

Nope. You need to use the log of the ratio of sampling probabilities (cases divided by controls) and not log of the ratio of event rates. In your example:

pr(S=1|case) = 120/120 = 1, since all cases are sampled
pr(S=1|control) = 480/1020 = 0.47, since 480 of 1020 controls are sampled.

So, the offset (or intercept shift) is log(1/0.47), and this is the quantity you subtract from -1.3180.
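
The arithmetic for the offset can be sketched as follows, using the counts quoted in this thread. The function name `sampling_offset` is just an illustrative helper, not part of any library.

```python
import math

def sampling_offset(cases_sampled, cases_total, controls_sampled, controls_total):
    """Log of the ratio of sampling fractions, cases over controls."""
    p_case = cases_sampled / cases_total
    p_control = controls_sampled / controls_total
    return math.log(p_case / p_control)

# Counts as quoted in the thread: all 120 cases, 480 of 1020 controls.
p_control = 480 / 1020
offset = sampling_offset(120, 120, 480, 1020)

print(round(p_control, 3))  # 0.471
print(round(offset, 3))     # 0.754
```

Using the unrounded fraction 480/1020 rather than 0.47 changes the offset only in the third decimal place.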

Does this make sense?
js

Well… I’m not sure. Your interpretation of the question is perfect.
But in the complete cohort I am using as an example, the linear predictor is -2.1340 + 1.0930*x. Applying the logistic equation, the probability of dying is 0.261 for patients with x=1 and 0.1058 for patients with x=0. These are the probabilities that I would like to recover, approximately, from the nested case-control study. To try this, I took a sample of the entire cohort as explained above and simulated the nested case-control design, obtaining the equation -1.3180 + 0.900*x. So the intercept should decrease in order to approximate the predicted probability in the entire population. Do you mean that I should use -1.3180 - log(1/0.47) as the intercept? Is that the solution?

Yes! That is correct.
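
A short sketch of applying the corrected intercept, using the rounded coefficients quoted above (-1.3180 and 0.900) and the rounded sampling fraction 0.47. The helper names `expit` and `population_risk` are illustrative assumptions, not from any package.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def population_risk(x, intercept_cc, slope_cc, offset):
    """Risk in the full cohort from a case-control fit.

    The offset (log ratio of sampling fractions, cases over controls)
    is subtracted from the case-control intercept.
    """
    return expit(intercept_cc - offset + slope_cc * x)

offset = math.log(1.0 / 0.47)   # sampling-fraction offset from the thread
intercept_cc = -1.3180          # nested case-control intercept (as quoted)
slope_cc = 0.900                # nested case-control slope (as quoted)

risk_x0 = population_risk(0, intercept_cc, slope_cc, offset)
risk_x1 = population_risk(1, intercept_cc, slope_cc, offset)
print("x=0:", risk_x0)
print("x=1:", risk_x1)
```

Because the coefficients quoted in the thread are rounded, the output may not reproduce the poster's own percentages exactly.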

Thank you very much for spending part of your time teaching me a basic statistical point I didn’t know. I thought it was very nice. Now comes the hard part. According to the calculation, the probability of death in the general population, estimated with the data from the nested case-control study, would be 25.7% with X=1 and 11.1% with X=0. This agrees quite well with the real percentages, which are 26.1% with X=1 and 10.6% with X=0. The question now is how to calculate the 95% confidence intervals for that prediction, bearing in mind that the standard errors are 0.209 for X and 0.114 for the intercept (both from the nested study). I don’t know what should be used as the standard error of the new intercept.
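
One possible way to approach this, offered only as a sketch under stated assumptions: if the sampling fractions are treated as known constants fixed by design, the shift adds no variance, so the shifted intercept would keep the standard error of the nested-study intercept (taken here at face value). A Wald interval can then be built on the logit scale and back-transformed. For x other than 0, the covariance between intercept and slope is also needed; it is not reported in this thread, so it is left as a required argument, and only x=0 is evaluated.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def risk_with_wald_ci(x, intercept_cc, slope_cc, offset,
                      se_intercept, se_slope, cov_int_slope, z=1.96):
    """Predicted cohort risk and a Wald interval built on the logit scale.

    Assumes the offset is a known constant (no sampling variability), so
    only the variability of the fitted coefficients enters the variance.
    """
    lp = intercept_cc - offset + slope_cc * x
    var_lp = se_intercept**2 + (x**2) * se_slope**2 + 2 * x * cov_int_slope
    se_lp = math.sqrt(var_lp)
    return expit(lp), expit(lp - z * se_lp), expit(lp + z * se_lp)

offset = math.log(1.0 / 0.47)

# At x=0 the slope variance and the covariance term are multiplied by zero,
# so the (unreported) covariance does not matter; 0.0 is only a placeholder.
est, lower, upper = risk_with_wald_ci(
    x=0, intercept_cc=-1.3180, slope_cc=0.900, offset=offset,
    se_intercept=0.114, se_slope=0.209, cov_int_slope=0.0,
)
print("x=0 risk:", est, "95% CI:", (lower, upper))
```

For x=1 the same function applies once the covariance is read from the fitted model's variance-covariance matrix. Whether the model-based intercept standard error is appropriate under the particular control-sampling scheme is an assumption of this sketch, not something established in the thread.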