When is censoring too high?


i feel that there must be a better reference than this:

Censoring of survival data is also of important influence on research result. Too high rate of censor will be lower accuracy and effectiveness of analysis result of an analytical model, increasing risk of bias. Hence, the rates of censor should be reported in articles. The result shows no articles report the censoring rate, but many articles have the phenomenon of excessive rate of censoring. For example, the calculation shows the study done by Xuexia et al[32] has censored rate up to 84%, severely influencing the results.

the paper is here: Reporting and methodological quality of survival analysis in articles published in Chinese oncology journals. But their reference 32 is impossible to track down. I am dealing with datasets with massive N, and thus the censoring rate is high. My impulse is to use logistic regression, but the convention in the literature is to apply Cox regression. Any thoughts, opinions, or papers you can point to, e.g. simulation studies evaluating how the methods perform under heavy censoring? cheers

edit: and here is another dubious source, an antiquated math message board (query from 2007):

When there is a large proportion of censored observations, the likelihood function is skewed towards parameter values which yield long lifetimes. Also, the likelihood function is less sharply peaked (i.e. greater range of plausible values). These two features make conventional maximum likelihood inference less useful.

My advice to you is to take a Bayesian approach. In addition to the likelihood function, it may be possible to exploit information from the shape of the age distribution of the population at present; the age distribution is an observable consequence of the survival function (and the birth rate, if that is not stationary), and so it should be possible to incorporate the present age distribution in the likelihood function. (You heard it here first, folks. A few years ago I tried to follow through on this idea and didn’t get far.)



I don’t get any of that. If over a study period [0, \tau] the probability of an event is 0.01, you’ll have 0.99 censored. The likelihood function is still perfectly valid, and the survival curve S(t) within that interval may be very well estimated. The log-likelihood may become more non-quadratic, so there is more need for likelihood ratio tests instead of Wald tests.
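To make that concrete, here is a minimal numpy sketch (my own toy example, not from any of the papers above): exponential event times with a 1% event probability by \tau, so roughly 99% administrative censoring, and a hand-rolled Kaplan-Meier estimate of S(t) still lands essentially on top of the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential event times with P(event by tau) = 0.01, so ~99% of
# subjects are administratively censored at tau -- yet S(t) on [0, tau]
# is estimated almost perfectly.
n, tau = 100_000, 1.0
lam = -np.log(0.99) / tau                  # rate giving 1% events by tau
t_event = rng.exponential(1 / lam, n)
time = np.minimum(t_event, tau)            # observed follow-up
event = (t_event <= tau).astype(int)       # 1 = event, 0 = censored

def km(time, event, t):
    """Kaplan-Meier estimate of S(t) at a single time point (no ties)."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    n_at_risk = len(time) - np.arange(len(time))
    factors = 1.0 - 1.0 / n_at_risk[(time <= t) & (event == 1)]
    return float(np.prod(factors))

s_hat, s_true = km(time, event, 0.5), float(np.exp(-lam * 0.5))
print(event.mean())        # ~0.01, i.e. ~99% censoring
print(s_hat, s_true)       # both very close to 0.995
```

Note that with 100,000 subjects there are still on the order of 1,000 events, which is the real currency here: effective precision scales with the number of events, not with the censoring fraction per se.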

The real issues with censoring are:

  • censoring needs to happen for a reason unrelated to the impending risk of the event, and
  • when the total sample size or the number of uncensored observations is small, one cannot estimate the hazard rate with precision or make relative comparisons

Think about a study in which 10,000 patients are followed for 1y and no one has an event. The survival curve S(t) is 1.0 until at least 1y and the compatibility interval for it will be very, very tight. This will be a great outcome. It’s just that you can’t compare two treatments on a relative scale (e.g., hazard ratio) when no one has an event. But a confidence interval for the difference in survival probability will work just fine.
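That zero-event case can be quantified with the classic exact bound (a minimal sketch; the "rule of three" approximation is a standard result, not something from this thread). With 0 events out of n, the one-sided 95% upper confidence bound on the event probability solves (1 - p)^n = 0.05:

```python
# 10,000 patients, 1y follow-up, zero events: S(1y) = 1.0 exactly, and
# an exact one-sided 95% upper confidence bound on the 1-year event
# probability solves (1 - p)^n = 0.05.
n = 10_000
p_upper = 1 - 0.05 ** (1 / n)      # exact bound; ~= 3/n ("rule of three")
print(p_upper)                     # about 0.0003
```

So even with zero events the study pins the 1-year event probability below about 3 in 10,000, which is exactly the "very, very tight" interval described above.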

Don’t switch to logistic regression because of censoring. Logistic regression has no way to handle variable follow-up time, and it loses the time-to-event information.



really appreciate your comments. will respond more fully when i have thought more about it and looked into it further. i stumbled upon this simulation study relating to the non-PH situation: Bias Of The Cox Model Hazard Ratio:

For random and late censoring the estimate decreases (higher bias) with increasing censoring proportion but remains stable relative to sample size.
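The general phenomenon behind that finding is easy to reproduce in a toy simulation (entirely my own construction, not the cited study's design, so the direction of the shift here depends on my chosen shape of HR(t)): under non-PH, the fitted Cox coefficient is an average of log HR(t) weighted by when events are observed, so it moves with the censoring pattern.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Non-PH toy example: group 0 has constant hazard (exponential), group 1
# a decreasing hazard (Weibull, shape 0.5), so the true HR(t) declines
# over time.  Early vs late administrative censoring changes which part
# of HR(t) the partial likelihood averages over.
n = 5_000
x = np.r_[np.zeros(n), np.ones(n)]            # group indicator
t = np.r_[rng.exponential(1.0, n),            # group 0: hazard = 1
          rng.weibull(0.5, n)]                # group 1: HR(t) = 0.5/sqrt(t)

def cox_beta(time, event, x):
    """Two-group Cox MLE via the partial likelihood (continuous, no ties)."""
    order = np.argsort(time)
    e, xs = event[order], x[order]
    n_tot = len(xs) - np.arange(len(xs))      # at risk just before each time
    n1 = np.cumsum(xs[::-1])[::-1]            # at-risk subjects with x = 1
    n0 = n_tot - n1
    def nll(b):
        return -np.sum(e * (b * xs - np.log(n0 + n1 * np.exp(b))))
    return minimize_scalar(nll, bounds=(-5, 5), method="bounded").x

betas = {}
for tau in (0.25, 5.0):                       # early vs late end of follow-up
    event = (t <= tau).astype(float)
    betas[tau] = cox_beta(np.minimum(t, tau), event, x)
    print(tau, round(betas[tau], 2))          # the fitted "HR" depends on tau
```

In this particular setup the early-censored fit emphasizes the early period where HR(t) > 1, so the two estimates differ substantially; the point is not the direction but that under non-PH the Cox "HR" is an estimand that shifts with the censoring distribution.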



Good point. I think this is more a goodness-of-fit problem than a pure censoring problem. That is a separate question, but in general I think we need to move to capturing uncertainty about proportional hazards, proportional odds, normality, etc. using full Bayesian models.