Simulating survival data to evaluate statistical power for survival analysis

I’m currently working on an informal simulation study to show the dangers of dichotomizing a continuous variable. I’ve been working a lot lately with genetic and molecular prognostic markers, and the literature is filled with studies that divide patients into low-expression and high-expression groups and then compare survival between the groups. I’ve encountered many researchers who seem to think this is the proper way to analyze continuous variables, justified in part by the fact that so many published papers dichotomize.

The goal of the simulation study is to demonstrate that a continuous method (Cox regression) has higher statistical power than a dichotomous method (Kaplan-Meier analysis of dichotomized groups). I simulate data from an exponential distribution using the formula derived by Bender et al. and code from the example in Section 18.3.3 of the BBR notes. The variable of interest is a continuous covariate generated from a standard normal distribution that linearly affects the hazard function. My results for data simulated with \lambda_0 = 0.05 and a maximum follow-up time of 30 months are shown below.
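Concretely, the Bender et al. inversion formula for an exponential baseline hazard is T = -\log(U) / (\lambda_0 \exp(\beta x)) with U \sim \text{Uniform}(0, 1); this is what the simulation code below implements.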


Figure 1. Statistical power comparing median-cutpoint Kaplan-Meier analysis (Median KM) with linear and nonlinear Cox regression. Error bars are 95% confidence intervals; the dashed red line marks 80% power. HR = hazard ratio.

For each combination of N and hazard ratio, I simulated datasets from 5 different seeds to quantify the uncertainty arising from the choice of data-generating seed. For each seed, I generated 200 datasets, fit the three models to each dataset (log-rank test on the median cutpoint, and linear and nonlinear Cox regressions), and computed each model’s power for that seed. Using \alpha = 0.05, power was calculated as the number of datasets with p < 0.05 divided by 200 (the total number of datasets for that seed). This yielded 5 point estimates of power, from which I computed 95% confidence intervals (the error bars on the graph). The median was used as the cutpoint for the dichotomous model, and a restricted cubic spline with 3 knots from the rcs function in rms was used for the nonlinear Cox model.
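In sketch form, the calculation for one N/hazard-ratio cell looks like the following (one_pvalue() is a hypothetical stand-in for one simulate-and-fit run, and seeds is the vector of 5 seeds; the full code is below):

# Sketch only: one_pvalue() and seeds are hypothetical placeholders
powers <- sapply(seeds, function(s) {
  set.seed(s)
  mean(replicate(200, one_pvalue()) < 0.05) # per-seed power estimate
})
mean(powers) + c(-1.96, 1.96) * sd(powers) / sqrt(length(powers)) # 95% CI across seeds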

Although my results recapitulate the statistical literature, which states that continuous methods have more power than dichotomous methods, I was surprised by the relatively poor performance of the nonlinear model compared to the linear model. The poor performance doesn’t make sense to me, given that Harrell states in Regression Modeling Strategies, Section 2.4:

A better approach that maximizes power and that only assumes a smooth relationship is to use regression splines for predictors that are not known to predict linearly.

Is the nonlinear Cox model performing poorly because the continuous covariate is linearly related to the log hazard? If the true relationship is linear, should we expect subpar power from a nonlinear model? Or is my simulation flawed, and should we actually expect the nonlinear Cox model to have power equivalent to the linear Cox model? In practice it often makes more sense to assume less and treat continuous variables as potentially nonlinear, but I’m curious whether we could actually be harming our study by using a nonlinear method when the true relationship is linear. I’m also interested in thoughts on whether I should be using a more complex baseline distribution such as the Weibull or Gompertz (a sketch of how I might simulate the Weibull case follows this paragraph).
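For reference, Bender et al. also give closed-form inversions for Weibull and Gompertz baselines. A minimal sketch of the Weibull case, with parameter names of my own choosing:

# Sketch: Weibull survival times via the Bender et al. inversion
# baseline hazard h0(t) = lambda * nu * t^(nu - 1); nu = 1 recovers the exponential
simulate_weibull_times <- function(N, lambda, nu, beta, covariate) {
  u <- runif(N)
  (-log(u) / (lambda * exp(beta * covariate)))^(1 / nu)
}
# e.g. simulate_weibull_times(100, 0.05, 1.5, log(1.2), rnorm(100))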

Here’s my code:

# File: simulations.R
library(survival)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
library(rms)

simulate_data <- function(N, maxtime, beta, baseline_hazard, seed) {
  # See Bender et al. 2005 for more information on parametric survival simulation
  # Assume continuous covariate follows a standard normal distribution
  covariate_mean <- 0
  covariate_sd <- 1
  covariate <- rnorm(N, covariate_mean, covariate_sd) # create covariate
  cens <- maxtime * runif(N) # uniform censoring times, independent of survival
  h <- baseline_hazard * exp(beta * (covariate - covariate_mean)) # subject-specific hazard
  dt <- -log(runif(N)) / h # inverse-CDF (Bender) formula: T ~ Exponential(rate = h)
  e <- ifelse(dt <= cens, 1, 0) # event indicator: 1 = event observed, 0 = censored
  dt <- pmin(dt, cens)          # observed time = min(event time, censoring time)
  S <- Surv(dt, e)
  linear.fit <- coxph(S ~ covariate)                      # linear Cox model
  median.fit <- coxph(S ~ covariate > median(covariate))  # dichotomized at the median
  nonlinear.fit <- cph(S ~ rcs(covariate, 3))             # restricted cubic spline, 3 knots
  linear.pvalue <- coef(summary(linear.fit))[1, 5]        # Wald p-value, Pr(>|z|) column
  median.pvalue <- coef(summary(median.fit))[1, 5]
  nonlinear.pvalue <- anova(nonlinear.fit)['covariate', 'P'] # 2 d.f. overall association test
  c(linear = linear.pvalue,
    median = median.pvalue,
    nonlinear = nonlinear.pvalue,
    samples = N,
    beta = beta,
    seed = seed)
}

simulate_trial <- function(n_sim, N, maxtime, beta, baseline, seed) {
  set.seed(seed)
  results <- replicate(n_sim, simulate_data(N, maxtime, beta, baseline, seed))
  data.frame(t(results))
}

build_design_matrix <- function(initial_seed, num_seeds, sample_sizes, betas) {
  set.seed(initial_seed)
  seeds <- sample.int(100000, num_seeds)
  expand.grid(
    N = sample_sizes,
    logHR = betas,
    seed = seeds
  )
}

compute_power <- function(simulation_results) {
  simulation_results %>%
    gather("linear", "median", "nonlinear",
           key = "model", value = "pvalue") %>%
    mutate(model = fct_recode(factor(model),
                              `Linear Cox` = "linear",
                              `Median KM` = "median",
                              `Nonlinear Cox` = "nonlinear")) %>%
    group_by(model, samples, beta, seed) %>%
    summarise(power = sum(pvalue < .05) / n()) %>% # per-seed power estimate
    group_by(model, samples, beta) %>%
    summarise(mean_power = mean(power),
              se = sd(power) / sqrt(n()),          # standard error across seeds
              lower_ci = mean_power - 1.96 * se,
              upper_ci = mean_power + 1.96 * se) %>%
    mutate(HR = paste("HR =", round(exp(beta), 2)))
}

plot_results <- function(power_results) {
  # Error bars are 95% confidence intervals, not +/- 1 standard error
  power_results %>%
    ggplot(aes(samples, mean_power)) +
    geom_point() +
    geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), width = 20) +
    geom_line(aes(color = factor(model))) +
    facet_wrap(~HR) +
    geom_hline(yintercept = .8, color = "red", linetype = "dashed") +
    labs(y = "Power") +
    guides(color = guide_legend(title = "Method")) +
    scale_x_continuous("Samples", breaks = seq(0, 700, by = 100), limits = c(0, 700))
}

run_simulation <- function(save_figure = FALSE) {
  n_sims <- 200
  maxtime <- 30
  baseline <- .05
  
  sample_sizes <- c(100, 200, 300, 400, 500, 600, 700)
  betas <- log(c(1.0, 1.1, 1.15, 1.2, 1.25, 1.3))
  setup_seed <- 42
  n_seeds <- 5
  
  design <- build_design_matrix(setup_seed, n_seeds, sample_sizes, betas)
  results <- design %>%
    rowwise() %>%
    do(simulate_trial(n_sims, .$N, maxtime, .$logHR, baseline, .$seed))
  power_results <- compute_power(results)
  power_plot <- plot_results(power_results)
  print(power_plot)
  if(save_figure) {
    ggsave(file.path(getwd(), "power_plot.png"), width = 12, height = 8)
  }
}

run_simulation(save_figure = TRUE)

A wonderful post and excellent work. It would be good to refer to simpler cases showing the same thing with binary Y, as in an interactive demo I have in RMS.

When you condition on the true covariate relationship being linear, and you don’t use Bayesian methods that tilt the prior towards linearity, you will lose a bit of power, as you nicely demonstrated. The critical value for a \chi^2 test with 2 d.f. exceeds the critical value for a test with 1 d.f. The value of allowing for nonlinearity kicks in when you allow for the possibility of nonlinearity in how the data are generated. But note that the more flexible 2 d.f. model, where 1 d.f. was unneeded, is still more powerful than the dichotomized model, which requires a massive assumption (why dichotomize at the median?) and gives inappropriate predicted values (a step function with a zero-order discontinuity).
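For concreteness, the 0.95 critical values can be checked directly in R:

qchisq(0.95, df = 1) # 3.84: critical value for the 1 d.f. linear test
qchisq(0.95, df = 2) # 5.99: critical value for the 2 d.f. spline test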
