Representative Covariate Settings

Suppose one wants to estimate a treatment effect in a covariate-adjusted model. When the treatment does not interact with the other covariates, the treatment effect is independent of all covariates as long as you stay on the linear predictor scale. For a Cox proportional hazards model or a logistic model, treatment effect ratios (hazard ratio or odds ratio) are also covariate-independent. But when we want to estimate other quantities such as absolute risk reduction due to treatment or difference in median survival times, the nonlinear transformations involved make the result covariate-dependent.
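
To make the covariate dependence concrete, here is a sketch for a binary logistic model with no treatment interaction. The symbols ($\alpha$ for the intercept, $\tau$ for the treatment log odds ratio, $\beta$ for the covariate coefficients, $t \in \{0, 1\}$ for treatment) are notation assumed here for illustration.

$$
\begin{aligned}
\Pr(Y = 1 \mid X, t) &= \operatorname{expit}(\alpha + \tau t + X\beta), \qquad \operatorname{expit}(u) = \frac{1}{1 + e^{-u}} \\
\text{OR} &= \frac{\operatorname{odds}(Y = 1 \mid X, t = 1)}{\operatorname{odds}(Y = 1 \mid X, t = 0)} = e^{\tau} \\
\text{ARR}(X) &= \operatorname{expit}(\alpha + X\beta) - \operatorname{expit}(\alpha + \tau + X\beta)
\end{aligned}
$$

The odds ratio is free of $X$, but the absolute risk reduction changes with $X\beta$ because expit is nonlinear.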

I am seeking an approach to finding covariate settings that yield representative results for general nonlinear transformations. For example, what is a vector X of covariate values for 4 covariates, with each of the values near its marginal covariate median, such that the predicted probability that Y=1 | X equals the non-covariate-adjusted predicted probability (i.e., dependent only on treatment)?

Note that it doesn’t work to set all the covariates to the median or mean, as these combinations may not occur in the data for various reasons including collinearity.

Here is one algorithm. It also can create covariate settings that don’t occur, but allows you to stay somewhat close to the marginal medians and has an adjustable parameter. I wonder if anyone has a better one.

  1. Consider each quantile q in the grid 0.35, 0.36, 0.37, …, 0.65
  2. Compute the q’th marginal quantile of each covariate that has an increasing relationship with Y, and the (1-q)’th quantile of each covariate that has a decreasing relationship with Y
  3. Evaluate the covariate-adjusted predicted risk at these covariate quantiles, for the control treatment arm
  4. Compute the absolute difference between this predicted risk and the marginal (non-covariate-adjusted) risk
  5. Choose the value of q that minimizes this absolute difference
  6. Save the q and (1-q) quantiles (depending on covariate direction)
  7. Use these covariate values to get desired example estimates
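
The quantile-grid steps above could be sketched roughly as follows, for a logistic model. Everything named here is hypothetical: the simulated covariates (`age`, `sbp`, `ef`), the coefficients, and the helper functions are illustration only, and the direction of each covariate is taken from the sign of its coefficient, which is one possible reading of "increasing relationship with Y". The target is the raw event proportion in a simulated control arm.

```python
import math
import random

def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def quantile(values, q):
    """Sample quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def setting_at(columns, betas, q):
    """Step 2: q-th quantile if the coefficient is >= 0, (1-q)-th otherwise."""
    return {name: quantile(vals, q if betas[name] >= 0 else 1 - q)
            for name, vals in columns.items()}

def control_risk(x, intercept, betas):
    """Step 3: predicted risk in the control arm (no treatment term)."""
    return expit(intercept + sum(betas[name] * x[name] for name in x))

def best_quantile_setting(columns, intercept, betas, target_risk):
    """Steps 1, 4, 5, 6: scan the grid and keep the q with the smallest gap."""
    best = None
    for i in range(35, 66):  # q = 0.35, 0.36, ..., 0.65
        q = i / 100
        x = setting_at(columns, betas, q)
        gap = abs(control_risk(x, intercept, betas) - target_risk)
        if best is None or gap < best["gap"]:
            best = {"q": q, "x": x, "gap": gap}
    return best

# Hypothetical control-arm data and an assumed, already-fitted model.
random.seed(1)
n = 500
age = [random.gauss(60, 10) for _ in range(n)]
sbp = [0.5 * a + random.gauss(100, 12) for a in age]  # correlated with age
ef = [random.gauss(55, 8) for _ in range(n)]          # protective covariate
columns = {"age": age, "sbp": sbp, "ef": ef}
intercept = -6.0
betas = {"age": 0.05, "sbp": 0.02, "ef": -0.04}

# Marginal (non-covariate-adjusted) control risk: the raw event proportion.
y = [1 if random.random() < expit(intercept + 0.05 * a + 0.02 * s - 0.04 * e) else 0
     for a, s, e in zip(age, sbp, ef)]
target_risk = sum(y) / n

best = best_quantile_setting(columns, intercept, betas, target_risk)
print(best["q"], best["x"], best["gap"])
```

Because risk-increasing covariates move up with q while risk-decreasing ones move down, the predicted risk is monotone in q, so the grid search has a single well-defined minimizer. The width of the grid plays the role of the adjustable parameter.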

Here is another possible algorithm—one that will find “real” combinations:

  1. Compute predicted absolute risk for all covariate combinations occurring in the data
  2. Find all of the combinations with absolute difference between predicted risk and target marginal risk less than $\epsilon$ for some $\epsilon$
  3. Choose the single covariate combination that is in some sense near the center of the data
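
A minimal sketch of this second algorithm, under stated assumptions: the data and logistic coefficients are simulated and hypothetical, the target is approximated by the average predicted risk, and "near the center of the data" is interpreted as the smallest standardized Euclidean distance from the vector of marginal medians. Other notions of centrality (e.g., Mahalanobis distance) would be equally defensible.

```python
import math
import random
import statistics

def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def central_matching_row(rows, intercept, betas, target_risk, eps):
    """Return (distance, row, risk) for the most central observed row whose
    predicted risk is within eps of target_risk, or None if no row qualifies."""
    cols = list(zip(*rows))
    med = [statistics.median(c) for c in cols]
    sd = [statistics.stdev(c) for c in cols]
    candidates = []
    for row in rows:
        # Step 1: predicted risk for this observed covariate combination.
        risk = expit(intercept + sum(b * x for b, x in zip(betas, row)))
        # Step 2: keep rows whose risk is within eps of the target.
        if abs(risk - target_risk) < eps:
            dist = math.sqrt(sum(((x - m) / s) ** 2
                                 for x, m, s in zip(row, med, sd)))
            candidates.append((dist, row, risk))
    if not candidates:
        return None
    # Step 3: the candidate closest to the marginal medians (assumed criterion).
    return min(candidates, key=lambda c: c[0])

# Hypothetical control-arm covariates (age, sbp, ef) and assumed coefficients.
random.seed(2)
rows = []
for _ in range(500):
    a = random.gauss(60, 10)
    rows.append((a, 0.5 * a + random.gauss(100, 12), random.gauss(55, 8)))
intercept = -6.0
betas = (0.05, 0.02, -0.04)

# Stand-in for the marginal risk: the average predicted risk over the data.
risks = [expit(intercept + sum(b * x for b, x in zip(betas, r))) for r in rows]
target_risk = sum(risks) / len(risks)

eps = 0.005
result = central_matching_row(rows, intercept, betas, target_risk, eps)
print(result)
```

The returned row is an actual observation, so the covariate combination is guaranteed to occur in the data. A smaller `eps` gives a closer match to the target risk at the cost of fewer candidates to choose from.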

:new: When the covariates are categorical and few in number, the following approach could be tried:

  1. Find covariate combinations for which more than 10 patients have that combination
  2. Sort the remaining combinations in descending order of frequency
  3. For each combination show the absolute difference between that combination’s predicted risk and the target average risk
  4. Choose the combination that is the “biggest bang for the buck”, i.e., that has the best tradeoff between absolute difference and cell frequency.
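
For the categorical case, steps 1 to 3 amount to building a small table, which could look roughly like this. The covariates (`sex`, `diabetes`, `stage`), the coefficients in `risk_fn`, and the simulated data are all hypothetical. Step 4 is left to judgment, since the post does not define the tradeoff formally.

```python
import math
import random
from collections import Counter

def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def combination_table(rows, risk_fn, target_risk, min_count=10):
    """List (combination, frequency, |predicted risk - target|) for every
    combination seen in more than min_count rows, most frequent first."""
    counts = Counter(rows)
    table = [(combo, n, abs(risk_fn(combo) - target_risk))
             for combo, n in counts.items() if n > min_count]
    table.sort(key=lambda r: r[1], reverse=True)
    return table

def risk_fn(combo):
    """Assumed fitted logistic model for the control arm (made-up coefficients)."""
    sex, diabetes, stage = combo
    return expit(-2.5 + 0.3 * sex + 0.6 * diabetes + 0.5 * (stage - 1))

# Hypothetical data: sex (0/1), diabetes (0/1), stage (1, 2, 3).
random.seed(3)
rows = [(random.randint(0, 1), random.randint(0, 1), random.choice([1, 2, 3]))
        for _ in range(400)]

# Stand-in for the target average risk: mean predicted risk over the data.
target_risk = sum(risk_fn(r) for r in rows) / len(rows)

table = combination_table(rows, risk_fn, target_risk)
for combo, n, gap in table:
    print(combo, n, round(gap, 3))
```

Reading down the printed table, one looks for a combination that is both common (large frequency) and close to the target (small gap).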

Keep in mind that for covariate-dependent quantities we can generate a series of estimates over varying covariate settings. When curves do not cross, statistical evidence for a treatment effect will probably not vary much across the different choices of X. But for the moment I am seeking a single vector X that would be useful for summarizing covariate-adjusted results for derived parameters.

:new: Terry Therneau et al. have some related thoughts. See also this, which discusses averaging over covariate distributions rather than making predictions for individual covariate vectors.


I have had this problem recently, mostly when communicating results; e.g., a plot of cumulative incidence with a footnote indicating the covariate values assumed often leads to the question: why those values? But I am dealing with mostly categorical covariates. I usually display curves for different levels of the covariate, but the plot becomes cluttered. I wonder if soon we will have interactive figures in journals where the reader can slide the value of the covariate along a dial (for continuous variables) to see how the estimates move; I think we are starting to see that…

An interactive display would be far better, although for the problem I’m currently working on the computation would take too long for each new covariate vector.

I think that the second method I listed is preferred when most of the covariates in question are continuous. For the categorical case I’m adding a thought to what’s above.