RMS Discussions

Thank you for your kind reply!
According to my understanding, the model resampling process does not affect the estimation of the restricted cubic spline, so this warning should have no effect on my subsequent prediction step, like this:

OR <- Predict(fit2, Totalcholesterol, fun=exp, ref.zero = TRUE)

Is this understanding correct?
Thanks!

Dear prof. Harrell,

Thank you so much for taking the time to review the code and provide the solution. Thank you also for the additional coding suggestions.
One minor question about the use of confidence bands instead of pointwise confidence intervals. Do you, in general, recommend bands instead of pointwise CIs? Also, it seems that for the bands we control for multiplicity simultaneously across all values of the continuous covariate (100 values of age in the example above). Should we not adjust for multiplicity only for the number of parameters we are estimating, 5 coefficients in our case, to get the confidence band?

If I understand your question, yes — resampling doesn’t change the point estimates; it just tells you how unstable they are by looking at model and coefficient volatility across multiple model fits.
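A minimal sketch of this point (made-up data and model; assumes the rms package is installed): the bootstrap run inside validate leaves the fitted coefficients untouched and only reports how optimistic the apparent performance is.

```r
library(rms)

set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))

# x=TRUE, y=TRUE store the design matrix and response so validate() can resample
f  <- lrm(y ~ rcs(x, 3), data = d, x = TRUE, y = TRUE)
cf <- coef(f)

v <- validate(f, B = 100)   # 100 bootstrap repetitions

# the point estimates are exactly what they were before resampling
identical(cf, coef(f))
```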

First we need to get airtight nomenclature. A confidence band can refer to either pointwise or simultaneous coverage; band refers to the graphical styling. And yes, we need to favor simultaneous bands over pointwise ones because in effect we are making a lot of comparisons. The beauty of simultaneous bands is that once you obtain estimates at n values of X, the bands will be no wider than when estimating at k values of X, where k < n and k ≥ p + 1, with p being the number of parameters estimated for the X effect. The dimensionality of the multiplicity is p, not the number of requested predictions n, which can be very large and include many estimates that are just linear combinations of other estimates.

Thank you for this reply. Does this mean that my implementation of the simultaneous confidence band is incorrect, since I have included 1:100 values of age in the contrast function call (which, if I understand correctly, means the multiplicity is adjusted for all 100 provided values)? Or does the contrast function call actually adjust for the number of independent parameters estimated in the fitted model?

You’ll see in the output that contrast puts an asterisk next to all the linearly redundant contrasts (which will be most of them) so all of this is taken care of automatically and you don’t have to worry.
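A minimal hypothetical example of requesting a simultaneous band through contrast (made-up data; assumes rms and its multcomp dependency are installed; the reference age of 50 is arbitrary):

```r
library(rms)

set.seed(2)
n   <- 300
age <- runif(n, 20, 80)
y   <- (age - 50)^2 / 200 + rnorm(n)
d   <- data.frame(age, y)
dd  <- datadist(d); options(datadist = 'dd')

f <- ols(y ~ rcs(age, 4), data = d)

# simultaneous contrasts of several ages against a reference age of 50;
# linearly redundant contrasts are flagged with '*' in the printout
ages <- setdiff(seq(25, 75, by = 5), 50)
k <- contrast(f, list(age = ages), list(age = 50),
              conf.type = 'simultaneous')
print(k)
```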

This is really great, thank you so much! If I may expand a little on this, is it possible to construct a simultaneous confidence band with a constant width at each point of the covariate x? In one sense it is natural that at the extremes of the range of x, where the data are sparse, the pointwise confidence interval is wider, but I would have thought that a simultaneous confidence band should not be affected by the sparsity of data at the extremes. Say we have only p parameters to estimate and the sample size n is substantially larger than p; then the estimation should be quite good regardless of the sparsity of data at some values of x. Is this reasonable thinking? Again, guidance on this question is much appreciated.

I don’t know how to construct such a region, and it would need to be too wide in the center, where we have more confidence.

Dear Experts:
I am an epidemiologist doing a dose-response analysis. I drew a restricted cubic spline using the rms package in R. I found that the values of the exposure (the x-axis values of the restricted cubic spline) could be listed using rms::Predict, but I could not understand these values; they do not seem to be the values of the original exposure.

For example, I made a sample dataset (named dt) including two variables (age and outcome), and drew the restricted cubic spline.

library(data.table)
library(ggplot2)
set.seed(1234)
dt <- data.table(age = floor(runif(100, min  = 40, max = 100)),
                 outcome = floor(runif(100, min  =  0, max =   1.5))
)

rcs_dt <- rms::datadist(dt)
options(datadist = 'rcs_dt')
rcs_model <- rms::lrm(outcome ~ rms::rcs(age, 3), data = dt)

rcs_dt$limits["Adjust to","age"] <- 40
rcs_model <- update(rcs_model)
fit <- rms::Predict(rcs_model, age, ref.zero = TRUE, fun = exp)
ggplot(fit, aes(age, yhat))

“age” is an integer variable, but in the fit output calculated by rms::Predict, “age” became a variable with decimals and N = 200. I think the restricted cubic spline is drawn from the fit output. How should I understand “age” in the fit output?

Besides, I want to obtain the odds ratios at the knots (in this example, 3 knots at the 10th, 50th, and 90th percentiles), as in this paper (https://onlinelibrary.wiley.com/doi/10.1002/jcsm.12958). How can I obtain these odds ratios?

Thank you very much.

Please remove rms:: from your code and instead run require(rms) or library(rms) at the top.

Predict is considering your age variable to be continuous and is interpolating over a default grid of 200 values between the 10th smallest and 10th largest observed ages. If you want predictions only at integer ages, do something like Predict(fit, age=20:80).
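For instance, with a made-up dataset (variable names here are illustrative, not the poster's), an explicit age vector replaces the default interpolation grid:

```r
library(rms)

set.seed(1234)
d <- data.frame(age = sample(40:99, 300, replace = TRUE))
d$outcome <- rbinom(300, 1, 0.3)
dd <- datadist(d); options(datadist = 'dd')

f <- lrm(outcome ~ rcs(age, 3), data = d)

p_default <- Predict(f)               # default grid of interpolated age values
p_integer <- Predict(f, age = 45:95)  # predictions only at the requested integer ages
</imports>
```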

I wouldn’t compute odds ratios at knots but rather at values that you pre-specify using code such as

summary(fit, age=c(50, 20))   # compare age 50 to age 20

Many thanks for your kind guidance Dr. Harrell.

So the Predict(fit) output contains interpolated values rather than the original values. That solved my problem, and I will try the code you taught me.

I would word it this way: summary, Predict, predict, and contrast are almost always used to get predictions, or differences in predictions, at user-specified points for the covariate. These points may or may not be found in the actual data.

Thank you for your further kind guidance.
I want to confirm: that is to say, it is good to use the restricted cubic spline to observe the dose-response relationship, but there is no need to be fixated on the odds ratios (or other estimates) at particular values in the actual data.

In this paper (https://onlinelibrary.wiley.com/doi/10.1002/jcsm.12958), the authors showed the hazard ratios at the 10th and 90th percentiles of the restricted cubic spline. I wanted to imitate their approach, which is why I asked the above question. But the hazard ratios (or other estimates) are not always necessary. Am I understanding correctly?


Yes. Choice of percentiles for getting effect ratios is pretty arbitrary.

Thank you for your kind guidance and I learned a lot from you.

Dear prof,
I would like to know why we can’t use ‘rms::’. I have been using the functions this way. Is that bad coding practice?

It’s not bad if you use rms:: once in your script, say if you are using a single rms fitting function. If you are using functions like rcs it’s not good practice and may conflict with rms functionality.


Dear prof,
How can I calculate a 95 percent confidence interval for the c-index and for the optimism-corrected c-index?

The first is easy although very approximate, using the Hmisc rcorr.cens function’s standard error estimate. The second is hard and has been discussed in another RMS topic.
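A rough sketch of the first approach on simulated data (the outcome and predictor here are hypothetical): rcorr.cens reports an S.D. for Dxy, and since C = Dxy/2 + 1/2, the SE of C is half of that.

```r
library(Hmisc)

set.seed(3)
n  <- 400
lp <- rnorm(n)                      # hypothetical linear predictor
y  <- rbinom(n, 1, plogis(2 * lp))  # binary outcome

r  <- rcorr.cens(lp, y)
C  <- r['C Index']
se <- r['S.D.'] / 2                 # reported S.D. is for Dxy; C = Dxy/2 + 1/2
ci <- C + c(-1, 1) * qnorm(0.975) * se
round(c(C = unname(C), lower = ci[1], upper = ci[2]), 3)
```

This is the "very approximate" normal-theory interval; it ignores, among other things, the skewness of the sampling distribution of C near its boundaries.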


Dear Prof. Harrell,

Thanks so much for the “rms” package, it is simply fantastic (and I wish I had used it earlier)! However, there is something I can’t get my head around, and it seems this topic has created some confusion among others, too (e.g. on stackexchange). I use a dummy example (code below) to illustrate. The example has only 3 variables: event (0/1), time (time-to-event), riskfactor (in {1, 2, 3, 4, 5}) as explanatory covariate. Note: for simplicity, I treat riskfactor as continuous (despite it taking only 5 values) so I can fit restricted cubic splines. I set the reference value arbitrarily at 4.12.

Here are some initial considerations before getting to my question.

Logistic model: fitlrm_rms <- lrm(event ~ riskfactor, data = d)
a. Plotting Predict(fitlrm_rms, …, ref.zero = F) shows the linear predictor, i.e., alpha + beta*riskfactor, which is the log-odds.

b. Plotting Predict(fitlrm_rms, …, ref.zero = T) shows beta (log odds ratio) only, adjusted to the chosen reference value, such that the confidence interval collapses there.

Note: at least in the rms version I use, the y-axis label for b. above states log odds instead of log odds ratio. Looks like a labelling mistake?

Cox model without restricted cubic splines: cphout <- cph(Surv(time, event) ~ riskfactor, data = d)
The model is h(t) = h_0(t)*exp(beta*riskfactor).

a. Plotting Predict(cphout, …, ref.zero = F) shows beta (log hazard ratio), with reference value = mean(riskfactor). The CI collapses (zero width) there. The y label states “log Relative Hazard”, i.e. log(HR) [HR = Hazard Ratio]

b. Plotting Predict(cphout, …, ref.zero = T) shows the beta, yet it will adjust it to the chosen reference value (4.12), and the CI collapses there (zero width).

My question:

Cox model with restricted cubic splines: cphspl <- cph(Surv(time, event) ~ rcs(riskfactor, 3), data = d)
The model is h(t) = h_0(t)*exp(f(riskfactor)), where f is a restricted cubic spline. Let me reverse the order of a. and b. here since I think I know what b. shows.

b. Plotting Predict(cphspl, …, ref.zero = T) shows f(riskfactor), adjusted to the chosen reference value (4.12), and the CI collapses there (zero width). The y label states “log Relative Hazard”, i.e. log(HR).
a. Plotting Predict(cphspl, …, ref.zero = F). I am not clear on what this shows. The label also states “log Relative Hazard”, i.e., again a log(HR)? I have read in other posts here that you refer to this as the “linear predictor”. This makes me think it should be f(riskfactor), yet with a different (which default?) reference value. What surprises me, though, is that the CI doesn’t collapse anywhere. I can’t get my head around this, since there will be some riskfactor_0 such that f(riskfactor_0) = 0. Wouldn’t that imply log(h(t; riskfactor_0)) = log(h_0(t)) + log(1) = log(h_0(t)), and that shouldn’t have any uncertainty around it? Heinzl & Kaider, who developed SAS code for restricted cubic splines years ago, also refer to this fact (no uncertainty there; see Section 3.1 in Heinzl H, Kaider A. Gaining more flexibility in Cox proportional hazards regression models with cubic spline functions. Comput Methods Programs Biomed. 1997 Nov;54(3):201-8. doi: 10.1016/s0169-2607(97)00043-6. PMID: 9421665).

I truly appreciate your help to resolve my confusion. Thanks a lot already; I posted the code below.

Simon

# -------------------------------------------------------
# load libraries, define auxiliary function
# -------------------------------------------------------
library(rms)
library(gridExtra)

# ----------------------------------------------------------------------------------------
# PART I - logistic model
# ----------------------------------------------------------------------------------------


# -------------------------------------------------------
# generate data
# set seed; define sample size; 
# generate a prognostic risk factor (handled as linear
# despite having only 5 categories, this is for spline)
# simulate, event, time, create a data frame (name: d)
# -------------------------------------------------------
set.seed(1)
n <- 1000
riskfactor <- sample(c(1, 2, 3, 4, 5), size = n, replace = T, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
event <- rep(c(0, 1, 1, 1), each = n/4)
time <- sapply(riskfactor, 
               FUN = function(x){
                 rexp(1, 0.2*(x == 1) + 0.4*(x == 2) + 0.8*(x == 3) + 1.2*(x == 4) + 1.8*(x == 5))})

d <- data.frame(event = event, 
                riskfactor = riskfactor, 
                time = time)


# -------------------------------------------------------
# preparation for using rms package
# -------------------------------------------------------
drms <- datadist(d) 
options(datadist='drms')
refval <- 4.12
predrf <- seq(1, 5, by = 0.05)
drms$limits["Adjust to","riskfactor"] <- refval # this is an arbitrary value, just to illustrate



# -------------------------------------------------------
# show fitted/predicted cruve for both, log-odds and 
# log odds ratio
# -------------------------------------------------------

# log-odds
fitlrm_rms <- lrm(event ~ riskfactor, data = d)
predlodds_rms <- Predict(fitlrm_rms, riskfactor = predrf, ref.zero = FALSE)

# log-OR
predlor_rms <- Predict(fitlrm_rms, riskfactor = predrf, ref.zero = TRUE)

# plots - note plrm2 y-axis label seems wrong (log odds instead of log odds ratio)
plrm1 <- plot(predlodds_rms)
plrm2 <- plot(predlor_rms)
grid.arrange(plrm1, plrm2, nrow = 1)

# ----------------------------------------------------------------------------------------
# End of PART I - logistic model
# ----------------------------------------------------------------------------------------



# ----------------------------------------------------------------------------------------
# PART II - Cox model (without / with restricted cubic splines)
# ----------------------------------------------------------------------------------------
# Cox model with coxph
# here, when predicting the linear term, the log(HR) is relative to the 
# *mean* of riskfactor, if ref.zero = F
cphout <- cph(Surv(time, event) ~ riskfactor, data = d)

predcph <- Predict(cphout, ref.zero = F)
predcphref <- Predict(cphout, ref.zero = T)

# plots 
pcph1 <- plot(predcph)
pcph2 <- plot(predcphref)
grid.arrange(pcph1, pcph2, nrow = 1)



# final step: a more flexible model (restricted cubic spline)
# note that the CI doesn't collapse when ref.zero = F
cphspl <- cph(Surv(time, event) ~ rcs(riskfactor, 3), data = d)
pcphrsc1 <- plot(Predict(cphspl, riskfactor = seq(3, 5, by = 0.01), ref.zero = F),
           ylim = c(-0.8, 1))
pcphrsc2 <- plot(Predict(cphspl, riskfactor = seq(3, 5, by = 0.01), ref.zero = T),
           ylim = c(-0.8, 1))
grid.arrange(pcphrsc1, pcphrsc2, nrow = 1)