RMS Discussions

A random forest of classification trees was fit to the data. When using ‘type=“prob”’ in the predict function, we are requesting a matrix of class probabilities. The two classes are 0 (alive) and 1 (dead). The function produces a matrix with one row per subject and two columns of class probabilities. The first column is for the probability of class 0 (alive) and the second column is for the probability of class 1 (dead). Using the [,2] operator request that only the second column be kept, as we are interested in the predicted probabilities of death. As Frank noted above, val.prob requires a vector (not a matrix). If you don’t extract the desired column of predicted probabilities, val.prob won’t work as anticipated.

1 Like

You omitted the [,2] from the end of the predict statement (see my later response below). You need to extract the desired column from the prediction matrix.


Welcome to the RMS short course 2021.
This datamethods.org discussion will be one of the places participants can pose questions that come up in the RMS short course (RMSsc2021) that do not get addressed or satisfied in the course discussion or the Zoom chat.

Comments, impressions and advice are also welcome here. Comments and questions can also be posted in the Google doc devoted to RMSsc2021 (the link to that Google doc is in the RMSsc2021 “Access Information for 4-day RMS Course”-welcome email message that Frank sent on May 9.). Lots of activity there already.

Comments and questions can also be shared on Twitter at #rmscourse.

Frank welcomes vigorous discussion; and questions that express confusion or uncertainty or curiosity that others might also have are especially welcome.

Enjoy RMSsc2021!!!


Here is the first question of the RMS 2021 short course.

I will reprise the initial question from RMSsc2020: Is it fair to say that the overarching objectives of regression models in general (and RMS in particular) are (1) to develop models that will make accurate predictions of responses for future observations, (2) to properly estimate and represent uncertainty, and (3) support valid inference (whatever that actually is)? And that everything else follows from these and are the necessary details?

I will augment this initial prologue-impersonating-a-question, suggesting that ‘what valid inference is’ may be an especially timely and vigorous topic in RMSsc2021, as methods that focus on robust posterior probability distributions, and how they can be used, are becoming increasingly prevalent and emphasized. Is it too soon to assert that ‘support valid inference’ should be construed as interpreting a posterior predictive probability distribution?

1 Like

Day 1 of the 2021 RMS short course was fantastic. Thank you Frank and Drew!

I had a question about the rcs function in the rms package. In the past when I wanted to use regression splines I used the ns function in the splines package that ships with base R. This evening I thought I would compare the rcs and ns functions and see if they resulted in similar model coefficients when used with ols and lm, respectively. They don’t, not even close. I know the coefficients themselves are not interpretable and not of interest, but was curious why they’re so different. I read the help pages for both functions and just don’t have the background or expertise to figure out how they differ. In the code below they both return identical model fits as seen in the plot, but the coefficients are wildly different. I’m guessing it’s due to some mathematical details that I probably shouldn’t worry about, but still thought I would ask.

Here’s my toy example.

# generate data
x <- sort(runif(100, 0, 4))
y <- 0.3 + cos(x*pi) + rnorm(100, sd = 0.5)

# model with rms::rcs
m1 <- ols(y ~ rcs(x, 5))
#  Intercept          x         x'        x''       x''' 
#   1.694149  -2.681746  17.557837 -88.983558 117.812345 

# model with splines::ns
# get 5 knot quantiles from RMS book (p. 27)
kn <- quantile(x, probs = c(.05,.275,.5,.725,.95))
# have to specify interior and boundary knots separately in ns (!)
x_ns <- ns(x, knots = kn[2:4], Boundary.knots = c(kn[1],kn[5]))
m2 <- lm(y ~ x_ns)

# (Intercept)       x_ns1       x_ns2       x_ns3       x_ns4 
#  0.7977105   2.1276917  -1.9824331  -3.0829379   0.9461600 

# fitted lines are identical; red line overplots the first black line
lines(x, fitted(m1), col = 1)
lines(x, fitted(m2), col = 2)

The coefficients are not comparable since they use different bases (truncated power basis for rcs, B-spline for ns). Need to compare fits. This is done in General Multi-parameter Transformations in rms

1 Like

Still chewing on a topic from day 1. Yesterday in response to a question, Frank briefly mentioned risk based decision making. I may have misunderstood the discussion, but I believe the question was in the context of using predicted probabilities of events from a well calibrated model to then inform clinical decisions, such as how to treat a patient with a specific set of characteristics. My questions are:

  • Using the predicted probabilities from a model to assign individuals to a discrete number of treatments almost seems like unintentionally categorizing your predicted probabilities. Is this a misunderstanding?
  • Is it ever appropriate to use ranges for the predicted probability of event to help assign patients to one of a series of treatments, for example, if the treatments are ordinal in severity? i.e. if the predicted probability of event is >0.9, they would receive the most aggressive treatment or something of that nature. If not, how can this be avoided while still using the information from the model to help choose a treatment?
  • Does anyone have any good references on risk based decision making or best practice examples of using a predictive model to inform clinical practice to help me clear this up?


Take a look at Section 19.3 of http://hbiostat.org/doc/bbr.pdf and especially the links that are listed in that section, starting with this. This is not categorization but instead is a translation of risk into an action. In the absence of a utility function the usual approach is for the physician to give the patient the best estimate of risk of an outcome under two treatment options, and to let the patient with some assistance from the physician decide if risks are the best course. Even though the utility function is not enunciated, it’s still there.

I haven’t seen an article about translating probabilities of various severities of outcomes into multiple possible actions. I hope someone can point us to such an article.

Greetings All, following Day 2 of the RMS Short Course, I am attempting to implement the aregImpute function for a large dataset I am working with. Below is the syntax I am using for that command. Below it is the warning I am receiving when trying to implement this code. I welcome any help in identifying the challenge. The “REC_DR_MM_EQUIV_TX” variable is categorical with 3 levels (0, 1, 2).

Thanks to everyone in advance.


Error in xdf[j] ← nux[j] - 1 :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In areg(X[s, ], xf[s, i], xtype = vtype[-i], ytype = ytype, nk = min(nk), :
the following x variables have too few unique values for the value of nk;
linearity forced: REC_DR_MM_EQUIV_TX

Thank you Frank and Drew for the excellent course! I am wondering if you can help point me in the right direction. I have a manufacturing application where we want to reduce the rate of an event by installing a new process. The event cannot be directly observed, but we can assume it happened when the response variable is very low. So we will have skewed distributions with a few “defects” in the left tails. I want to avoid the temptation to label “outliers” as events, followed by some comparison of proportions. Which of the methods described in class do you think I should explore for a comparison of left tails? Thank you!

Start with nk=0 (force linearity on all variables and if that works try nk=3, then nk=4. If None of those succeed, remove one variable at a time until it works with nk=4. This may help you find a culprit variable.

If the following doesn’t look helpful, paste in a high-resolution histogram of the response variable and let’s take a look. I think that semiparametric regression models such as the proportional odds ordinal logistic regression model are likely to work in your situation.

Thank you! Turned out the problem was the first variable (REC_DR_MM_EQUIV_TX) that is ordinal with values of 0, 1, and 2. Is there a special way of coding this type of variable? Would changing it to “None”, “One”, “Two” be preferred?

It depends on whether they were factor variables, which could cause a small cell sample size problem, or they were numeric. May have to make numeric and force to be linear using I(x).

1 Like

Now, after using the ‘aregImpute’ and ‘fit.mult.impute’ functions (which both executed successfully), are we able to use the validate and calibrate functions? When I attempt to execute “v ← validate(f, B = 200)” on the dataset following the ‘fit.mult.impute’, I am getting an error stating: “Error in validate.lrm(f, B = 200) : fit did not use x=TRUE,y=TRUE”.

Can boostrapping be done using the validate and calibrate functions after MI using aregImpute?


1 Like

Some customizations are needed to do that.

:new: Updated resources here

Dear RMS2021-short_course:
Apropos of the questions and discussion concerning either (i) the utility of DAGS (and structural causal models, more generally), and (ii) the utility of simulations for understanding your analysis, see Reflection on modern methods: understanding bias and data analytical strategies through DAG-based data simulations, International Journal of Epidemiology, 2021; https://doi.org/10.1093/ije/dyab096. “DAG-based data simulations can be used for systematic comparisons of data analytical strategies or for the evaluation of the role of uncertainties in a hypothesized DAG structure. … An important advantage of this approach is that harmful adjustment sets, that is those that introduce rather than reduce bias, can be identified in a straightforward manner using DAGs, even for some scenarios that are difficult to deal with using other approaches.”


i know @f2harrell has said something about ordinal outcomes and repeated measures, and even advocated for proportional odds and praised a paper that used a bayesian PO approach (an RCT with repeated measures i believe) - i know this appeared on datamethods, but i’m struggling to find any of this. Can anyone remember? thanks

My resources are here and here.

I’m hoping to get a bootstrapped confidence interval for the c-index using validate(). I’ve used the method here that Frank suggests, though it is very slow because the model needs to be fit B x reps times. Is there a faster way to do this?

This is how I’m doing it currently:


DxyCI <- function (f, data, B, reps){
  n <- nrow(data)
  dxy <- numeric(reps) 
  for(i in 1:reps) {
    g <- update(f, subset=sample(1:n, n, replace=TRUE))
    v <- validate(g, B=B)
    dxy[i] <- v['Dxy', 'index.corrected']
  quantile(dxy, c(.025, .975)) 

N <- 100
reps <- 100
B <- 100

data <- data.frame(Y = rep(c(0,1), N/2),
                   X1 = rnorm(N),
                   X2 = rnorm(N))

f <- lrm(Y ~ X1 + X2, data = data, x = T, y = T)

DxyCI(f, data, B, reps)

I’m wondering if it’s possible to replace the v <- validate(g, B=B) line with something like: Predict(g, data = data) and then extract just a single optimism corrected Dxy for each reps, and take the quantiles of those. I have done something like that using other packages, but would like to stick to RMS if possible. The problem is that I don’t think Predict() has an option to change the data on which the predictions are being made.

If this question would be better off in a different thread I can move it.

1 Like