RMS Discussions

Drew_Levy · May 10, 2021, 1:41pm

Welcome to the RMS short course 2021.
This datamethods.org discussion will be one of the places participants can pose questions that come up in the RMS short course (RMSsc2021) that do not get addressed or satisfied in the course discussion or the Zoom chat.

Comments, impressions and advice are also welcome here. Comments and questions can also be posted in the Google doc devoted to RMSsc2021 (the link to that Google doc is in the RMSsc2021 “Access Information for 4-day RMS Course”-welcome email message that Frank sent on May 9.). Lots of activity there already.

Comments and questions can also be shared on Twitter at #rmscourse.

Frank welcomes vigorous discussion; and questions that express confusion or uncertainty or curiosity that others might also have are especially welcome.

Enjoy RMSsc2021!!!

Drew_Levy · May 10, 2021, 2:15pm

Here is the first question of the RMS 2021 short course.

I will reprise the initial question from RMSsc2020: Is it fair to say that the overarching objectives of regression models in general (and RMS in particular) are (1) to develop models that will make accurate predictions of responses for future observations, (2) to properly estimate and represent uncertainty, and (3) support valid inference (whatever that actually is)? And that everything else follows from these and are the necessary details?

I will augment this initial prologue-impersonating-a-question, suggesting that ‘what valid inference is’ may be an especially timely and vigorous topic in RMSsc2021, as methods that focus on robust posterior probability distributions, and how they can be used, are becoming increasingly prevalent and emphasized. Is it too soon to assert that ‘support valid inference’ should be construed as interpreting a posterior predictive probability distribution?

Clay_Ford · May 12, 2021, 1:34pm

Day 1 of the 2021 RMS short course was fantastic. Thank you Frank and Drew!

I had a question about the rcs function in the rms package. In the past when I wanted to use regression splines I used the ns function in the splines package that ships with base R. This evening I thought I would compare the rcs and ns functions and see if they resulted in similar model coefficients when used with ols and lm, respectively. They don’t, not even close. I know the coefficients themselves are not interpretable and not of interest, but was curious why they’re so different. I read the help pages for both functions and just don’t have the background or expertise to figure out how they differ. In the code below they both return identical model fits as seen in the plot, but the coefficients are wildly different. I’m guessing it’s due to some mathematical details that I probably shouldn’t worry about, but still thought I would ask.

Here’s my toy example.

# generate data
set.seed(1)
x <- sort(runif(100, 0, 4))
y <- 0.3 + cos(x*pi) + rnorm(100, sd = 0.5)

# model with rms::rcs
library(rms)
m1 <- ols(y ~ rcs(x, 5))
coef(m1)
#  Intercept          x         x'        x''       x''' 
#   1.694149  -2.681746  17.557837 -88.983558 117.812345 

# model with splines::ns
library(splines)
# get 5 knot quantiles from RMS book (p. 27)
kn <- quantile(x, probs = c(.05,.275,.5,.725,.95))
# have to specify interior and boundary knots separately in ns (!)
x_ns <- ns(x, knots = kn[2:4], Boundary.knots = c(kn[1],kn[5]))
m2 <- lm(y ~ x_ns)
coef(m2)

# (Intercept)       x_ns1       x_ns2       x_ns3       x_ns4 
#  0.7977105   2.1276917  -1.9824331  -3.0829379   0.9461600 

# fitted lines are identical; red line overplots the first black line
plot(x,y)
lines(x, fitted(m1), col = 1)
lines(x, fitted(m2), col = 2)

f2harrell · May 12, 2021, 1:44pm

The coefficients are not comparable since they use different bases (truncated power basis for rcs, B-spline for ns). Need to compare fits. This is done in General Multi-parameter Transformations in rms

ndelrocco · May 12, 2021, 9:09pm

Still chewing on a topic from day 1. Yesterday in response to a question, Frank briefly mentioned risk based decision making. I may have misunderstood the discussion, but I believe the question was in the context of using predicted probabilities of events from a well calibrated model to then inform clinical decisions, such as how to treat a patient with a specific set of characteristics. My questions are:

Using the predicted probabilities from a model to assign individuals to a discrete number of treatments almost seems like unintentionally categorizing your predicted probabilities. Is this a misunderstanding?
Is it ever appropriate to use ranges for the predicted probability of event to help assign patients to one of a series of treatments, for example, if the treatments are ordinal in severity? i.e. if the predicted probability of event is >0.9, they would receive the most aggressive treatment or something of that nature. If not, how can this be avoided while still using the information from the model to help choose a treatment?
Does anyone have any good references on risk based decision making or best practice examples of using a predictive model to inform clinical practice to help me clear this up?

Thanks!

f2harrell · May 13, 2021, 1:48am

Take a look at Section 19.3 of http://hbiostat.org/doc/bbr.pdf and especially the links that are listed in that section, starting with this. This is not categorization but instead is a translation of risk into an action. In the absence of a utility function the usual approach is for the physician to give the patient the best estimate of risk of an outcome under two treatment options, and to let the patient with some assistance from the physician decide if risks are the best course. Even though the utility function is not enunciated, it’s still there.

I haven’t seen an article about translating probabilities of various severities of outcomes into multiple possible actions. I hope someone can point us to such an article.

wbaker0621 · May 14, 2021, 5:51pm

Greetings All, following Day 2 of the RMS Short Course, I am attempting to implement the aregImpute function for a large dataset I am working with. Below is the syntax I am using for that command. Below it is the warning I am receiving when trying to implement this code. I welcome any help in identifying the challenge. The “REC_DR_MM_EQUIV_TX” variable is categorical with 3 levels (0, 1, 2).

Thanks to everyone in advance.

a ← aregImpute(~ OUT_MALIG + REC_DR_MM_EQUIV_TX + CAN_GENDER + REC_HGT_CM + REC_CREAT + REC_AGE_AT_TX + TX_YEAR + REC_DRUG_INDUCTION + REC_HTN + CAN_HIST_CIGARETTE + REC_RACE_WHITE + REC_WGT_KG + DON_ANTI_CMV_POS + REC_CMV_IGG_POS + HF_ETIOL + REC_EBV_STAT_POS + REC_HBV_ANTIBODY_POS + REC_MALIG + REC_DM, data=data_2, nk=4, n.impute = 5, pr=FALSE)

Error in xdf[j] ← nux[j] - 1 :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In areg(X[s, ], xf[s, i], xtype = vtype[-i], ytype = ytype, nk = min(nk), :
the following x variables have too few unique values for the value of nk;
linearity forced: REC_DR_MM_EQUIV_TX

Robert_Liehr · May 14, 2021, 6:51pm

Thank you Frank and Drew for the excellent course! I am wondering if you can help point me in the right direction. I have a manufacturing application where we want to reduce the rate of an event by installing a new process. The event cannot be directly observed, but we can assume it happened when the response variable is very low. So we will have skewed distributions with a few “defects” in the left tails. I want to avoid the temptation to label “outliers” as events, followed by some comparison of proportions. Which of the methods described in class do you think I should explore for a comparison of left tails? Thank you!

f2harrell · May 15, 2021, 12:24pm

Start with nk=0 (force linearity on all variables and if that works try nk=3, then nk=4. If None of those succeed, remove one variable at a time until it works with nk=4. This may help you find a culprit variable.

f2harrell · May 15, 2021, 12:26pm

If the following doesn’t look helpful, paste in a high-resolution histogram of the response variable and let’s take a look. I think that semiparametric regression models such as the proportional odds ordinal logistic regression model are likely to work in your situation.

wbaker0621 · May 15, 2021, 12:59pm

Thank you! Turned out the problem was the first variable (REC_DR_MM_EQUIV_TX) that is ordinal with values of 0, 1, and 2. Is there a special way of coding this type of variable? Would changing it to “None”, “One”, “Two” be preferred?

f2harrell · May 15, 2021, 1:33pm

It depends on whether they were factor variables, which could cause a small cell sample size problem, or they were numeric. May have to make numeric and force to be linear using I(x).

wbaker0621 · May 15, 2021, 2:44pm

Now, after using the ‘aregImpute’ and ‘fit.mult.impute’ functions (which both executed successfully), are we able to use the validate and calibrate functions? When I attempt to execute “v ← validate(f, B = 200)” on the dataset following the ‘fit.mult.impute’, I am getting an error stating: “Error in validate.lrm(f, B = 200) : fit did not use x=TRUE,y=TRUE”.

Can boostrapping be done using the validate and calibrate functions after MI using aregImpute?

Thanks

f2harrell · May 17, 2021, 12:03pm

Some customizations are needed to do that.

Updated resources here

Drew_Levy · May 17, 2021, 8:30pm

Dear RMS2021-short_course:
Apropos of the questions and discussion concerning either (i) the utility of DAGS (and structural causal models, more generally), and (ii) the utility of simulations for understanding your analysis, see Reflection on modern methods: understanding bias and data analytical strategies through DAG-based data simulations, International Journal of Epidemiology, 2021; https://doi.org/10.1093/ije/dyab096. “DAG-based data simulations can be used for systematic comparisons of data analytical strategies or for the evaluation of the role of uncertainties in a hypothesized DAG structure. … An important advantage of this approach is that harmful adjustment sets, that is those that introduce rather than reduce bias, can be identified in a straightforward manner using DAGs, even for some scenarios that are difficult to deal with using other approaches.”

pmbrown · May 20, 2021, 12:11pm

i know @f2harrell has said something about ordinal outcomes and repeated measures, and even advocated for proportional odds and praised a paper that used a bayesian PO approach (an RCT with repeated measures i believe) - i know this appeared on datamethods, but i’m struggling to find any of this. Can anyone remember? thanks

f2harrell · May 20, 2021, 3:25pm

My resources are here and here.

James_Smith · May 21, 2021, 9:03am

I’m hoping to get a bootstrapped confidence interval for the c-index using validate(). I’ve used the method here that Frank suggests, though it is very slow because the model needs to be fit B x reps times. Is there a faster way to do this?

This is how I’m doing it currently:

require(rms)

DxyCI <- function (f, data, B, reps){
  n <- nrow(data)
  dxy <- numeric(reps) 
  for(i in 1:reps) {
    g <- update(f, subset=sample(1:n, n, replace=TRUE))
    v <- validate(g, B=B)
    dxy[i] <- v['Dxy', 'index.corrected']
  }
  quantile(dxy, c(.025, .975)) 
}

N <- 100
reps <- 100
B <- 100

data <- data.frame(Y = rep(c(0,1), N/2),
                   X1 = rnorm(N),
                   X2 = rnorm(N))

f <- lrm(Y ~ X1 + X2, data = data, x = T, y = T)

DxyCI(f, data, B, reps)

I’m wondering if it’s possible to replace the v <- validate(g, B=B) line with something like: Predict(g, data = data) and then extract just a single optimism corrected Dxy for each reps, and take the quantiles of those. I have done something like that using other packages, but would like to stick to RMS if possible. The problem is that I don’t think Predict() has an option to change the data on which the predictions are being made.

If this question would be better off in a different thread I can move it.

pmbrown · May 22, 2021, 11:54am

many thanks. I believe this is the study i was searching for:

wbaker0621 · May 28, 2021, 2:36pm

I am curious on folks’ thoughts surrounding use of restricted cubic splines for modeling an outcome such as time within therapeutic range. This is a measure ranging from 0% to 100%. In the project I am working on, there is a fair amount of information at the tails (0% and 100% TTR). See density plot below. If I modeled this variable using a RCS with 4 or 5 knots, do you all think it would behave appropriately? As a non-statistician I value everyone’s input. Thanks in advance.