A random forest of classification trees was fit to the data. Using type="prob" in the predict function requests a matrix of class probabilities. The two classes are 0 (alive) and 1 (dead), so the function returns a matrix with one row per subject and two columns of class probabilities: the first column is the probability of class 0 (alive) and the second is the probability of class 1 (dead). The [,2] operator requests that only the second column be kept, since we are interested in the predicted probabilities of death. As Frank noted above, val.prob requires a vector (not a matrix). If you don't extract the desired column of predicted probabilities, val.prob won't work as anticipated.
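For illustration, here is a minimal sketch of that extraction (the object names rf_fit, test, and y are hypothetical stand-ins, not from the original post):

# rf_fit: a randomForest classification fit; test: new data; y: observed 0/1 outcomes
p <- predict(rf_fit, newdata = test, type = "prob")  # n x 2 matrix of class probabilities
head(p)            # columns "0" (alive) and "1" (dead)
phat <- p[, 2]     # keep only the P(dead) column; now a plain vector
val.prob(phat, y)  # val.prob receives the vector it expects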
You omitted the [,2] from the end of the predict statement (see my later response below). You need to extract the desired column from the prediction matrix.
Welcome to the RMS short course 2021.
This datamethods.org discussion will be one of the places participants can pose questions that come up in the RMS short course (RMSsc2021) that do not get addressed or satisfied in the course discussion or the Zoom chat.
Comments, impressions, and advice are also welcome here. Comments and questions can also be posted in the Google doc devoted to RMSsc2021 (the link to that Google doc is in the RMSsc2021 welcome email, "Access Information for 4-day RMS Course," that Frank sent on May 9). There is lots of activity there already.
Comments and questions can also be shared on Twitter at #rmscourse.
Frank welcomes vigorous discussion, and questions that express confusion, uncertainty, or curiosity that others might also have are especially welcome.
Enjoy RMSsc2021!!!
Here is the first question of the RMS 2021 short course.
I will reprise the initial question from RMSsc2020: Is it fair to say that the overarching objectives of regression models in general (and RMS in particular) are (1) to develop models that make accurate predictions of responses for future observations, (2) to properly estimate and represent uncertainty, and (3) to support valid inference (whatever that actually is)? And that everything else follows from these as necessary details?
I will augment this initial prologue-impersonating-a-question, suggesting that ‘what valid inference is’ may be an especially timely and vigorous topic in RMSsc2021, as methods that focus on robust posterior probability distributions, and how they can be used, are becoming increasingly prevalent and emphasized. Is it too soon to assert that ‘support valid inference’ should be construed as interpreting a posterior predictive probability distribution?
Day 1 of the 2021 RMS short course was fantastic. Thank you Frank and Drew!
I had a question about the rcs function in the rms package. In the past when I wanted to use regression splines I used the ns function in the splines package that ships with base R. This evening I thought I would compare the rcs and ns functions and see if they resulted in similar model coefficients when used with ols and lm, respectively. They don't, not even close. I know the coefficients themselves are not interpretable and not of interest, but I was curious why they're so different. I read the help pages for both functions and just don't have the background or expertise to figure out how they differ. In the code below they both return identical model fits, as seen in the plot, but the coefficients are wildly different. I'm guessing it's due to some mathematical details that I probably shouldn't worry about, but still thought I would ask.
Here’s my toy example.
# generate data
set.seed(1)
x <- sort(runif(100, 0, 4))
y <- 0.3 + cos(x*pi) + rnorm(100, sd = 0.5)
# model with rms::rcs
library(rms)
m1 <- ols(y ~ rcs(x, 5))
coef(m1)
# Intercept x x' x'' x'''
# 1.694149 -2.681746 17.557837 -88.983558 117.812345
# model with splines::ns
library(splines)
# get 5 knot quantiles from RMS book (p. 27)
kn <- quantile(x, probs = c(.05,.275,.5,.725,.95))
# have to specify interior and boundary knots separately in ns (!)
x_ns <- ns(x, knots = kn[2:4], Boundary.knots = c(kn[1],kn[5]))
m2 <- lm(y ~ x_ns)
coef(m2)
# (Intercept) x_ns1 x_ns2 x_ns3 x_ns4
# 0.7977105 2.1276917 -1.9824331 -3.0829379 0.9461600
# fitted lines are identical; red line overplots the first black line
plot(x,y)
lines(x, fitted(m1), col = 1)
lines(x, fitted(m2), col = 2)
The coefficients are not comparable since the two functions use different bases (truncated power basis for rcs, B-spline basis for ns). You need to compare the fits instead. This is done in General Multi-parameter Transformations in rms.
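For example, one quick numeric check, using the m1 and m2 fits from the code above:

# Same knots => same space of natural cubic splines => the least-squares
# fits must agree, even though the bases (and hence coefficients) differ
max(abs(fitted(m1) - fitted(m2)))  # should be ~0, up to floating-point noise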
Still chewing on a topic from day 1. Yesterday, in response to a question, Frank briefly mentioned risk-based decision making. I may have misunderstood the discussion, but I believe the question was in the context of using predicted probabilities of events from a well-calibrated model to inform clinical decisions, such as how to treat a patient with a specific set of characteristics. My questions are:
- Using the predicted probabilities from a model to assign individuals to a discrete number of treatments almost seems like unintentionally categorizing your predicted probabilities. Is this a misunderstanding?
- Is it ever appropriate to use ranges of the predicted probability of the event to help assign patients to one of a series of treatments, for example if the treatments are ordinal in severity? E.g., if the predicted probability of the event is >0.9, the patient would receive the most aggressive treatment, or something of that nature. If not, how can this be avoided while still using the information from the model to help choose a treatment?
- Does anyone have any good references on risk-based decision making, or best-practice examples of using a predictive model to inform clinical practice, to help me clear this up?
Thanks!
Take a look at Section 19.3 of http://hbiostat.org/doc/bbr.pdf and especially the links listed in that section, starting with this. This is not categorization but rather a translation of risk into an action. In the absence of a utility function, the usual approach is for the physician to give the patient the best estimate of the risk of the outcome under the two treatment options, and to let the patient, with some assistance from the physician, decide which course is best given those risks. Even though the utility function is not enunciated, it's still there.
I haven’t seen an article about translating probabilities of various severities of outcomes into multiple possible actions. I hope someone can point us to such an article.
Greetings all, following Day 2 of the RMS short course, I am attempting to apply the aregImpute function to a large dataset I am working with. Below is the syntax I am using for that command, followed by the error and warning I receive when running it. I welcome any help in identifying the problem. The REC_DR_MM_EQUIV_TX variable is categorical with 3 levels (0, 1, 2).
Thanks to everyone in advance.
a <- aregImpute(~ OUT_MALIG + REC_DR_MM_EQUIV_TX + CAN_GENDER + REC_HGT_CM + REC_CREAT + REC_AGE_AT_TX + TX_YEAR + REC_DRUG_INDUCTION + REC_HTN + CAN_HIST_CIGARETTE + REC_RACE_WHITE + REC_WGT_KG + DON_ANTI_CMV_POS + REC_CMV_IGG_POS + HF_ETIOL + REC_EBV_STAT_POS + REC_HBV_ANTIBODY_POS + REC_MALIG + REC_DM, data=data_2, nk=4, n.impute = 5, pr=FALSE)
Error in xdf[j] <- nux[j] - 1 :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In areg(X[s, ], xf[s, i], xtype = vtype[-i], ytype = ytype, nk = min(nk), :
the following x variables have too few unique values for the value of nk;
linearity forced: REC_DR_MM_EQUIV_TX
Thank you Frank and Drew for the excellent course! I am wondering if you can help point me in the right direction. I have a manufacturing application where we want to reduce the rate of an event by installing a new process. The event cannot be directly observed, but we can assume it happened when the response variable is very low. So we will have skewed distributions with a few “defects” in the left tails. I want to avoid the temptation to label “outliers” as events, followed by some comparison of proportions. Which of the methods described in class do you think I should explore for a comparison of left tails? Thank you!
Start with nk=0 (forcing linearity on all variables), and if that works try nk=3, then nk=4. If none of those succeed, remove one variable at a time until it works with nk=4. This may help you find a culprit variable.
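A minimal sketch of that sequence (formula abbreviated here; use the full one from the question above):

# Try nk = 0, 3, 4 in turn, keeping the largest value that succeeds
for (k in c(0, 3, 4)) {
  fit <- try(aregImpute(~ OUT_MALIG + REC_DR_MM_EQUIV_TX + CAN_GENDER + REC_CREAT,
                        data = data_2, nk = k, n.impute = 5, pr = FALSE))
  if (inherits(fit, "try-error")) break  # the previous k was the largest that worked
  a <- fit
}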
If the following doesn’t look helpful, paste in a high-resolution histogram of the response variable and let’s take a look. I think that semiparametric regression models such as the proportional odds ordinal logistic regression model are likely to work in your situation.
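As a rough illustration of that suggestion, here is a sketch with simulated data standing in for the manufacturing setting (orm is the ordinal regression function in rms; the variable names and data-generating details are invented):

library(rms)
set.seed(2)
# Skewed response with a few very low "defect" values in the old process
process <- factor(rep(c("old", "new"), each = 200))
y <- rgamma(400, shape = 4, scale = 2)
defect <- process == "old" & runif(400) < 0.05
y[defect] <- y[defect] - runif(sum(defect), 4, 8)
# The PO model treats the continuous y as ordinal, so the left tails are
# compared without ever having to label individual points as "outliers"
f <- orm(y ~ process)
f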
Thank you! It turned out the problem was the first variable (REC_DR_MM_EQUIV_TX), which is ordinal with values of 0, 1, and 2. Is there a special way of coding this type of variable? Would changing it to "None", "One", "Two" be preferred?
It depends on whether they were factor variables, which could cause a small cell sample size problem, or numeric. You may have to make the variable numeric and force it to be linear using I(x).
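A minimal sketch of that (formula abbreviated again; the conversion assumes the variable is currently a factor with levels "0", "1", "2"):

# Convert the factor codes to numeric, then restrict to linearity with I()
data_2$REC_DR_MM_EQUIV_TX <- as.numeric(as.character(data_2$REC_DR_MM_EQUIV_TX))
a <- aregImpute(~ OUT_MALIG + I(REC_DR_MM_EQUIV_TX) + CAN_GENDER + REC_CREAT,
                data = data_2, nk = 4, n.impute = 5, pr = FALSE)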
Now, after using the aregImpute and fit.mult.impute functions (which both executed successfully), are we able to use the validate and calibrate functions? When I attempt to run v <- validate(f, B = 200) on the fit from fit.mult.impute, I get the error: "Error in validate.lrm(f, B = 200) : fit did not use x=TRUE,y=TRUE".
Can bootstrapping be done with the validate and calibrate functions after multiple imputation with aregImpute?
Thanks
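For what it's worth, the error message itself points to the mechanical prerequisite: the fit must store the design matrix and response. A minimal sketch, assuming a binary logistic model and the objects from the earlier post (formula abbreviated):

# x=TRUE and y=TRUE are passed through fit.mult.impute to lrm, so the
# resulting fit retains the design matrix and response that the
# resampling in validate() and calibrate() requires
f <- fit.mult.impute(OUT_MALIG ~ REC_AGE_AT_TX + CAN_GENDER, lrm,
                     xtrans = a, data = data_2, x = TRUE, y = TRUE)
v <- validate(f, B = 200)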
Dear RMS2021-short_course:
Apropos of the questions and discussion concerning (i) the utility of DAGs (and structural causal models more generally) and (ii) the utility of simulations for understanding your analysis, see "Reflection on modern methods: understanding bias and data analytical strategies through DAG-based data simulations," International Journal of Epidemiology, 2021; https://doi.org/10.1093/ije/dyab096. "DAG-based data simulations can be used for systematic comparisons of data analytical strategies or for the evaluation of the role of uncertainties in a hypothesized DAG structure. … An important advantage of this approach is that harmful adjustment sets, that is those that introduce rather than reduce bias, can be identified in a straightforward manner using DAGs, even for some scenarios that are difficult to deal with using other approaches."
I know @f2harrell has said something about ordinal outcomes and repeated measures, and even advocated for proportional odds and praised a paper that used a Bayesian PO approach (an RCT with repeated measures, I believe). I know this appeared on datamethods, but I'm struggling to find any of it. Can anyone remember? Thanks
I'm hoping to get a bootstrapped confidence interval for the c-index using validate(). I've used the method here that Frank suggests, though it is very slow because the model needs to be fit B × reps times. Is there a faster way to do this?
This is how I’m doing it currently:
require(rms)
# Bootstrap a CI for the bias-corrected Dxy: refit the model on `reps`
# bootstrap samples, run validate() (itself B resamples) on each refit,
# and take the 2.5th and 97.5th percentiles of the corrected Dxy.
# Note: c-index = (Dxy + 1) / 2, so the interval converts directly.
DxyCI <- function(f, data, B, reps) {
  n <- nrow(data)
  dxy <- numeric(reps)
  for (i in 1:reps) {
    g <- update(f, subset = sample(1:n, n, replace = TRUE))  # refit on a bootstrap sample
    v <- validate(g, B = B)
    dxy[i] <- v['Dxy', 'index.corrected']
  }
  quantile(dxy, c(.025, .975))
}

# Simulated example
N <- 100
reps <- 100
B <- 100
data <- data.frame(Y  = rep(c(0, 1), N / 2),
                   X1 = rnorm(N),
                   X2 = rnorm(N))
f <- lrm(Y ~ X1 + X2, data = data, x = TRUE, y = TRUE)
DxyCI(f, data, B, reps)
I'm wondering if it's possible to replace the v <- validate(g, B=B) line with something like Predict(g, data = data), then extract just a single optimism-corrected Dxy for each rep, and take the quantiles of those. I have done something like that using other packages, but would like to stick to rms if possible. The problem is that I don't think Predict() has an option to change the data on which the predictions are being made.
If this question would be better off in a different thread I can move it.