RMS Discussions

Some customizations are needed to do that.

:new: Updated resources here

Dear RMS2021-short_course:
Apropos of the questions and discussion concerning (i) the utility of DAGs (and structural causal models more generally), and (ii) the utility of simulations for understanding your analysis, see Reflection on modern methods: understanding bias and data analytical strategies through DAG-based data simulations, International Journal of Epidemiology, 2021; https://doi.org/10.1093/ije/dyab096. “DAG-based data simulations can be used for systematic comparisons of data analytical strategies or for the evaluation of the role of uncertainties in a hypothesized DAG structure. … An important advantage of this approach is that harmful adjustment sets, that is those that introduce rather than reduce bias, can be identified in a straightforward manner using DAGs, even for some scenarios that are difficult to deal with using other approaches.”
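As a minimal illustration of the idea (a sketch, not code from the paper; the DAG structure and coefficients below are assumed purely for illustration), you can simulate data from a simple collider DAG and see directly that a harmful adjustment set introduces bias:

set.seed(1)
n <- 1e5
x <- rnorm(n)                    # exposure
y <- 0.5 * x + rnorm(n)          # outcome; true effect of x on y is 0.5
collider <- x + y + rnorm(n)     # common effect of exposure and outcome

coef(lm(y ~ x))['x']             # close to 0.5: no adjustment needed in this DAG
coef(lm(y ~ x + collider))['x']  # biased: adjusting for the collider is harmful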


I know @f2harrell has said something about ordinal outcomes and repeated measures, and has even advocated for proportional odds and praised a paper that used a Bayesian PO approach (an RCT with repeated measures, I believe). I know this appeared on datamethods, but I’m struggling to find any of it. Can anyone remember? Thanks

My resources are here and here.

I’m hoping to get a bootstrapped confidence interval for the c-index using validate(). I’ve used the method here that Frank suggests, though it is very slow because the model needs to be fit B x reps times. Is there a faster way to do this?

This is how I’m doing it currently:

require(rms)

DxyCI <- function(f, data, B, reps) {
  # Outer bootstrap: refit and validate() the model on each resample,
  # then take percentiles of the optimism-corrected Dxy values
  n <- nrow(data)
  dxy <- numeric(reps)
  for(i in 1:reps) {
    g <- update(f, subset=sample(1:n, n, replace=TRUE))  # refit on a bootstrap sample
    v <- validate(g, B=B)                                # inner bootstrap validation
    dxy[i] <- v['Dxy', 'index.corrected']
  }
  quantile(dxy, c(.025, .975))
}

N <- 100
reps <- 100
B <- 100

data <- data.frame(Y = rep(c(0,1), N/2),
                   X1 = rnorm(N),
                   X2 = rnorm(N))

f <- lrm(Y ~ X1 + X2, data = data, x = TRUE, y = TRUE)

DxyCI(f, data, B, reps)

I’m wondering if it’s possible to replace the v <- validate(g, B=B) line with something like Predict(g, data = data), extract just a single optimism-corrected Dxy for each rep, and take the quantiles of those. I have done something like that using other packages, but would like to stick with rms if possible. The problem is that I don’t think Predict() has an option to change the data on which the predictions are made.

If this question would be better off in a different thread I can move it.


Many thanks. I believe this is the study I was searching for:


I am curious about folks’ thoughts on using restricted cubic splines to model an outcome such as time within therapeutic range (TTR). This is a measure ranging from 0% to 100%. In the project I am working on, there is a fair amount of information at the tails (0% and 100% TTR); see the density plot below. If I modeled this variable using an RCS with 4 or 5 knots, do you all think it would behave appropriately? As a non-statistician I value everyone’s input. Thanks in advance.

Before answering that specific question: you mentioned that TTR is an outcome, but splines are usually used on predictors. What’s the goal of your study? Also note some possible deficiencies with TTR that might make going back to the raw data from which it is computed a better idea:

  • observation periods can differ by patient
  • summing over all times hides time trends in coagulation
  • “therapeutic range” has not been fully validated and ignores important variation in coagulation parameters even when they stay on one side of the therapeutic range cutoff

Hi Frank, sorry for not being clearer about our research question. We are evaluating the association between time in therapeutic blood pressure range and clinical outcomes (cardiovascular & kidney) in a hypertensive population. The lead investigator initially broke TTR down into 6 categories of 0%, 100%, and quantiles. Thanks.

Sorry, I thought TTR was used solely in anticoagulation therapy. Still, the comments about TTR for hypertension control hiding longitudinal trends and stability hold. Having multiple baseline variables representing summaries over different time periods would be of interest. But to your immediate question: as long as you manually specify knot locations it should be very effective to use a restricted cubic spline for TTR. The large number of ties at TTR=0% and 100% messes up default knot location choices.


What is the most efficient way to specify knot locations with the RMS package? I am used to allowing the package to choose for me [e.g. … rcs(TTR, 4)].

You can try the defaults and see if they make sense:

f <- fittingfunction(y ~ rcs(ttr, 5))
specs(f, long=TRUE)

Or specify your own locations: rcs(ttr, c(20, 30, 40, 50, ...)).
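For a variable like TTR with heavy ties at the boundaries, one possibility (just a sketch; d, ttr, and event are hypothetical names, and the quantile choices are only an example) is to compute knot locations from the interior values yourself and pass them to rcs():

knots <- as.numeric(quantile(d$ttr[d$ttr > 0 & d$ttr < 100],
                             probs = c(.05, .275, .5, .725, .95)))
f <- lrm(event ~ rcs(ttr, knots), data = d, x = TRUE, y = TRUE)
specs(f, long=TRUE)   # confirm the knot locations actually used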


The default for the y-axis of a partial effects plot from a logistic regression model is log odds. First, why is log odds the default? Second, how can I change it to an odds ratio for easier interpretability?

It is log odds because it is customary with most any regression model to show the linear predictor (X\hat{\beta}) scale. It is easy to convert this to the odds scale by specifying for example Predict(fit, fun=exp) but that is different from what you are requesting: odds ratios. To get an odds ratio you have to select a reference level of the predictor X, something that partial effect plots don’t require you to specify. It is harder to see what’s going on with reference levels when you have multiple predictors, so it’s best to take total control and to form contrasts, then pass these to your favorite plotting function as exemplified in the help file for contrast.rms. The contrast function is very general and handles any amount of nonlinearity or interaction that you happen to have put into the model. Once you get the contrasts you can anti-log them to get odds ratios, or look at the fun argument to contrast.

For example suppose that you wanted to get, for females, the odds ratios for age=x vs. age=30 for many values of x. You would do something like this to get a log scale:

k <- contrast(fit, list(sex='female', age=10:80), list(sex='female', age=30))
print(k, fun=exp)
# Plot odds ratios with pointwise 0.95 confidence bands using log scale
k <- as.data.frame(k[c('Contrast','Lower','Upper')])
ggplot(k, aes(x=10:80, y=exp(Contrast))) + geom_line() +
   geom_ribbon(aes(ymin=exp(Lower), ymax=exp(Upper)),
               alpha=0.15, linetype=0) +
   scale_y_continuous(trans='log10', n.breaks=10,
               minor_breaks=c(seq(0.1, 1, by=.1), seq(1, 10, by=.5))) +
  xlab('Age') + ylab('OR against age 30')

My principal investigator is interested in whether there is an optimal cutoff for weight in determining failure (yes/no) of a surgical implant (ankle replacement).
My thinking is to fit a logistic regression for failure, with weight modeled using knots, then look at the anova of the model. I’m finding that the nonlinear p-value for weight is not significant. Do I just stop there? What else can I do to investigate nonlinearity of weight? Nevertheless, when I ggplot the model with knots for weight, albeit nonsignificant, I see a curve for weight (a slight U shape), which suggests that the odds change at a particular cutpoint.


Don’t use statistical significance for model selection. Specify as many knots as your effective sample size will support, up to 5 or 6. Look at confidence bands for the estimated function. Don’t try to simplify it. No threshold exists.
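As a sketch of that workflow (hypothetical names: d for the data frame, failure for the outcome), fit weight flexibly and examine the whole estimated curve with its confidence band rather than hunting for a cutoff:

dd <- datadist(d); options(datadist = 'dd')
f <- lrm(failure ~ rcs(weight, 5), data = d, x = TRUE, y = TRUE)
anova(f)                     # overall and nonlinear components for weight
ggplot(Predict(f, weight))   # estimated log odds vs. weight with 0.95 pointwise bands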


Greetings all. I would appreciate any and all advice you can give this R newbie. While evaluating missing information in my dataset, I want to look at which variables are related to missingness through logistic regression. I have used the following to create a variable for missingness:
data_miss$missing_HTN <- is.na(data_miss$REC_HTN). This gives me a FALSE/TRUE result.

When I then use that as the dependent variable in lrm, HTN_MISS_REG = lrm(missing_HTN ~ ., data=data_miss), I get an error: Error in fitter(X, Y, offset = offs, penalty.matrix = penalty.matrix, :
NA/NaN/Inf in foreign function call (arg 1).

If I convert the missingness variable to a 0/1 factor and re-run lrm I get the following error: Error in names(numy) <- ylevels : ‘names’ attribute [2] must be the same length as the vector [1]

Use lrm(is.na(x) ~ ..., data=). But your main problem is that ~ . can be a bit dangerous because it’s not clear what is in the . (it expands to every other column in the data frame, including ones you may not want as predictors).

Would it be preferred to individually list each variable in the data file in the model statement rather than using ~ . ?

It’s worth a try, and remember not to use $ inside a formula; use data= instead.
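Putting those suggestions together, the intended kind of call looks something like this (a sketch; age, sex, and center are hypothetical predictors, while REC_HTN and data_miss come from the post above):

# Model missingness of REC_HTN directly, listing candidate predictors explicitly
f <- lrm(is.na(REC_HTN) ~ age + sex + center, data = data_miss)
f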