Comparing 2 biomarkers using restricted cubic splines - how-to, interpretation, model building

tomkaier · December 16, 2018, 6:04pm

That’s how the interaction model looks - as you predicted, significant interaction between cMyC & creatinine:

> model1 <- lrm(AMI ~ MyC_0h + creatinine + gender + age_y + dm_base + chol_base + htn_base + previousMI + smoke_tot, data=dat)
> model2 <- lrm(AMI ~ MyC_0h * creatinine + gender + age_y + dm_base + chol_base + htn_base + previousMI + smoke_tot, data=dat)
> model1
Frequencies of Missing Values Due to Each Variable
       AMI     MyC_0h creatinine     gender      age_y    dm_base  chol_base   htn_base previousMI 
         0          0         84          0          0          0          0          0          0 
 smoke_tot 
       102 

Logistic Regression Model
 
 lrm(formula = AMI ~ MyC_0h + creatinine + gender + age_y + dm_base + 
     chol_base + htn_base + previousMI + smoke_tot, data = dat)
 
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs           605    LR chi2     122.73    R2       0.282    C       0.792    
  0            472    d.f.             9    g        1.427    Dxy     0.583    
  1            133    Pr(> chi2) <0.0001    gr       4.168    gamma   0.583    
 max |deriv| 1e-06                          gp       0.177    tau-a   0.200    
                                            Brier    0.128                     
 
                Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept      -3.6535 0.7315 -4.99  <0.0001 
 MyC_0h          0.0022 0.0004  5.11  <0.0001 
 creatinine     -0.0071 0.0038 -1.87  0.0610  
 gender=male     1.0364 0.2693  3.85  0.0001  
 age_y           0.0203 0.0094  2.16  0.0310  
 dm_base=Yes    -0.6064 0.3396 -1.79  0.0742  
 chol_base=Yes   0.2359 0.2977  0.79  0.4281  
 htn_base=Yes    0.1179 0.2443  0.48  0.6294  
 previousMI=Yes -0.0451 0.2574 -0.18  0.8610  
 smoke_tot=Yes   0.5989 0.2664  2.25  0.0246  
 
> model2
Frequencies of Missing Values Due to Each Variable
       AMI     MyC_0h creatinine     gender      age_y    dm_base  chol_base   htn_base previousMI 
         0          0         84          0          0          0          0          0          0 
 smoke_tot 
       102 

Logistic Regression Model
 
 lrm(formula = AMI ~ MyC_0h * creatinine + gender + age_y + dm_base + 
     chol_base + htn_base + previousMI + smoke_tot, data = dat)
 
 
                        Model Likelihood     Discrimination    Rank Discrim.    
                           Ratio Test           Indexes           Indexes       
 Obs            605    LR chi2     160.92    R2       0.359    C       0.813    
  0             472    d.f.            10    g        2.780    Dxy     0.627    
  1             133    Pr(> chi2) <0.0001    gr      16.125    gamma   0.627    
 max |deriv| 0.0001                          gp       0.195    tau-a   0.215    
                                             Brier    0.119                     
 
                     Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept           -4.3255 0.7745 -5.59  <0.0001 
 MyC_0h               0.0123 0.0029  4.28  <0.0001 
 creatinine           0.0027 0.0049  0.56  0.5784  
 gender=male          0.9947 0.2899  3.43  0.0006  
 age_y                0.0158 0.0100  1.59  0.1122  
 dm_base=Yes         -0.4160 0.3351 -1.24  0.2145  
 chol_base=Yes        0.1780 0.3139  0.57  0.5707  
 htn_base=Yes         0.1187 0.2552  0.47  0.6419  
 previousMI=Yes      -0.0423 0.2677 -0.16  0.8745  
 smoke_tot=Yes        0.6103 0.2780  2.20  0.0281  
 MyC_0h * creatinine -0.0001 0.0000 -4.11  <0.0001 
 
> lrtest(model1, model2)

Model 1: AMI ~ MyC_0h + creatinine + gender + age_y + dm_base + chol_base + 
    htn_base + previousMI + smoke_tot
Model 2: AMI ~ MyC_0h * creatinine + gender + age_y + dm_base + chol_base + 
    htn_base + previousMI + smoke_tot

  L.R. Chisq         d.f.            P 
3.819232e+01 1.000000e+00 6.410432e-10

What I haven’t figured out yet is how to incorporate the interaction in a model using restricted cubic splines - as the argument rcs(MyC_0h, 4) * creatinine as used before when using splines does not compute:

singular information matrix in lrm.fit (rank= 13 ).  Offending variable(s):
MyC_0h'' * creatinine MyC_0h'' 
Error in lrm(AMI ~ rcs(MyC_0h, 4) * creatinine + gender + age_y + dm_base +  : 
  Unable to fit model using “lrm.fit”

f2harrell · December 17, 2018, 1:04am

It’s not clear if you can assume that the interaction is linear in creatinine. But allowing splines on both would make the numerical problems worse. You might use a restricted (in this case completely linear) interaction with rcs(x, 4) + creatinine + rcs(x, 4) %ia% creatinine.

tomkaier · December 17, 2018, 3:43am

Remarkable.

> lrm(AMI ~ rcs(log(MyC_0h), 4) + creatinine + rcs(log(MyC_0h), 4) %ia% creatinine + gender + age_y + dm_base + chol_base + htn_base + previousMI + smoke_tot, data=dat)
Frequencies of Missing Values Due to Each Variable
       AMI     MyC_0h creatinine     gender      age_y    dm_base  chol_base   htn_base previousMI  smoke_tot 
         0          0         84          0          0          0          0          0          0        102 

Logistic Regression Model
 
 lrm(formula = AMI ~ rcs(log(MyC_0h), 4) + creatinine + rcs(log(MyC_0h), 
     4) %ia% creatinine + gender + age_y + dm_base + chol_base + 
     htn_base + previousMI + smoke_tot, data = dat)
 
 
                        Model Likelihood     Discrimination    Rank Discrim.    
                           Ratio Test           Indexes           Indexes       
 Obs            605    LR chi2     232.42    R2       0.490    C       0.876    
  0             472    d.f.            14    g        2.397    Dxy     0.751    
  1             133    Pr(> chi2) <0.0001    gr      10.994    gamma   0.751    
 max |deriv| 0.0003                          gp       0.258    tau-a   0.258    
                                             Brier    0.103                     
 
                       Coef     S.E.    Wald Z Pr(>|Z|)
 Intercept             -21.0085  9.8741 -2.13  0.0334  
 MyC_0h                  7.3550  4.0504  1.82  0.0694  
 MyC_0h'               -29.4534 21.8136 -1.35  0.1769  
 MyC_0h''               60.2886 45.6987  1.32  0.1871  
 creatinine              0.1871  0.1119  1.67  0.0945  
 MyC_0h * creatinine    -0.0786  0.0459 -1.71  0.0867  
 MyC_0h' * creatinine    0.3958  0.2455  1.61  0.1070  
 MyC_0h'' * creatinine  -0.8265  0.5122 -1.61  0.1066  
 gender=male             0.8189  0.3103  2.64  0.0083  
 age_y                  -0.0092  0.0117 -0.79  0.4274  
 dm_base=Yes            -0.5656  0.3571 -1.58  0.1132  
 chol_base=Yes           0.1981  0.3501  0.57  0.5716  
 htn_base=Yes            0.1522  0.2783  0.55  0.5844  
 previousMI=Yes         -0.2280  0.2879 -0.79  0.4284  
 smoke_tot=Yes           0.7081  0.3014  2.35  0.0188

Using both cardiac markers:

> lrm(AMI ~ rcs(log(MyC_0h), 4) + lsp(hsTnT_ambulance, tnt.quant) + creatinine + rcs(log(MyC_0h), 4) %ia% creatinine + gender + age_y + dm_base + chol_base + htn_base + previousMI + smoke_tot, data=dat)
Frequencies of Missing Values Due to Each Variable
            AMI          MyC_0h hsTnT_ambulance      creatinine          gender           age_y 
              0               0               0              84               0               0 
        dm_base       chol_base        htn_base      previousMI       smoke_tot 
              0               0               0               0             102 

Logistic Regression Model
 
 lrm(formula = AMI ~ rcs(log(MyC_0h), 4) + lsp(hsTnT_ambulance, 
     tnt.quant) + creatinine + rcs(log(MyC_0h), 4) %ia% creatinine + 
     gender + age_y + dm_base + chol_base + htn_base + previousMI + 
     smoke_tot, data = dat)
 
 
                        Model Likelihood     Discrimination    Rank Discrim.    
                           Ratio Test           Indexes           Indexes       
 Obs            605    LR chi2     242.68    R2       0.507    C       0.882    
  0             472    d.f.            19    g        2.735    Dxy     0.764    
  1             133    Pr(> chi2) <0.0001    gr      15.408    gamma   0.765    
 max |deriv| 0.0004                          gp       0.264    tau-a   0.263    
                                             Brier    0.100                     
 
                       Coef     S.E.    Wald Z Pr(>|Z|)
 Intercept             -27.3372 11.2627 -2.43  0.0152  
 MyC_0h                  6.5097  4.1088  1.58  0.1131  
 MyC_0h'               -28.5706 22.3518 -1.28  0.2012  
 MyC_0h''               59.9634 46.8771  1.28  0.2008  
 hsTnT_ambulance         1.6207  1.1207  1.45  0.1481  
 hsTnT_ambulance'       -1.9221  1.4183 -1.36  0.1753  
 hsTnT_ambulance''       0.5135  0.5080  1.01  0.3120  
 hsTnT_ambulance'''     -0.2128  0.1101 -1.93  0.0533  
 hsTnT_ambulance''''     0.0001  0.0302  0.00  0.9965  
 creatinine              0.1805  0.1141  1.58  0.1136  
 MyC_0h * creatinine    -0.0778  0.0469 -1.66  0.0975  
 MyC_0h' * creatinine    0.4040  0.2522  1.60  0.1092  
 MyC_0h'' * creatinine  -0.8462  0.5262 -1.61  0.1078  
 gender=male             0.7561  0.3137  2.41  0.0159  
 age_y                  -0.0188  0.0124 -1.51  0.1301  
 dm_base=Yes            -0.6806  0.3588 -1.90  0.0578  
 chol_base=Yes           0.2252  0.3535  0.64  0.5240  
 htn_base=Yes            0.0775  0.2830  0.27  0.7843  
 previousMI=Yes         -0.2190  0.2895 -0.76  0.4494  
 smoke_tot=Yes           0.6990  0.3045  2.30  0.0217

When performing a fastbw on the models, the factors in the final model are still the same - MyC, creatinine & gender. Why do you think it is that I can’t use the interaction term in a model written as follows:

> lrm(AMI ~ rcs(log(MyC_0h), 4) + creatinine + rcs(log(MyC_0h), 4) %ia% creatinine + gender, data=dat)
singular information matrix in lrm.fit (rank= 8 ).  Offending variable(s):
MyC_0h'' * creatinine 
Error in lrm(AMI ~ rcs(log(MyC_0h), 4) + creatinine + rcs(log(MyC_0h),  : 
  Unable to fit model using “lrm.fit”

jroon · December 17, 2018, 11:21am

Interesting! Did you try to plot the spline fits? I always find that helpful in understanding these kinds of models.

If you plot the fit from the model Frank suggested, one thing you might also try is to fit this model:
x + rcs(creatinine , 4) + x %ia% rcs(creatinine, 4)
and also plot that fit. Assuming there are not numerical problems it would give you some idea of how non-linear the creatinine variable is. I’m sorry I’ not familiar enough with rms to explain the error code you are getting.

Peronsally, I don’t use automatic selection algorithms at all anymore. I think its better to make decisions on variable inclusion/exclusion based on your expert knowledge. DAGs can be helpful although I am but a novice with them myself. @f2harrell I’m curious your thoughts on this?

tomkaier · December 17, 2018, 12:41pm

Do you mean with Predict(model)? Here are two plots with MyC and creatinine separately:

model3 <- lrm(AMI ~ rcs(log(MyC_0h), 4) + creatinine + rcs(log(MyC_0h), 4) %ia% creatinine + gender + age_y + dm_base + chol_base + htn_base + previousMI + smoke_tot, data=dat)
mod3.plot <- ggplot(Predict(model3, MyC_0h, fun=exp), anova=an3, pval=TRUE, ylab="Odds ratio")
mod3.plot <- ggplot(Predict(model3, creatinine, fun=exp), anova=an3, pval=TRUE, ylab="Odds ratio")

//edit: Now corrected plots for Odds ratio

tomkaier · December 17, 2018, 12:43pm

Also interesting that this model does not compute unless there are other variables included in the model. I’m working on trying to figure out why this is…

f2harrell · December 17, 2018, 12:52pm

I fully agree. Not only is variable selection unreliable, it distorts inference (standard errors, confidence intervals, p-values).

f2harrell · December 17, 2018, 12:53pm

It is not correct to relabel the y-axis without transforming it. This graph is showing log odds.

jroon · December 17, 2018, 1:04pm

Interesting plots!! For the MyC_0h plot you might want to add +scale_x_log10() because the model is fitting log of MyC_0h.

jroon · December 17, 2018, 1:05pm

Did you plot this fit too? I’d be curious to see that.

You could try reducing the rcs df to 3 to see if that makes numerical issues eaier. (sorry I missed it above if there is a specific reason your are using df = 4)

tomkaier · December 17, 2018, 2:05pm

Apologies, the fun=exp got lost in copy & paste. Now adjusted.

tomkaier · December 17, 2018, 2:07pm

Good point. See here

tomkaier · December 17, 2018, 2:17pm

Spot on!

> lrm(AMI ~ rcs(log(MyC_0h), 3) + creatinine + rcs(log(MyC_0h), 3) %ia% creatinine, data=dat)
Frequencies of Missing Values Due to Each Variable
       AMI     MyC_0h creatinine 
         0          0         84 

Logistic Regression Model
 
 lrm(formula = AMI ~ rcs(log(MyC_0h), 3) + creatinine + rcs(log(MyC_0h), 
     3) %ia% creatinine, data = dat)
 
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs           692    LR chi2     237.13    R2       0.439    C       0.857    
  0            532    d.f.             5    g        2.008    Dxy     0.713    
  1            160    Pr(> chi2) <0.0001    gr       7.449    gamma   0.713    
 max |deriv| 7e-08                          gp       0.252    tau-a   0.254    
                                            Brier    0.115                     
 
                      Coef     S.E.   Wald Z Pr(>|Z|)
 Intercept            -13.4744 3.6889 -3.65  0.0003  
 MyC_0h                 3.9502 1.1498  3.44  0.0006  
 MyC_0h'               -3.1741 1.2060 -2.63  0.0085  
 creatinine             0.0884 0.0427  2.07  0.0385  
 MyC_0h * creatinine   -0.0306 0.0132 -2.32  0.0202  
 MyC_0h' * creatinine   0.0321 0.0135  2.38  0.0173

You mean just biomarker, creatinine & gender? log10 transformed, stratified by gender:

jroon · December 17, 2018, 2:33pm

No sorry I meant Odds ratio vs creatinine when for the model where the rcs() is fit across the creatinine variable.

Oh on seeing the graphs I think perhaps the version without the scale_x_log10 is more informative - what do you think ?

tomkaier · December 17, 2018, 2:47pm

Frequencies of Missing Values Due to Each Variable
       AMI     MyC_0h creatinine     gender      age_y    dm_base  chol_base   htn_base previousMI 
         0          0         84          0          0          0          0          0          0 
 smoke_tot 
       102 

Logistic Regression Model
 
 lrm(formula = AMI ~ MyC_0h + rcs(creatinine, 3) + MyC_0h %ia% 
     rcs(creatinine, 3) + gender + age_y + dm_base + chol_base + 
     htn_base + previousMI + smoke_tot, data = dat)
 
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs           605    LR chi2     177.73    R2       0.391    C       0.843    
  0            472    d.f.            12    g        4.451    Dxy     0.686    
  1            133    Pr(> chi2) <0.0001    gr      85.743    gamma   0.687    
 max |deriv| 2e-07                          gp       0.207    tau-a   0.236    
                                            Brier    0.116                     
 
                      Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept            -5.9314 1.3417 -4.42  <0.0001 
 MyC_0h                0.0728 0.0175  4.15  <0.0001 
 creatinine            0.0210 0.0169  1.24  0.2146  
 creatinine'          -0.0191 0.0186 -1.03  0.3049  
 MyC_0h * creatinine  -0.0008 0.0002 -3.99  <0.0001 
 MyC_0h * creatinine'  0.0008 0.0002  3.79  0.0002  
 gender=male           1.0152 0.3143  3.23  0.0012  
 age_y                 0.0179 0.0103  1.75  0.0803  
 dm_base=Yes          -0.3453 0.3360 -1.03  0.3041  
 chol_base=Yes         0.0861 0.3178  0.27  0.7865  
 htn_base=Yes          0.0825 0.2625  0.31  0.7534  
 previousMI=Yes        0.0192 0.2716  0.07  0.9438  
 smoke_tot=Yes         0.5987 0.2849  2.10  0.0356

Agreed.

jroon · December 17, 2018, 2:56pm

Ok wow that stays very linear even though it has the freedom to deviate from linear - in contrast to the biomarker! Seems rcs(log(MyC_0h), 3) is the better choice.

So what you could also plot for the model:
AMI ~ rcs(log(MyC_0h), 3) + creatinine + rcs(log(MyC_0h), 3) %ia% creatinine

make a plot for OR vs MyC_0h at different creatinine values, eg. Cr = 80, 120, 160, 200. Then if you want to also include gender you could facet_wrap by gender giving you a male and female facet for example.

To give you an idea - this is from a presentation I gave previously - the coloured lines are each at a certain fixed value for a continuous variable that interacts with the x varaible (which includes a spline):

tomkaier · December 17, 2018, 3:37pm

Thanks. CIs overlap in the following plot, so omitted but that should reflect the relationship - and is a nice way to present the interaction and gender-specifics:

jroon · December 17, 2018, 3:49pm

Wow big male to female differences.
Which model is behind this - its more linear than I expected (although - maybe not having gender in the model caused some of the apparent non-linearity in the graphs earlier).

I picked those values arbitrarily - you could just use 3 more spaced out ones if you want.

jroon · December 17, 2018, 4:06pm

I wonder why there is a funny little uptick at the origin for the creatinine = 200 curve??

tomkaier · December 17, 2018, 4:10pm

Good point - I had the same thought. The model was just based on MyC, creatinine & gender a la

lrm(AMI ~ rcs(log(MyC_0h), 3) + creatinine + rcs(log(MyC_0h), 3) %ia% creatinine + gender, data=dat)

Hence I’ve created the long model, using all available clinical variables:

lrm(AMI ~ rcs(log(MyC_0h), 3) + creatinine + rcs(log(MyC_0h), 3) %ia% creatinine + gender + age_y + dm_base + chol_base + htn_base + previousMI + smoke_tot, data=dat)

//edit:

Might this have to do with the distribution of creatinine at the extremes? See summary:

> summary(dat$creatinine)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  37.00   67.00   80.00   92.09   99.00  967.00      84

Which, in my mind, then also means we should apply a spline to creatinine?