Comparing 2 biomarkers using restricted cubic splines - how-to, interpretation, model building

tomkaier · November 19, 2018, 4:07pm

I have gone about solving the regression in the following manner:

Determine quantiles (0.01, 0.25, 0.5, 0.75) for each biomarker above the LoD
Fit linear splines with above quantiles - however, I got an error when trying to fit the model using lrm.fit with a 1% quantile above the LoD. It would only compute at 0.12 quantile - maybe I’m overlooking something?
Build regression model with all relevant risk factors, and as suggested by @kiwiskiNZ, including the CKD-EPI formula, thus omitting age & gender from the regression equation.

Code as below:

dat <- dat %>% mutate(
  myc.belowlod = case_when(
    MyC_0h <= 0.4 ~ 1,
    TRUE ~ 0
  ),
  tnt.belowlod = case_when(
    hsTnT_ambulance <= 5 ~ 1,
    TRUE ~ 0
  )
)

myc.quant <- round(as.numeric(quantile(subset(dat$MyC_0h, dat$myc.belowlod==0), probs = c(0.12, 0.25, 0.5, 0.75))))
tnt.quant <- round(as.numeric(quantile(subset(dat$hsTnT_ambulance, dat$tnt.belowlod==0), probs = c(0.12, 0.25, 0.5, 0.75)))) 
lrm(AMI ~ lsp(MyC_0h, myc.quant) + lsp(hsTnT_ambulance, tnt.quant) + gfr_epi + dm_base + chol_base + htn_base + previousMI, data=dat) # Full monty - i.e. all-inclusive model
lrm(AMI ~ lsp(MyC_0h, myc.quant) + gfr_epi + dm_base + chol_base + htn_base + previousMI, data=dat) # cMyC model
lrm(AMI ~ lsp(hsTnT_ambulance, tnt.quant) + gfr_epi + dm_base + chol_base + htn_base + previousMI, data=dat) # hs-cTnT model

Output:

Frequencies of Missing Values Due to Each Variable
            AMI          MyC_0h hsTnT_ambulance         gfr_epi         dm_base 
              0               0               0              84               0 
      chol_base        htn_base      previousMI 
              0               0               0 

Logistic Regression Model
 
 lrm(formula = AMI ~ lsp(MyC_0h, myc.quant) + lsp(hsTnT_ambulance, 
     tnt.quant) + gfr_epi + dm_base + chol_base + htn_base + previousMI, 
     data = dat)
 
 
                        Model Likelihood     Discrimination    Rank Discrim.    
                           Ratio Test           Indexes           Indexes       
 Obs            692    LR chi2     256.87    R2       0.469    C       0.866    
  0             532    d.f.            15    g        2.827    Dxy     0.732    
  1             160    Pr(> chi2) <0.0001    gr      16.897    gamma   0.732    
 max |deriv| 0.0001                          gp       0.263    tau-a   0.261    
                                             Brier    0.109                     
 
                     Coef     S.E.    Wald Z Pr(>|Z|)
 Intercept           -22.0606 10.4984 -2.10  0.0356  
 MyC_0h                1.9825  1.4746  1.34  0.1788  
 MyC_0h'              -2.2639  1.5749 -1.44  0.1506  
 MyC_0h''              0.3488  0.2605  1.34  0.1806  
 MyC_0h'''            -0.0350  0.0532 -0.66  0.5110  
 MyC_0h''''           -0.0321  0.0105 -3.05  0.0023  
 hsTnT_ambulance       0.5085  0.4831  1.05  0.2925  
 hsTnT_ambulance'     -0.4553  1.1983 -0.38  0.7039  
 hsTnT_ambulance''     0.1270  0.8750  0.15  0.8846  
 hsTnT_ambulance'''   -0.1716  0.1021 -1.68  0.0929  
 hsTnT_ambulance''''  -0.0044  0.0271 -0.16  0.8723  
 gfr_epi               0.0281  0.0057  4.92  <0.0001 
 dm_base=Yes          -0.9719  0.3163 -3.07  0.0021  
 chol_base=Yes         0.3064  0.3181  0.96  0.3354  
 htn_base=Yes          0.0979  0.2494  0.39  0.6948  
 previousMI=Yes       -0.0445  0.2483 -0.18  0.8578  
 
Frequencies of Missing Values Due to Each Variable
       AMI     MyC_0h    gfr_epi    dm_base  chol_base   htn_base previousMI 
         0          0         84          0          0          0          0 

Logistic Regression Model
 
 lrm(formula = AMI ~ lsp(MyC_0h, myc.quant) + gfr_epi + dm_base + 
     chol_base + htn_base + previousMI, data = dat)
 
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs           692    LR chi2     242.72    R2       0.448    C       0.856    
  0            532    d.f.            10    g        2.504    Dxy     0.711    
  1            160    Pr(> chi2) <0.0001    gr      12.231    gamma   0.711    
 max |deriv| 3e-05                          gp       0.257    tau-a   0.253    
                                            Brier    0.112                     
 
                Coef     S.E.   Wald Z Pr(>|Z|)
 Intercept      -18.4166 9.6059 -1.92  0.0552  
 MyC_0h           1.9705 1.4042  1.40  0.1605  
 MyC_0h'         -2.0841 1.4968 -1.39  0.1638  
 MyC_0h''         0.2164 0.2467  0.88  0.3803  
 MyC_0h'''       -0.0598 0.0502 -1.19  0.2334  
 MyC_0h''''      -0.0420 0.0082 -5.13  <0.0001 
 gfr_epi          0.0233 0.0052  4.44  <0.0001 
 dm_base=Yes     -0.8962 0.3145 -2.85  0.0044  
 chol_base=Yes    0.2060 0.3070  0.67  0.5023  
 htn_base=Yes     0.1628 0.2453  0.66  0.5069  
 previousMI=Yes  -0.0411 0.2472 -0.17  0.8678  
 
Frequencies of Missing Values Due to Each Variable
            AMI hsTnT_ambulance         gfr_epi         dm_base       chol_base 
              0               0              84               0               0 
       htn_base      previousMI 
              0               0 

Logistic Regression Model
 
 lrm(formula = AMI ~ lsp(hsTnT_ambulance, tnt.quant) + gfr_epi + 
     dm_base + chol_base + htn_base + previousMI, data = dat)
 
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs           692    LR chi2     231.35    R2       0.430    C       0.849    
  0            532    d.f.            10    g        2.275    Dxy     0.698    
  1            160    Pr(> chi2) <0.0001    gr       9.724    gamma   0.698    
 max |deriv| 0.002                          gp       0.250    tau-a   0.248    
                                            Brier    0.116                     
 
                     Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept           -9.7911 2.6558 -3.69  0.0002  
 hsTnT_ambulance      0.6282 0.4268  1.47  0.1410  
 hsTnT_ambulance'    -0.4979 1.0905 -0.46  0.6480  
 hsTnT_ambulance''    0.1676 0.8135  0.21  0.8368  
 hsTnT_ambulance'''  -0.2298 0.0947 -2.43  0.0152  
 hsTnT_ambulance'''' -0.0616 0.0220 -2.81  0.0050  
 gfr_epi              0.0283 0.0055  5.11  <0.0001 
 dm_base=Yes         -1.0309 0.3104 -3.32  0.0009  
 chol_base=Yes        0.3356 0.3064  1.10  0.2734  
 htn_base=Yes         0.0565 0.2419  0.23  0.8153  
 previousMI=Yes       0.0463 0.2425  0.19  0.8486

I’m not convinced that the linear spline fits better than the restricted cubic spline quoted above? In fact, the LR chi2 is 259, R2 0.473 and C index 0.865 for the all-inclusive model using rcs, which I think points to a better model fit?

tomkaier · November 19, 2018, 4:13pm

Specifically, here’s my code for calculating the eGFR using CKD-EPI:

dat <- dat %>% mutate(
  crea_mgdl = creatinine/88.4,
  gfr_epi = case_when(
    race == "white" & gender == "female" ~ 141 * pmin(crea_mgdl/0.7, 1)^(-0.329) * pmax(crea_mgdl/0.7, 1) ^ (-1.209) * 0.993 ^age_y * 1.018,
    race == "white" & gender == "male" ~ 141 * pmin(crea_mgdl/0.9, 1) ^(-0.411) * pmax(crea_mgdl/0.9, 1) ^(-1.209) * 0.993 ^age_y,
    race == "black" & gender == "female" ~ 141 * pmin(crea_mgdl/0.7, 1)^(-0.329) * pmax(crea_mgdl/0.7, 1) ^ (-1.209) * 0.993 ^age_y * 1.018 * 1.159,
    race == "black" & gender == "male" ~ 141 * pmin(crea_mgdl/0.9, 1) ^(-0.411) * pmax(crea_mgdl/0.9, 1) ^(-1.209) * 0.993 ^age_y * 1.159,
    TRUE ~ NA_real_
  )
)

Need to play a little more on regression of eGFR with biomarkers - not convinced I’ve got a conclusive answer yet, but you can see the implementation of eGFR in the above analysis John!

f2harrell · November 19, 2018, 8:40pm

The cubic spline model is probably better. I just want to make sure a discontinuity isn’t needed near the threshold (detection limit). Not clear why the linear spline wouldn’t fit with the first knot closer to the limit. May be good to show us a high-resolution histogram of the measurements along with reference lines showing all the quantiles that were considered.

tomkaier · November 20, 2018, 4:30pm

Like this?

This is what I’ve tried - some quantiles obviously overlap. For both biomarkers, any quantile below 12% would cause the following error message:

myc.quant <- round(as.numeric(quantile(subset(dat$MyC_0h, dat$myc.belowlod==0), probs = c( .05, 0.25, 0.5, 0.75))))
lrm(AMI ~ lsp(MyC_0h, myc.quant) + gfr_epi + dm_base + chol_base + htn_base + previousMI, data=dat) # Full monty - i.e. all-inclusive model
singular information matrix in lrm.fit (rank= 10 ).  Offending variable(s):
MyC_0h' 
Error in lrm(AMI ~ lsp(MyC_0h, myc.quant) + gfr_epi + dm_base + chol_base +  : 
  Unable to fit model using “lrm.fit”

f2harrell · November 20, 2018, 10:37pm

You must have more ties in the data than I suspected, which would cause the difficulty. But the histograms are not high resolution enough. Just show the output of table() perhaps.

tomkaier · November 21, 2018, 10:17am

E.g. for cMyC:

> table(dat$MyC_0h)

    1.89     2.38     2.89     3.09     3.21     3.26     3.32     3.35     3.42 
       1        1        1        2        1        1        1        1        1 
    3.55      3.6     3.68     3.74     3.77     3.78     3.81     3.95     4.08 
       1        1        1        2        1        2        1        1        2 
    4.21     4.28      4.4     4.41     4.47     4.48     4.54      4.6     4.62 
       1        1        1        1        2        1        1        1        1 
    4.63     4.68     4.75      4.8     4.82     4.88     4.89     4.96     4.97 
       1        1        1        1        1        1        1        1        2 
    5.03     5.04     5.09     5.15     5.16     5.25     5.29      5.3     5.36 
       1        2        1        1        2        1        1        1        1 
    5.37     5.42     5.43     5.53     5.61     5.69      5.7     5.76     5.87 
       1        3        1        2        2        1        2        1        2 
     5.9     5.93     6.05      6.1     6.18     6.27      6.3     6.32     6.33 
       2        1        1        2        1        1        2        1        1 
    6.36     6.37     6.45     6.54     6.56     6.57     6.65     6.66     6.72 
       1        1        1        1        1        1        1        2        2 
    6.77     6.83     6.88      6.9     6.94     6.99     7.01     7.19     7.28 
       1        1        1        1        1        1        1        2        2 
    7.29     7.33     7.34     7.39     7.44     7.45     7.51     7.56     7.62 
       1        3        1        2        1        1        1        1        1 
    7.68      7.7     7.74     7.76     7.79     7.84     7.85     7.88     7.89 
       1        1        1        1        1        1        1        1        1 
    7.95     7.96     7.98     8.01     8.02     8.06     8.09     8.16      8.3 
       1        2        1        1        2        1        1        2        1 
    8.32     8.37      8.4     8.43      8.5     8.53     8.64     8.71     8.72 
       1        1        2        1        1        1        2        1        1 
    8.76     8.79     8.84     8.95     8.98     9.01     9.12     9.13     9.23 
       1        1        1        1        1        2        2        1        1 
    9.28     9.33     9.34     9.39     9.45     9.46     9.47      9.5     9.53 
       1        2        1        1        1        1        1        1        1 
    9.56      9.6     9.66     9.67     9.73     9.75     9.77     9.78     9.87 
       1        1        1        1        1        1        1        1        1 
    9.93     9.94     9.95     9.96     9.98    10.02    10.03    10.08    10.11 
       1        1        1        2        1        1        1        1        1 
   10.15    10.16    10.17    10.27    10.34    10.35    10.37    10.55    10.62

f2harrell · November 21, 2018, 12:53pm

Please remind me of the detection limit.

tomkaier · November 21, 2018, 1:08pm

Sorry, included in the plot. cMyC LoD is 0.4 ng/L, LoQ 1.2 ng/L

f2harrell · November 21, 2018, 3:25pm

I must be missing something obvious but I can’t see why putting a knot at 5.37 for the linear spline wouldn’t work, and this would effectively estimate the separate effect of being below the LOD. If that doesn’t work, then maybe at 6. Make sure the values below LOD are coded as LOD and not zero. Can also fit a restricted cubic spline with an extra knot just above 5.37 to allow complexity near LOD.

tomkaier · November 21, 2018, 3:56pm

All patients have a biomarker result above the LOD (for cMyC). I have trialled different cut-offs/knots now, the closest I can get to 5.37 is with a knot at 8 - not sure either why this might be? Also, 8 seems to be too close to 11 (0.25 quantile), so I had to adjust the second knot to 12 so that it can fit the model. This would be output from the cMyC-only model:

lrm(AMI ~ lsp(MyC_0h, c(8, 12, 24, 69)) + gfr_epi + dm_base + chol_base + htn_base + previousMI, data=dat) # cMyC model

Frequencies of Missing Values Due to Each Variable
       AMI     MyC_0h    gfr_epi    dm_base  chol_base   htn_base previousMI 
         0          0         84          0          0          0          0 

Logistic Regression Model
 
 lrm(formula = AMI ~ lsp(MyC_0h, c(8, 12, 24, 69)) + gfr_epi + 
     dm_base + chol_base + htn_base + previousMI, data = dat)
 
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs           692    LR chi2     242.18    R2       0.447    C       0.857    
  0            532    d.f.            10    g        2.242    Dxy     0.713    
  1            160    Pr(> chi2) <0.0001    gr       9.416    gamma   0.713    
 max |deriv| 3e-06                          gp       0.257    tau-a   0.254    
                                            Brier    0.112                     
 
                Coef     S.E.   Wald Z Pr(>|Z|)
 Intercept      -12.1604 3.9805 -3.05  0.0023  
 MyC_0h           0.9627 0.5255  1.83  0.0670  
 MyC_0h'         -1.1365 0.6423 -1.77  0.0768  
 MyC_0h''         0.2934 0.2401  1.22  0.2217  
 MyC_0h'''       -0.0773 0.0559 -1.38  0.1662  
 MyC_0h''''      -0.0413 0.0082 -5.02  <0.0001 
 gfr_epi          0.0233 0.0052  4.45  <0.0001 
 dm_base=Yes     -0.8958 0.3146 -2.85  0.0044  
 chol_base=Yes    0.2116 0.3070  0.69  0.4907  
 htn_base=Yes     0.1616 0.2454  0.66  0.5103  
 previousMI=Yes  -0.0417 0.2474 -0.17  0.8662

So I’m not sure about that either. One would think there are enough cases at that knot?

f2harrell · November 21, 2018, 4:05pm

I hadn’t caught that. None of the complexities apply for that marker and ordinary knot locations can be used. Show partial effect plot for the linear spline for the one that had values below LOD, e.g. ggplot(Predict(...)).

tomkaier · November 21, 2018, 4:07pm

Checked some more - when you run the following, then the second knot seems to cause the problem:

lrm(AMI ~ lsp(MyC_0h, c(5.37, 12, 24, 69)), data=dat) # cMyC model
singular information matrix in lrm.fit (rank= 5 ).  Offending variable(s):
MyC_0h'

And it’s the same with the restricted cubic spline model. When removing the second knot, it works:

> mod2 <- lrm(AMI ~ rcs(MyC_0h, c(5.37, 24, 69)) + gfr_epi + dm_base + chol_base + htn_base + previousMI, data=dat) # cMyC model
> mod2
Frequencies of Missing Values Due to Each Variable
       AMI     MyC_0h    gfr_epi    dm_base  chol_base   htn_base previousMI 
         0          0         84          0          0          0          0 

Logistic Regression Model
 
 lrm(formula = AMI ~ rcs(MyC_0h, c(5.37, 24, 69)) + gfr_epi + 
     dm_base + chol_base + htn_base + previousMI, data = dat)
 
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs           692    LR chi2     228.49    R2       0.426    C       0.849    
  0            532    d.f.             7    g        2.055    Dxy     0.699    
  1            160    Pr(> chi2) <0.0001    gr       7.806    gamma   0.699    
 max |deriv| 1e-06                          gp       0.250    tau-a   0.249    
                                            Brier    0.114                     
 
                Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept      -6.5082 0.7281 -8.94  <0.0001 
 MyC_0h          0.1363 0.0155  8.80  <0.0001 
 MyC_0h'        -0.1538 0.0177 -8.69  <0.0001 
 gfr_epi         0.0218 0.0052  4.22  <0.0001 
 dm_base=Yes    -0.8637 0.3127 -2.76  0.0057  
 chol_base=Yes   0.1800 0.3008  0.60  0.5495  
 htn_base=Yes    0.1335 0.2405  0.56  0.5789  
 previousMI=Yes -0.0262 0.2423 -0.11  0.9139

tomkaier · November 21, 2018, 4:16pm

Excellent, one aspect solved I take it rcs(MyC_0h, 4) for restricted cubic spline with 4 knots is adequate?

cTnT has values below LOD (5 ng/L) which I’ve re-coded to 4.99. The quantiles you suggested come out at 5, 8, 14, 31. The lsp model computes at a first knot of 5.5:

> mod3 <- lrm(AMI ~ lsp(hsTnT_ambulance, c(5.5, 8, 14, 31)) + gfr_epi + dm_base + chol_base + htn_base + previousMI, data=dat) # hs-cTnT model
> mod3
Frequencies of Missing Values Due to Each Variable
            AMI hsTnT_ambulance         gfr_epi         dm_base       chol_base 
              0               0              84               0               0 
       htn_base      previousMI 
              0               0 

Logistic Regression Model
 
 lrm(formula = AMI ~ lsp(hsTnT_ambulance, c(5.5, 8, 14, 31)) + 
     gfr_epi + dm_base + chol_base + htn_base + previousMI, data = dat)
 
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs           692    LR chi2     234.08    R2       0.434    C       0.849    
  0            532    d.f.            10    g        2.668    Dxy     0.699    
  1            160    Pr(> chi2) <0.0001    gr      14.407    gamma   0.699    
 max |deriv| 7e-05                          gp       0.251    tau-a   0.249    
                                            Brier    0.116                     
 
                     Coef     S.E.    Wald Z Pr(>|Z|)
 Intercept           -20.9423 11.7865 -1.78  0.0756  
 hsTnT_ambulance       2.8236  2.1877  1.29  0.1968  
 hsTnT_ambulance'     -2.8094  2.3351 -1.20  0.2289  
 hsTnT_ambulance''     0.3012  0.3781  0.80  0.4257  
 hsTnT_ambulance'''   -0.2481  0.0926 -2.68  0.0074  
 hsTnT_ambulance''''  -0.0608  0.0219 -2.77  0.0056  
 gfr_epi               0.0285  0.0056  5.13  <0.0001 
 dm_base=Yes          -1.0369  0.3103 -3.34  0.0008  
 chol_base=Yes         0.3546  0.3062  1.16  0.2469  
 htn_base=Yes          0.0543  0.2418  0.22  0.8222  
 previousMI=Yes        0.0454  0.2427  0.19  0.8515

tomkaier · November 21, 2018, 4:57pm

f2harrell · November 22, 2018, 1:55pm

Very informative. Sometimes we backup spline analyses with nonparametric regression (e.g., loess) but loess doesn’t work perfectly when Y is binary even if you turn off outlier detection.

kiwiskiNZ · November 26, 2018, 3:33am

I’ve been following this discussion about placement of knots. I too have noticed some difficulty in the past. I wonder if the knots are placed in such a way that either below the lowest, or between two knots the problem occurs because there are too few patients with the outcome of interest?

tomkaier · November 26, 2018, 8:23am

Very possible, John! Of course, we know that AMI is extremely unlikely below certain concentrations, and we might only find 1 or 2 cases of AMI below, say, 20 ng/L (of cTnT). This cohort consists also of very-early presenters, hence cTnT is very likely still in the early phases of release.

// edit: Although, I’m not sure whether the splines care about the number of cases at/below each knot? It would just be my assumption that this could influence whether a model works or not…

jroon · December 11, 2018, 4:23pm

Why would you consider creatinine (or eGFR) as a substitute for age?
A-priori, I would assume that while creatinine may have a relationship with age, both age and creatinine likely also have an independent relationship with risk for AMI. Personally I would try a model including both terms.

@tomkaier regarding splines - perhaps some of the other continuous variables also have nonlinear relationships with AMI? I would expect eGFR to have a non-linear relationship for example. Alternatively, perhaps there is an interaction between troponin and eGFR (given that worse eGFR will effect the half-life of elimination of troponin I would again expect such an interaction).

tomkaier · December 14, 2018, 9:41am

I guess John’s @kiwiskiNZ thought was that the GFR-epi formula incorporates age, gender and race and may be omitted in the logistic regression. In fact, GFR makes the model fit worse. In comparison, a model incorporating all cardiovascular risk factors, smoking status, gender, age and creatinine plus cardiac biomarkers (rcs for cMyC, lsp for cTnT) results in a R2 0.499 and C 0.879. If you run a fast backward variable selection (using AIC as the stopping rule) on this, the factors in the final model are cMyC, creatine and gender. This model yields an R2 0.447, C 0.862, but is obviously way more parsimonious and easily translated into a nomogram.

Good point! Splines haven’t improved the fit dramatically when used for creatinine. Looks fairly evenly distributed on histogram too.

jroon · December 14, 2018, 10:37am

Ah right yes I did not think of that, although in general its better to include each individual term over composite terms if you can.

Did you try the interaction model ? I would do that before adding splines. So do something like:
Model1: AMI ~ cMyc + creatintine + other_vars
Model2: AMI ~ cMyc * creatintine + other_vars
lrt(Model1, Model2)