Incorporating "threshold effects" and smooth functions of the same variable

Ahmed_Sayed · May 20, 2024, 10:19am

Hello all,

I’m interested in assessing the value of a variable (above and beyond the usual set of variables we take into consideration) for predicting cardiovascular events. One of the covariates I am adjusting for is “coronary artery calcium” (CAC). It is essentially a measure of how much calcium an individual has deposited in their coronary arteries. Typically, the more the calcium, the worse the prognosis. I recognize apriori the relationship may not be linear and will fit CAC using a smooth/spline.

However, I am also concerned that there may well be a “threshold” effect wherein the presence/absence of CAC denotes somewhat of a threshold for the underlying biological process (namely atherosclerosis). That is, a CAC of 0 denotes the absence of a clear/gross manifestation of atherosclerosis and differs qualitatively from a low non-zero score (e.g., of 10) insofar as the latter implies the presence of an atherosclerotic process that has indeed made itself grossly manifest.

I understand that the use of a smoothed version of CAC will take care of this to some extent since the model will recognize a “sharp takeoff” near a CAC of 0. Still, I wonder if explicitly adding a binary variable denoting the absence (CAC = 0) or presence (CAC > 0) of CAC might be justifiable?

I’ve gone ahead and compared the 2 models: one with CAC as a smooth variable only and another with smooth CAC + a binary indicator for presence/absence of CAC (both models had other relevant predictors as covariates).

The later does have better fit as indicated by a likelihood ratio test, and the coefficient for the binary CAC indicator in the latter model (which, to reiterate, already includes smooth CAC) is far from trivial. Still, I don’t want to be purely guided by indices of fit, particularly as I don’t think I’ve come across this approach frequently (for CAC or other variables).

Has anyone encountered a scenario like this before? If so, do you think the incorporation of threshold effects in this manner is justifiable? My inclination is “Yes” since it’s not a data-driven arbitrary categorization but is actually based on our understanding of the underlying pathophysiology of the disease, but I’d like to see what others think.

martinspn · May 20, 2024, 11:30am

I think it is ok. I’ve faced the same with time when I wanted to distinguish patients fim whom an interval was 0 from the effect of time among patients who had a delay. It’s in a way inducing a discontinuity for the predictor which is not a problem. I’ve seen examples in regression books – don’t recall where specifically but it was not an original idea of mine.

I am just not sure of whether there would be a potential benefit of adding an interaction between the spline and the categorical terms – this way, the zeroes wouldn’t affect the predictions for patients with positive values.

f2harrell · May 20, 2024, 11:35am

If the smooth fit has enough flexibility near zero, e.g., you used a spline function that has 2 knots not far from zero, you did this in an excellent way. You can get the same result using the rms package gTrans function as exemplified here. If the smooth fit does not have enough flexibility near zero, the discontinuous jump at zero could capture a general lack of fit, not just a zero-order discontinuity. The examples in the link given above relate to this.

Ahmed_Sayed · May 20, 2024, 3:02pm

Thanks Martin!

For the interaction part, I think the problem is that, by including that interaction term, you’d essentially be asking the model to estimate a smooth function of time for people who had a time of zero.

When showing predictions, I guess the way to do it would be to specify something like predict(model, data.frame(time = 100, delay = 1)) for those with a 100-minute delay and predict(model, data.frame(time = 0, delay = 0)) for those with no delay. I do get your point about ideally wanting something where you wouldn’t have to “manually” differentiate between the two of them though when using predict.

Ahmed_Sayed · May 20, 2024, 3:02pm

Thanks a lot!

I’d initially used penalized thin-plate splines as implemented in the mgcv package (default setting), which allowed me to avoid having to specify the knots myself as it performs eigen-decomposition to automatically pick some N knots capturing most of the variance. My rationale was that, in the model with smooth CAC only, mgcv will allocate enough degrees of freedom at lower CAC values to optimize fit (insofar as it was possible to optimize fit without allowing for a discontinuity).

After replacing the default thin-plate spline with a cubic spline (2 knots near 0 + additional knots along the CAC spectrum), the model with a binary indicator for present/absent CAC still came out on top. I still had the smoothing penalty turned on though.

The only way I could “get rid” (so to speak) of the binary indicator’s predictive value was to turn the smoothing penalty off (for CAC). I assume this then allows the model to be as flexible as need be near the lower-end of CAC as it frees itself of the smoothing penalty. The problem with that is the unpenalized models spend approximately ~6 or so additional degrees of freedom (as opposed to the penalized versions which have fewer effective degrees of freedom due to the smoothing penalty).

My takeaway was that it seems like, when sharp discontinuities are suspected apriori on the basis of disease physiology, it may be simpler (more efficient?) to estimate a penalized smooth function of the relevant predictor and add a binary indicator for the desired discontinuity than it is to use an unpenalized smooth function (as the latter will spend quite a few more effective degrees of freedom in order to “find” the threshold effect).

f2harrell · May 20, 2024, 11:55pm

Perhaps. But I think you’d learn something from using a restricted cubic spline with manually-chosen knot locations include 2-3 knots in the very low range.