Question about truncating values

Amo89tw · August 6, 2019, 3:50pm

Hi everyone,

I am reading Harrell’s book, p. 74: “…it is beneficial to truncate measurements where the data density ends. In one dataset of 4,000 patients and 2,000 deaths, white blood count ranged from 500 to 100,000 with .05 and .95 quantiles of 2,755 and 26,700, respectively. Predictions from a linear spline function of WBC were sensitive to WBC > 60,000, for which there are 16 patients. There were 46 patients with WBC > 40,000. Predictions were found to be more stable when WBC was truncated at 40,000, that is, setting WBC to 40,000 if WBC > 40,000.”
I want to choose the number of knots by comparing AIC and the fitted trajectories, but I am not sure whether I should truncate the data first or fit the model first. My current plan is:
Step 1. truncate the data to 40,000
Step 2. fit the model with k=3:7 and check the AIC to decide the knots needed
Step 3. plot predicted values using truncated data (or the original data?)
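
The three steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the data are simulated (not the dataset from the book), the outcome `y` and the names `cap`, `fit_k`, and `grid` are made up for the example, and `splines::ns()` stands in for a restricted cubic spline (it places its boundary knots at the range of the data, which differs from the default knot placement of `rcs()` in rms).

```r
library(splines)  # part of the standard R distribution

set.seed(1)
n <- 4000
# Simulated right-skewed predictor and outcome; NOT the data from the book
d <- data.frame(wbc = pmin(pmax(rlnorm(n, log(9000), 0.7), 500), 100000))
d$y <- log(pmin(d$wbc, 40000)) + rnorm(n)

cap <- 40000  # truncation point quoted from the book

# Steps 1-2: truncate inside the formula, fit k = 3..7 knots, compare AIC.
# With ns(), df = k - 1 gives k knots in total (k - 2 interior + 2 boundary).
fit_k <- function(k) lm(y ~ ns(pmin(wbc, cap), df = k - 1), data = d)
ks    <- 3:7
aics  <- setNames(sapply(ks, function(k) AIC(fit_k(k))), ks)
best_k <- ks[which.min(aics)]

# Step 3: predict on the original WBC scale; pmin() in the formula is
# re-applied to new data, so the curve is flat above the cap.
fit  <- fit_k(best_k)
grid <- data.frame(wbc = seq(500, 100000, length.out = 200))
grid$pred <- predict(fit, newdata = grid)
```

Writing the truncation as `pmin(wbc, cap)` inside the formula means the same rule is applied when fitting and when predicting, so the plot can use the original WBC axis, e.g. `plot(grid$wbc, grid$pred, type = "l")`.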

Any advice and suggestions will be greatly appreciated.

f2harrell · August 6, 2019, 7:19pm

Good question, because I didn’t go into much detail in the book. The answer will depend on the nature of the variable. For variables that are highly skewed, I often remove the ability of the heavy right tail to have much impact by fitting a restricted cubic spline in the cube root of the variable, e.g. in R:

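require(rms)
# datadist stores predictor summaries that Predict() uses for its defaults
dd <- datadist(d); options(datadist='dd')
cr <- function(x) x ^ (1/3)   # cube root compresses the heavy right tail
# restricted cubic spline with 4 knots in cr(x), adjusted for x2
f <- ols(y ~ rcs(cr(x), 4) + x2, data=d)
ggplot(Predict(f))            # plot the predicted values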

Amo89tw · August 6, 2019, 10:53pm

Thank you @f2harrell ! Can I check that I understood you correctly? You mean it is better to apply a transformation instead of truncating?

f2harrell · August 7, 2019, 12:04pm

Yes, most of the time. If you do not want to pre-transform before splining, then truncation may be necessary.
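
To make the contrast between the two options concrete, here is a small sketch that applies both to the WBC values quoted from the book. The table name `tab` is just for illustration. The cube root maps the sparse range from 40,000 to 100,000 onto roughly 34 to 46, so the tail is compressed but still ordered, whereas truncation collapses everything above 40,000 to a single value.

```r
cr  <- function(x) x^(1/3)   # same cube-root transform as in the reply above

# WBC values mentioned in the quoted passage (range, quantiles, cut points)
wbc <- c(500, 2755, 26700, 40000, 60000, 100000)

tab <- data.frame(
  wbc       = wbc,
  cube_root = round(cr(wbc), 1),   # tail compressed, order preserved
  truncated = pmin(wbc, 40000)     # tail collapsed to the cap
)
print(tab)
```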