Choosing Knots for Splines Over Time

I’ve been using restricted cubic splines in my modeling work for quite a while now, and I usually follow Frank’s guidance on the number and placement of knots.

That said, I find things get trickier when splines are applied to time variables. In these cases, I tend to be more liberal with the number of knots, but I often catch myself second-guessing that choice.

Take the Sicily data example in Section 2.8 of Frank’s online book, which uses spline terms in time:

library(rms)

getHdata(sicily)
d    <- sicily
dd   <- datadist(d);  options(datadist='dd')
f   <- Glm(aces ~ offset(log(stdpop)) + rcs(time, 6), data=d, family='poisson')

Suppose we want to use AIC to guide the number of knots:

results <- data.frame(knots = integer(), AIC = numeric())

# Refit the model with 3 to 15 knots and record the AIC of each fit
for (k in 3:15) {
  model <- Glm(aces ~ offset(log(stdpop)) + rcs(time, k), data = d, family = poisson)
  results <- rbind(results, data.frame(knots = k, AIC = model$aic))
}

print(results)

   knots      AIC
1      3 738.8582
2      4 721.0086
3      5 717.3928
4      6 721.5237
5      7 691.9370
6      8 682.5756
7      9 684.5746
8     10 682.2653
9     11 655.3435
10    12 652.2580
11    13 644.5240
12    14 636.9575
13    15 642.1793

In this case, it takes 14 knots to minimize AIC over the range tried (3 to 15). I come across some version of this issue almost every time I work with splines of time. So my question: besides sample size/EPV considerations, are there any heuristics or guidance on how many knots to use when splining time?


Very nice work!

My experience comes mostly from independent observations, where AIC is an excellent measure. With time series there are dependencies that may not need to be taken into account for forming estimates (if one pre-specified the number of knots), but I’ll bet that the dependencies make AIC not as useful here. The same issue would come up with cross-validation: how would you inform CV of serial correlation? There is something called the moving block bootstrap that you may want to look into.
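For anyone who wants to see what the moving block bootstrap looks like mechanically, here is a minimal sketch. It is not a recommendation from the thread. Everything in it is an assumption made for illustration: the data are simulated (not the Sicily data), `mbb_indices`, `aic_for_k` and `pick_k` are hypothetical helper names, the block length of 6 is arbitrary, and base R's `splines::ns()` with knots at equally spaced quantiles stands in for `rms::rcs()`, whose default knot locations differ. It resamples blocks of consecutive rows and records which knot count AIC would pick in each resample, which gives a rough feel for how stable the AIC choice is.

```r
library(splines)  # ships with R; ns() builds a natural (restricted) cubic spline basis

# Moving block bootstrap: concatenate randomly chosen blocks of consecutive
# row positions, then trim to n indices.
mbb_indices <- function(n, block_len) {
  starts <- sample.int(n - block_len + 1, ceiling(n / block_len), replace = TRUE)
  idx <- unlist(lapply(starts, function(s) s:(s + block_len - 1)))
  idx[seq_len(n)]
}

# Simulated stand-in for a monthly count series with serially correlated noise
set.seed(1)
n    <- 60
time <- seq_len(n)
pop  <- rep(1e5, n)
ar   <- as.numeric(arima.sim(list(ar = 0.6), n = n, sd = 0.05))
y    <- rpois(n, pop * exp(log(2e-3) + 0.1 * sin(time / 8) + ar))
sim  <- data.frame(time, pop, y)

# AIC of a Poisson fit with k knots; knots are fixed from the full series
# so that every resample is fitted with the same basis.
aic_for_k <- function(dat, k, full_time) {
  q     <- quantile(full_time, seq(0, 1, length.out = k))
  inner <- unname(q[-c(1, k)])   # interior knots
  bnd   <- unname(q[c(1, k)])    # boundary knots
  fit <- glm(y ~ offset(log(pop)) + ns(time, knots = inner, Boundary.knots = bnd),
             family = poisson, data = dat)
  AIC(fit)
}

ks <- 3:10
pick_k <- function(dat) {
  ks[which.min(sapply(ks, function(k) aic_for_k(dat, k, sim$time)))]
}

k_full <- pick_k(sim)                                        # choice on the original series
k_boot <- replicate(200, pick_k(sim[mbb_indices(n, 6), ]))   # choice in each block resample
print(k_full)
print(table(k_boot))
```

Note that the AIC inside each resample is still computed from an independence likelihood, so this only shows how variable the selection is. It does not repair AIC for serial correlation.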

Thanks, Frank. That makes sense. So, if I understand correctly, your approach is to decide up front that you can spend at most x degrees of freedom on splining a time variable, and then spend them. How does one pre-specify the number of knots in this context? Does the usual advice apply (i.e., roughly 3 to 5 depending on sample size)?

You limit the number of knots by the effective sample size, which is very hard to compute for the Sicily example. Or use a more subjective approach that I probably used: start with 7 knots and work your way down until impossible-to-explain local humps disappear :slight_smile:
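A rough sketch of the "start at 7 and work down" idea, for readers who want to try it: overlay the fitted curves for 7, 6, 5, 4 and 3 knots and judge by eye which humps are believable. The assumptions are that the data are simulated (not the Sicily data) and that base R's `splines::ns()` stands in for `rms::rcs()`. With `df = k - 1`, `ns()` uses k knots in total (including the two boundary knots), but places them differently from the `rcs()` defaults.

```r
library(splines)

# Simulated monthly counts with a smooth underlying trend (illustration only)
set.seed(2)
n     <- 60
sim   <- data.frame(time = seq_len(n), pop = 1e5)
sim$y <- rpois(n, sim$pop * 2e-3 * exp(0.1 * sin(sim$time / 8)))

# Fit the same Poisson model with 7 down to 3 knots
ks   <- 7:3
fits <- lapply(ks, function(k) {
  glm(y ~ offset(log(pop)) + ns(time, df = k - 1), family = poisson, data = sim)
})

# Fitted rates per 100,000, one column per knot count
rates <- sapply(fits, function(f) fitted(f) / sim$pop * 1e5)

# Overlay the curves on the observed rates and look for unexplainable local humps
matplot(sim$time, rates, type = "l", lty = 1, col = seq_along(ks),
        xlab = "time", ylab = "fitted rate per 100,000")
points(sim$time, sim$y / sim$pop * 1e5, pch = 16, cex = 0.5)
legend("topleft", legend = paste(ks, "knots"), col = seq_along(ks), lty = 1, bty = "n")
```

The judgment about which humps are "impossible to explain" stays subjective. The plot only puts the candidate fits side by side.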