Using splines with time series data

MohanG · April 25, 2024, 11:55am

I need to fit a regression model of the form Y = Σₘ(Eₘ Hₘ), where m=1, 2, …, 36 indexes month.

Hₘ stands for hours in childcare in month m of life (known)
Eₘ measures the impact of an hour of childcare in month m on Y (to be estimated)

Domain knowledge suggests that Eₘ should vary smoothly as a function of m (rather than jumping around all over the shop). I am planning to incorporate this by modelling Eₘ as a 4-knot restricted cubic spline.

Am I right in thinking that this is an unusual use case and is not directly supported by rms? If so, my plan is to

a) obtain an orthonormal basis for the space of 4-knot restricted cubic splines on [1, 36], and then evaluate each basis function on the observed Hₘ.
b) feed the resulting values into a linear regression to obtain estimates of the coefficients multiplying each basis function.
c) Use these coefficients to construct the actual spline Eₘ

Any advice/shortcuts would be very welcome.

f2harrell · April 25, 2024, 12:22pm

I am seeing this sort of problem more often, for example when we have patient weights measured yearly for 5 years and want to estimate how the weight trajectory predicts future events, in a landmark analysis. You can fit data like yours by creating a tall and thin dataset with repeated Y, adding a time variable, and letting a spline of time interact with H. You’d have to use a cluster sandwich covariance estimator to get the right standard errors; the sandwich estimator recognizes duplicate information.

We need a way to do this in the original data format. The only thing that comes to mind right now is the custom code a model in Stan and fit a Bayesian model. I hope others have more ideas.

MohanG · April 25, 2024, 12:53pm

@f2harrell Thanks for the quick reply! I think I’m a bit confused, though. I want to minimise [Y - Σₘ(Eₘ Hₘ)]². If you are repeating Y and each row relates to a different month m, you are presumably minimising Σₘ[Y - Eₘ Hₘ]² ?

I’m sure I’ve got the wrong end of the stick… sorry to be slow…

f2harrell · April 25, 2024, 4:51pm

I don’t have an example, but I believe you can write time-varying coefficients as an interaction with time and the underlying variable.

MohanG · April 25, 2024, 5:54pm

In the end I ended up going with my orthonormal basis approach. In case anyone else wants to try it, it wasn’t very much code. (Hope it’s ok to paste here. NB I am new to R so am sure that can be made tidier, especially at the end!)

library(Hmisc)
library(pracma)
library(ggplot2)

xs <- 1:36

# Construct the orthonormal basis
ones <- rep(1, length(xs))
unnormalised_basis <- cbind(ones, rcspline.eval(xs, nk=4, inclx=TRUE))
basis <- gramSchmidt(unnormalised_basis)$Q

# Our model is
#     Y = Σₘ(Eₘ Hₘ) + (1 - λΣₘHₘ)(βF)
# where F ~ N(0, 1)
# and the Hs are modelled as sums of random step functions

random_hours <- function() {
  breakpoints <- sort(runif(3, min=9, max=42)) 
  stepfun(breakpoints, c(0, 50, 100, 150))(xs)
}

# Simulate hours/month and noise, and use regression to estimate E
estimate_E <- function(E, lambda, beta, noise_sigma, N) {
  # Create a df with one row for each child
  hours <- t(replicate(N, random_hours()))
  df <- data.frame(
    Hproj = hours %*% basis, 
    F = rnorm(N, 0, 1), 
    Y = hours %*% E + 
        (1 - lambda * rowSums(hours)) * beta * F + 
        rnorm(N, 0, noise_sigma)
  )
  
  coeffs <- lm(Y ~ Hproj.1 + Hproj.2 + Hproj.3 + Hproj.4 + F, data=df)$coefficients
  basis %*% c(coeffs['Hproj.1'], coeffs['Hproj.2'], coeffs['Hproj.3'], coeffs['Hproj.4'])
}

HOURS_PER_MONTH <- 24 * 7 * 31  # Upper bound
E <- basis %*% c(-8, 11, -3, 0) # True spline
lambda <- 200.0/HOURS_PER_MONTH
beta <- 200
noise_sigma <- 300
N <- 1000

estimated_Es <- replicate(5, estimate_E(E, lambda, beta, noise_sigma, N)[,1])

ggplot(mapping=aes(x = xs)) +
      ylim(-10, 10) +
      xlab("x") +
      ylab("y") +
      geom_line(aes(y = E, color="True Eₘ")) + 
      geom_line(aes(y = estimated_Es[,1], color="Estimated Eₘ (1)"), linetype='dashed') +
      geom_line(aes(y = estimated_Es[,2], color="Estimated Eₘ (2)"), linetype='dashed') +
      geom_line(aes(y = estimated_Es[,3], color="Estimated Eₘ (3)"), linetype='dashed') +
      geom_line(aes(y = estimated_Es[,4], color="Estimated Eₘ (4)"), linetype='dashed') +
      geom_line(aes(y = estimated_Es[,5], color="Estimated Eₘ (5)"), linetype='dashed')```

f2harrell · April 26, 2024, 3:07pm

I want to start with basics and see if what you developed is along the same lines. Suppose that the response variable is Y and that we are predicting Y from 4 previous measurements of some sort, with successive historical values of a, b, c, d. Suppose we have the model Y = \alpha + \beta_{1}a + \beta_{2}b + \beta_{3}c + \beta_{4} d with \beta_{i} = u + vi + wi^{2}. Then we have reduced 4 \beta to 3 parameters and the model can be re-written as

Y = \alpha + u[a + b + c + d] + v[a + 2b + 3c + 4d] + w[a + 4b + 9c + 16d]

and we can solve for \alpha, u, v, w using standard methods. u, v, w tell us the effect of the trajectory of a, b, c, d by giving us the weight of each historical value as a function of its measurement time.

MohanG · May 1, 2024, 1:12pm

@f2harrell So sorry for slow reply, notification didn’t arrive for some reason.

Yes, that formulation looks exactly right.