Calculating correlation between consecutive changes (dealing with spurious correlation)

JohsEnevoldsen · June 14, 2021, 11:53am

Say I have a study with n individuals and measure some outcome, Y, at three time points (baseline, 1 week and 4 weeks). I now calculate the relative changes Y1/Ybaseline and Y4/Y1. The correlation between these changes, corr(Y1/Ybaseline, Y4/Y1), will be biased towards a negative value (approximately -0.5 if the three measurements are independent and equally variable) because of the shared term, Y1 (Spurious correlation of ratios - Wikipedia).
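The size of this bias can be checked with a small simulation. This is only a sketch under stated assumptions: the three measurements are drawn as mutually independent log-normal values, and the seed, sample size and distribution parameters are arbitrary choices that do not come from the study described above.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 1e5     # assumed sample size, large so the estimate is stable

# Three mutually independent, strictly positive measurements
# (log-normal with the same parameters; an assumption of this sketch)
Ybaseline <- rlnorm(n, meanlog = log(5), sdlog = 0.1)
Y1        <- rlnorm(n, meanlog = log(5), sdlog = 0.1)
Y4        <- rlnorm(n, meanlog = log(5), sdlog = 0.1)

# Correlation between consecutive relative changes (shared term: Y1)
r_ratio <- cor(Y1 / Ybaseline, Y4 / Y1)

# The same check for consecutive absolute changes
r_diff <- cor(Y1 - Ybaseline, Y4 - Y1)

print(c(ratio = r_ratio, difference = r_diff))
```

Both correlations should come out close to -0.5 even though no measurement carries any information about another, which is the spurious component in question.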

Is there a way to “correct” for this spurious correlation?

The problem is similar to the spurious correlation between a baseline value and a subsequent change, for which a number of approaches have been suggested (Assessing the Relationship between the Baseline Value of a Continuous Variable and Subsequent Change Over Time).

pmbrown · June 14, 2021, 1:36pm

I think the literature refers to this as mathematical coupling. That may be a starting place to see what has been written about it.

Also, I think people will suggest not using change scores (Frank Harrell gives a list of reasons, likely on his blog; I’ve lost the link) and just adjusting for baseline.

JohsEnevoldsen · June 15, 2021, 9:50am

Thank you. Knowing the right words often reveals a previously hidden corner of the literature.

I found these discussions by Frank Harrell.

https://hbiostat.org/bbr/md/change.html#whats-wrong-with-change-in-general

Both interesting and relevant for the problem.

I am quite convinced that change scores should be avoided. However, a lot of published studies only report change scores, so it would be relevant to attempt to minimize the statistical shortcomings when interpreting these results.

JohsEnevoldsen · June 23, 2021, 12:07pm

A regression approach

The goal with this is to see if the change from Ybaseline to Y1 can predict the change from Y1 to Y4. I have not found any examples of this in the literature (that are not biased in the ways described above). The general recommendation seems to be to avoid change scores if possible and otherwise to correct for baseline value.

A naive model could look like this:

lm(Y4 ~ 0 + Y1 + I(Y1 - Ybaseline))

Here the two IVs are correlated. This seems like a problem, but I’m not entirely sure.
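How strongly the two IVs are correlated can be estimated by simulation. The sketch below assumes that Ybaseline and Y1 are independent normal variables with equal variance; the seed, sample size, mean and SD are arbitrary assumptions. Under equal variances and a correlation rho between Ybaseline and Y1, the correlation between Y1 and Y1 - Ybaseline works out to sqrt((1 - rho)/2), so the independent case (rho = 0) is the most extreme one for positively correlated repeated measures.

```r
set.seed(2)  # arbitrary seed
n <- 1e5     # assumed sample size

# Independent measurements with equal variance (assumption of this sketch)
Ybaseline <- rnorm(n, mean = 5, sd = 1)
Y1        <- rnorm(n, mean = 5, sd = 1)

# Correlation between the two predictors of the naive model
r_iv <- cor(Y1, Y1 - Ybaseline)

# Theoretical value under independence and equal variance
print(c(simulated = r_iv, theoretical = 1 / sqrt(2)))
```

A correlation of this size inflates the standard errors of the two coefficients somewhat, but both coefficients remain estimable.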

Using the mean of Y1 and Ybaseline may be better:

lm(Y4 ~ 0 + I((Ybaseline + Y1)/2) + I(Y1 - Ybaseline))

Here the IVs are uncorrelated (provided Ybaseline and Y1 have equal variance), and I(Y1 - Ybaseline) gets a coefficient of approximately 0 if the three measurements (Y) are independent.
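Both properties of the mean-and-difference parameterization can be checked on simulated data. This is a minimal sketch, assuming three independent normal measurements with equal variance; the variable names `avg` and `chg`, the seed and the distribution parameters are illustrative choices, not part of the original analysis.

```r
set.seed(3)  # arbitrary seed
n <- 1e5     # assumed sample size

# Three mutually independent measurements with equal variance (assumption)
Ybaseline <- rnorm(n, mean = 5, sd = 1)
Y1        <- rnorm(n, mean = 5, sd = 1)
Y4        <- rnorm(n, mean = 5, sd = 1)

avg <- (Ybaseline + Y1) / 2  # mean of the first two measurements
chg <- Y1 - Ybaseline        # change from baseline to week 1

# Correlation between the two predictors
r_iv <- cor(avg, chg)

# Same model as above, written with the helper variables
fit <- lm(Y4 ~ 0 + avg + chg)

print(r_iv)
print(coef(fit))
```

With independent measurements, both `r_iv` and the coefficient on `chg` should be near zero, so a clearly non-zero change coefficient in real data would not be explained by the shared Y1 term alone.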

I’m not sure how to continue from here. I think I need a coefficient for I(Y1 - Ybaseline) that estimates its effect on the difference between Y4 and Y1 (or alternatively between Y4 and (Ybaseline + Y1)/2).

f2harrell · June 23, 2021, 12:24pm

Just use Y0 and Y1 to predict longitudinal data Y2, Y3, Y4. You’ll probably see in this landmark analysis that Y0 is ignorable.

JohsEnevoldsen · June 23, 2021, 1:01pm

Thank you. This seems to fit with the data, though it still feels a bit counterintuitive. Does this allow me to make inferences regarding the change between Y0 and Y1?

I have simulated some data:

n <- 1000
# true values
Y0 <- rnorm(n, 5, 1)

a <- rnorm(n, 0, 1) # change
Y1 <- Y0 + a/2
Y4 <- Y1 + a

# add independent random noise
Y0 <- Y0 + rnorm(n, 0, 0.2)
Y1 <- Y1 + rnorm(n, 0, 0.2)
Y4 <- Y4 + rnorm(n, 0, 0.2)

lm(Y4 ~ 0 + Y0 + Y1)

Call:
lm(formula = Y4 ~ 0 + Y0 + Y1)

Coefficients:
    Y0      Y1  
-1.301   2.302
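One way to read these coefficients in terms of change: a model on Y0 and Y1 and a model on Y1 and the change Y1 - Y0 span the same predictor space, so they give identical fitted values and their coefficients map onto each other. The sketch below repeats the same data-generating process with an arbitrary seed (so the numbers will differ from the output above); the names `Y0_true`, `fit_levels` and `fit_change` are illustrative.

```r
set.seed(4)  # arbitrary seed; results differ from the unseeded run above
n <- 1000

# Same data-generating process as the simulation above
Y0_true <- rnorm(n, 5, 1)  # true baseline values
a       <- rnorm(n, 0, 1)  # individual change component

# Observed values = true values + independent measurement noise
Y0 <- Y0_true           + rnorm(n, 0, 0.2)
Y1 <- Y0_true + a / 2   + rnorm(n, 0, 0.2)
Y4 <- Y0_true + 1.5 * a + rnorm(n, 0, 0.2)

fit_levels <- lm(Y4 ~ 0 + Y0 + Y1)           # levels parameterization
fit_change <- lm(Y4 ~ 0 + Y1 + I(Y1 - Y0))   # change parameterization

b <- coef(fit_levels)
g <- coef(fit_change)

# The two fits are the same model in different coordinates
print(all.equal(fitted(fit_levels), fitted(fit_change)))

# b0*Y0 + b1*Y1 = (b0 + b1)*Y1 + (-b0)*(Y1 - Y0)
print(rbind(from_levels = c(b[["Y0"]] + b[["Y1"]], -b[["Y0"]]),
            from_change = unname(g)))
```

Under this mapping, the negative coefficient on Y0 is the coefficient on the change Y1 - Y0 with its sign flipped, conditional on Y1.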

f2harrell · June 23, 2021, 2:42pm

The suggested analysis does that and more. It goes beyond the assumption that the change is meaningful (you’ll find that it isn’t) by estimating what linear combination of the two baselines you should have used. If, as is almost always the case, the most recent value is more important than the older measurement, you’ll see a larger absolute regression coefficient on the later measurement.