Calculating correlation between consecutive changes (dealing with spurious correlation)

Say I have a study with n individuals, and measure some outcome, Y, at three time points (baseline, 1 week and 4 weeks). I now calculate the relative changes Y1/Ybaseline and Y4/Y1. The correlation between these changes (corr(Y1/Ybaseline, Y4/Y1)) will be biased towards a negative value (-0.5 if the 3 measurements are independent) because of the shared term, Y1 (Spurious correlation of ratios - Wikipedia).
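The −0.5 figure is easy to check by simulation. A minimal sketch, using log-changes of i.i.d. lognormal measurements (on the log scale the −0.5 is exact in expectation; for raw ratios it is approximate):

```r
# Three mutually independent "measurements": because Y1 appears in both
# changes (numerator of one, denominator of the other), the successive
# (log-)changes are negatively correlated by construction.
set.seed(1)
n  <- 1e5
Yb <- rlnorm(n)  # baseline
Y1 <- rlnorm(n)  # week 1
Y4 <- rlnorm(n)  # week 4

# correlation of the two relative changes, on the log scale
r_spurious <- cor(log(Y1 / Yb), log(Y4 / Y1))
r_spurious  # close to -0.5
```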

Is there a way to “correct” for this spurious correlation?

The problem is similar to the spurious correlation between a baseline value and a subsequent change, for which a number of approaches have been suggested (Assessing the Relationship between the Baseline Value of a Continuous Variable and Subsequent Change Over Time).
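The baseline-vs-change version of the problem can be sketched the same way. With independent baseline and follow-up of equal variance, the theoretical correlation between baseline and change is −1/√2 ≈ −0.71, purely from the shared term:

```r
# Baseline and follow-up simulated as independent, equal-variance draws;
# the negative correlation with the change arises only because Y0
# appears (with opposite sign) in Y1 - Y0.
set.seed(2)
n  <- 1e5
Y0 <- rnorm(n)   # baseline
Y1 <- rnorm(n)   # follow-up, independent of baseline

r_baseline_change <- cor(Y0, Y1 - Y0)
r_baseline_change   # close to -1/sqrt(2) = -0.71
```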


i think in the literature it is referred to as mathematical coupling? maybe that’s a starting place to see what has been written about it(?)

also, i think people will suggest not using change scores (Frank Harrell gives a list of reasons, likely on his blog, i’ve lost the link), and just adjust for baseline


Thank you. Knowing the right words often reveals a previously hidden corner of the literature.

I found these discussions by Frank Harrell.

https://hbiostat.org/bbr/md/change.html#whats-wrong-with-change-in-general

Both interesting and relevant for the problem.

I am quite convinced that change scores should be avoided. However, a lot of published studies only report change scores, so it would be useful to minimize the statistical shortcomings when interpreting those results.
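For reference, the baseline-adjustment alternative mentioned above looks like this in a two-group setting. A sketch on simulated data; the variable and group names are made up for illustration:

```r
# Simulated two-arm study: treated group shifted by 5 at follow-up.
set.seed(3)
n     <- 200
group <- rep(c("control", "treated"), each = n / 2)
Y0    <- rnorm(n, 50, 10)
Y1    <- Y0 * 0.8 + (group == "treated") * 5 + rnorm(n, 0, 5)

# change-score analysis (the discouraged approach)
fit_change <- lm(I(Y1 - Y0) ~ group)

# ANCOVA-style analysis: model the follow-up value, adjust for baseline
fit_ancova <- lm(Y1 ~ Y0 + group)

coef(fit_ancova)["grouptreated"]  # baseline-adjusted treatment effect
```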


A regression approach

The goal with this is to see if the change from Ybaseline to Y1 can predict the change from Y1 to Y4. I have not found any examples of this in the literature (that are not biased in the ways described above). The general recommendation seems to be to avoid change scores if possible and otherwise to correct for baseline value.

A naive model could look like this:

lm(Y4 ~ 0 + Y1 + I(Y1 - Ybaseline))

Here the two IVs are correlated. This seems like a problem, but I’m not entirely sure.

Using the mean of Y1 and Ybaseline may be better:

lm(Y4 ~ 0 + I((Ybaseline + Y1)/2) + I(Y1 - Ybaseline))

Here the IVs are uncorrelated, and I(Y1 - Ybaseline) actually gets a coefficient of 0 if the three measurements (Y) are independent.

I’m not sure how to continue from here. I think I need a coefficient for I(Y1 - Ybaseline) that represents the estimated effect on the difference between Y4 and Y1 (or, alternatively, between Y4 and (Ybaseline + Y1)/2).
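The claim above — that I(Y1 - Ybaseline) gets a (near-)zero coefficient in the mean/difference parameterization when the three measurements are independent — can be checked directly. A sketch on simulated, purely illustrative data:

```r
# Three mutually independent measurements: the difference term should
# carry no information about Y4, so its coefficient should be ~0.
set.seed(4)
n         <- 1e4
Ybaseline <- rnorm(n, 5, 1)
Y1        <- rnorm(n, 5, 1)   # independent of baseline
Y4        <- rnorm(n, 5, 1)   # independent of both

fit <- lm(Y4 ~ 0 + I((Ybaseline + Y1) / 2) + I(Y1 - Ybaseline))
beta_diff <- coef(fit)[2]
beta_diff   # approximately 0
```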

Just use Y0 and Y1 to predict longitudinal data Y2 Y3 Y4. You’ll probably see in this landmark analysis that Y0 is ignorable.


Thank you. This seems to fit with the data, though it still feels a bit counterintuitive :) Does this allow me to make inference regarding the change between Y0 and Y1?

I have simulated some data:

n <- 1000
# true values
Y0 <- rnorm(n, 5, 1)

a <- rnorm(n, 0, 1) # change
Y1 <- Y0 + a/2
Y4 <- Y1 + a

# add independent random noise
Y0 <- Y0 + rnorm(n, 0, 0.2)
Y1 <- Y1 + rnorm(n, 0, 0.2)
Y4 <- Y4 + rnorm(n, 0, 0.2)

lm(Y4 ~ 0 + Y0 + Y1)

Call:
lm(formula = Y4 ~ 0 + Y0 + Y1)

Coefficients:
    Y0      Y1  
-1.301   2.302

The suggested analysis does that and more. It goes beyond the assumption that the change is meaningful (you’ll find that it isn’t) by estimating what linear combination of the two baselines you should have used. If, as is almost always the case, the most recent value is more important than the older measurement, you’ll see a larger absolute regression coefficient on the later measurement.


Hello all, it’s been quite some time since my last interaction with the datamethods community. I have come across the following puzzle and hope someone (especially @f2harrell) can illuminate the issue.

glimpse(data)
Rows: 201
Columns: 5
Impairment 81, 53, 66, 77, 64, 28, 34, 65, 55, 39, 41, 46, 76, 70, 64, 50, 6…
Change 31, 3, 24, 10, 52, 16, 19, 50, 21, 29, 21, 31, 0, 60, 11, 39, 44,…
MEP Positive, Negative, Positive, Positive, Positive, Positive, Posit…
Baseline 19, 47, 34, 23, 36, 72, 66, 35, 45, 61, 59, 54, 24, 30, 36, 50, 3…
FollowUp 50, 50, 58, 33, 88, 88, 85, 85, 66, 90, 80, 85, 24, 90, 47, 89, 7…

The variables Baseline and FollowUp were created as follows:

Baseline = 100 - Impairment
FollowUp = Baseline + Change

Why do the two models below lead to different conclusions?

m1 = lrm(Change ~ rcs(Impairment, 3)*MEP, data = data, x = TRUE, y = TRUE)
anova(m1, test = 'LR')

m2 = lrm(FollowUp ~ rcs(Baseline, 3)*MEP, data = data, x = TRUE, y = TRUE)
anova(m2, test = 'LR')

In m1, there is an association between Impairment and Change, between MEP and Change, and the interaction between Impairment and MEP is significant. However, in m2 there is an association only between Baseline and FollowUp, not between MEP and FollowUp, and the interaction between Baseline and MEP is not significant. Additionally, the R² in m1 is 0.12, whereas in m2 it is 0.77.

Since I first came across Frank Harrell’s material on why change-score analyses are not appropriate, I frown whenever I see one.

In the case above, would it be correct to conclude that the interaction found in m1 is artifactual?

Thanks!

I’m not sure about the big picture, but are you making the mistake of assuming that things we learn from linear models carry over to nonlinear ones?

The study I’m referring to performed least squares regression (change ~ impairment + mep). I decided to use lrm as this is an ordinal outcome (strength) and least squares does not fit the data well. I’m not sure I understand your question.

The study claimed that mep is informative because it adds to the explanation of recovery in strength. But my point is that perhaps the effect they found is artifactual, since when we model baseline and follow-up, mep is no longer important.

The latter model is the gold standard and makes the fewest assumptions when the model is nonlinear. For a linear model you can rescue a faulty model of Y=change by including baseline in X but that trick doesn’t work in general, e.g., for ordinal models.
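The linear-model equivalence can be sketched directly. Because FollowUp = Baseline + Change is an exact identity, the two OLS parameterizations share every coefficient except that the Baseline slope differs by exactly 1 (simulated data, hypothetical names, for illustration only):

```r
# FollowUp = Baseline + Change by construction, so regressing FollowUp
# vs. Change on the same covariates shifts only the Baseline slope, by 1.
set.seed(5)
n        <- 100
Baseline <- rnorm(n, 50, 10)
X        <- rnorm(n)
Change   <- 5 - 0.3 * Baseline + 2 * X + rnorm(n)
FollowUp <- Baseline + Change

b_fu <- coef(lm(FollowUp ~ Baseline + X))["Baseline"]
b_ch <- coef(lm(Change   ~ Baseline + X))["Baseline"]
b_fu - b_ch   # exactly 1, up to numerical error
```

No such algebraic identity holds for an ordinal model like lrm, which is why m1 and m2 can disagree.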
