I have been tasked with developing a prognostic model. This area of medicine is unusual in that many “risk factors” for prognosis have been identified and replicated, but they have not been combined in a multivariable prognostic model using modern statistical best practices.

Prior papers have observed that an increase in a certain blood biomarker is associated with a higher risk of the condition. This is biologically plausible and has been replicated several times. However, the increase has consistently been defined as a dichotomisation: either a 10% or greater increase in the biomarker compared to a baseline value, or not. This has several obvious problems: 1) lower baseline values need to increase by a smaller absolute amount to reach the threshold, 2) the dichotomisation discards continuous information, 3) the dichotomisation magnifies the effect of measurement error, and 4) the blood samples are not necessarily taken at fixed time points, so a variable amount of time may elapse between the two (or more) measurements. There is strong reason to believe that both the absolute values of the measurements and the rate of change between them are important determinants of the subsequent risk of the condition.
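To make problems 1 and 3 concrete, here is a small simulation (all numbers hypothetical): the absolute increase needed to cross a “10% increase” cut-point scales with the baseline value, and modest assay error frequently misclassifies patients whose true change sits just below the threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Problem 1: the absolute increase needed to reach a ">=10% increase"
# threshold depends on the baseline value.
baseline = np.array([50.0, 100.0, 200.0])
print(0.10 * baseline)  # needed absolute increases: 5, 10, 20 units

# Problem 3: measurement error near the cut-point. Suppose the true change
# is 9% (below threshold) and each assay has an error SD of 2 units.
n = 10_000
true_baseline, true_change, error_sd = 100.0, 0.09, 2.0
m1 = true_baseline + rng.normal(0.0, error_sd, n)
m2 = true_baseline * (1.0 + true_change) + rng.normal(0.0, error_sd, n)
flagged = (m2 - m1) / m1 >= 0.10

# Roughly a third of patients with a true 9% change get dichotomised
# as ">=10% increase" purely through measurement noise.
print(round(flagged.mean(), 2))
```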

I’m having trouble finding literature on how this problem is optimally addressed. One intuitive proposal would be to include measurement_1 and measurement_2 as separate continuous variables with an interaction term between the two. However, measurement_1 and measurement_2 are likely to be very highly correlated, and I imagine this might cause problems. This solution also doesn’t address the variable interval of time between measurements. Another possibility would be to include measurement_1 and a composite variable “(measurement_2 - measurement_1) / elapsed_time_between_measurements”, but that also feels off somehow.

Is there any literature on the topic that could be linked to and discussed on this forum?

To start off the discussion, Repeatedly measured predictors: a comparison of methods for prediction modeling by Welten et al. describes the exact problem I am trying to articulate. I am currently reading this excellent paper and will try to restate my problem in terms of its methods once I have understood it properly.

Edit: After some great responses I have added a few papers for the forum’s consideration:

An overwhelming collection of references on the rigorous method of evaluating prediction models known as decision curve analysis can be found here:

Before developing a prediction model based upon the reports, I’d attempt to get access to the individual data and do an IPD meta-analysis.

I’d want to see if the markers were really informative across studies, or were just proxies for other variables not accounted for in the methods and discussion sections. You can’t really do that from summaries.

If you have funds and approval to do a study of your own, a Bayesian adaptive, sequential design could be used to 1) attempt replication of the initial findings and 2) explore other variables that have not been addressed adequately in the literature.

(I emphasize the importance of attempting to reproduce the published results, as Feynman did in his well known 1974 Caltech commencement address Cargo Cult Science).

I explained to her that it was necessary first to repeat in her laboratory the experiment of the other person–to do it under condition X to see if she could also get result A, and then change to Y and see if A changed. Then she would know that the real difference was the thing she thought she had under control.
She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time. This was in about 1947 or so, and it seems to have been the general policy then to not try to repeat psychological experiments, but only to change the conditions and see what happened.

Some other posts to consider:

Added: Skepticism regarding the value of changes.
Added: Bibliography on change score problems (Zotero Report).

Thank you for the references. I would of course adhere to best practices regarding the development of prognostic models, as detailed for instance in Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating by Ewout Steyerberg. I have read Vickers’ papers on decision curve analysis and have reported such analyses for other prognostic models I have worked on. I also agree with your advice to choose variables for multivariable prognostic models carefully rather than naively including every risk factor that has previously been reported. The data I will be working with come from a large population-based prospective cohort study and are well suited to developing a prognostic model.

What I am specifically aiming to discuss in this thread is how longitudinal predictors might best be included in such a prognostic model.

I misinterpreted your question. Here are a few additional links on longitudinal response models that I dug up. The first is a good introduction to the complexities of this type of analysis (which you are likely aware of).

A more recent paper (2021) with a Bayesian perspective, which I have only skimmed:

It’s an interesting question. I would consider how the model would be used in practice. If it is the rate of change that is predictive, is that obtainable in practice? And if data are needed shortly before the event occurs in order to predict it, does the model lose its utility? Otherwise, rate of change feels like a useful summary, but there would need to be some guidance about the distance between time points, regression to the mean, etc.

Good points! The clinical scenario is a slowly progressive disease with a long asymptomatic period and early elevations of certain biomarkers. The interesting part is that a certain subset of individuals never progress and could safely be discharged from monitoring. My idea is a two-step process: the first step uses only the first blood sample and will hopefully identify very high-risk individuals who require intensive monitoring. The rest would be offered a repeat measurement in 6–12 months, and with this additional data a second prediction model would hopefully identify very low-risk individuals who could be discharged from monitoring. There will, however, likely be individuals in clinical practice who already have both measurements available at the first clinical evaluation, and they could benefit from the added predictive ability of two measurements as long as it does not depend on the exact timing of the measurements. This is all still in the idea phase.

Elias’ original statement of the problem is not a longitudinal data analysis but is a multiple baseline measurement problem. And the typical way such data have been analyzed in the literature is an abomination. Very expensive studies have told us nothing because they either dichotomized markers or they assumed that change (simple subtraction) captures the variables without modeling the two baselines as separate covariates. Note that collinearity does not ruin the latter approach. Almost always one finds that the latest measurement dominates in the prediction of future Y, and the change is not very useful. To account for varying time gaps g between the two measurements x1 and x2 one can fit a model that contains f1(x1) + f2(x2) + f3(g) (e.g., spline functions; linear if small sample size) and include two-way interactions between the x’s and g.
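A minimal sketch of that specification on simulated data (all names and numbers hypothetical; plain linear terms stand in here for the spline transformations f1, f2, f3, which would be preferred with adequate sample size):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: two correlated baseline measurements and a variable gap
x1 = rng.normal(5.0, 1.0, n)        # first measurement
x2 = x1 + rng.normal(0.3, 0.5, n)   # second measurement, highly correlated with the first
g = rng.uniform(0.5, 12.0, n)       # months elapsed between the two measurements

# Simulated continuous outcome in which the latest measurement dominates
y = 0.2 * x1 + 1.0 * x2 + 0.05 * g - 0.02 * x2 * g + rng.normal(0.0, 0.5, n)

# Design matrix for f1(x1) + f2(x2) + f3(g) plus the two-way
# interactions of each measurement with the gap
X = np.column_stack([np.ones(n), x1, x2, g, x1 * g, x2 * g])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept", "x1", "x2", "g", "x1:g", "x2:g"], beta.round(2))))
```

Despite the strong correlation between x1 and x2, ordinary least squares still recovers the coefficients; collinearity widens the confidence intervals but does not bias the fit, which illustrates the point that collinearity does not ruin this approach.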

I think the joint modeling examples from @R_cubed might be interesting if there were more than just the two measurements. The {merlin} package in Stata/R can do some pretty neat things, like fitting a multivariate model for biomarker + time to event with the first/second derivative of the biomarker model entering as a predictor in the time-to-event model.

I want to suggest something, even if only to learn from others why it is problematic:

1. assume monoexponential growth/decay between x1 and x2;

2. calculate the slope of the line between log(x1) and log(x2);

3. calculate the area under the curve between x1 and x2 as (x1 - x2)/slope (using the slope from step 2) and use this as a single-number summary of x1, x2, and the gap;

4. take the absolute value only if the marker is, say, increasing.
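A small sketch of that summary as code (a hypothetical helper, not from any package). Note that (x1 - x2)/slope comes out negative for an increasing marker, which is presumably what the absolute value in step 4 is for; the version below returns the positive area directly via (x2 - x1)/slope:

```python
import math

def exp_auc_summary(x1, x2, gap):
    """Summarise (x1, x2, gap) assuming monoexponential growth/decay
    between the two measurements.

    Returns the per-unit-time slope on the log scale and the area under
    the assumed exponential curve over the gap (the log-trapezoidal
    rule familiar from pharmacokinetics)."""
    if math.isclose(x1, x2):
        return 0.0, x1 * gap             # flat curve: area is a rectangle
    slope = (math.log(x2) - math.log(x1)) / gap
    return slope, (x2 - x1) / slope      # = gap * (x2 - x1) / (log x2 - log x1)

# Decay from 100 to 50 over a 6-month gap
slope, auc = exp_auc_summary(100.0, 50.0, 6.0)
print(round(slope, 4), round(auc, 1))    # prints: -0.1155 432.8
```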

Exponential decay makes a lot of sense; it’s just nonlinear in the parameters, so it requires fitting a special nonlinear model. I wouldn’t use a summary measure like the AUC. Since we don’t know how the two measurements work together to predict the post-baseline outcomes, we probably have to keep the two baseline measurements as separate variables.