I am planning to develop and internally validate a **diagnostic prediction model** from a **retrospective** cohort database. I will follow a cross-sectional study design, so I plan to fit a **logistic regression** model. My population consists of patients in intensive care units. In this dataset, **each patient undergoes multiple tests during their hospital stay**.

In brief, there are four scenarios regarding test results for each patient:

- All results are positive
- All results are negative
- Negative - Negative - … - Positive onwards
- Negative - … - Positive - … - Negative
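
To make the four patterns concrete, here is a minimal Python sketch that labels one patient's chronologically ordered results. The function name `classify_pattern` and the label strings are hypothetical, and it assumes results are already coded as booleans (True = positive). Sequences that start positive and later turn negative are not among the four scenarios listed, so the sketch puts them in the same catch-all group as the fourth scenario.

```python
from typing import Sequence


def classify_pattern(results: Sequence[bool]) -> str:
    """Label one patient's test results, ordered in time (True = positive)."""
    results = list(results)
    if not results:
        raise ValueError("need at least one test result")
    if all(results):
        return "all_positive"
    if not any(results):
        return "all_negative"
    # Index of the first positive result.
    first_pos = results.index(True)
    # Positive from the first positive result onwards.
    if all(results[first_pos:]):
        return "negative_then_positive"
    # At least one negative result after a positive one.
    return "reverted_or_other"


# Hypothetical example sequences, one per scenario.
examples = {
    "p1": [True, True, True],
    "p2": [False, False],
    "p3": [False, False, True, True],
    "p4": [False, True, False],
}
labels = {pid: classify_pattern(seq) for pid, seq in examples.items()}
```

Tabulating these labels across patients shows how common each pattern is before deciding how to model the repeated tests.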

I am unsure how to model this data structure appropriately in the development and internal validation of a diagnostic prediction model.

In the PROBAST article, they say:

“correct modeling methods are needed when each person can have more than 1 event, such as in a model of epilepsy seizure, where some persons have more than 2 seizures. Multilevel or random-effects (logistic or survival) modeling methods would be needed to avoid underestimation and bias in predictor effects.”

However, I could only find relevant articles for survival analyses.

Could anyone point me to some guidance on this issue? Thanks!

Multistate models can account for repeated measurements over time. Having said that, a few things to consider:

- If your test results are continuous variables, keep them in the models as continuous variables rather than dichotomising them as negative or positive. This will make your models more powerful.
- If patients all get measurements at similar times (e.g. when they first arrive in the ICU, then daily), consider multiple logistic regression models, one for each day. The models for the days after day 0 should include the prior measurements as variables.
- If the measurements are at more random times, then you could include as a variable the time from presentation to the ICU for each measurement. I can’t say I’ve tried that, but I would be interested in what some of the statistician gurus here think of it.
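
The "one model per day" idea could be sketched roughly as below. This is only an illustration under loud assumptions: the data are simulated (400 hypothetical patients, one continuous test value on each of days 0 to 2, and a single final diagnosis per patient), and `fit_logistic` is a small hand-written Newton-Raphson fitter rather than any particular package's routine. In practice you would use your usual statistical software.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, ridge=1e-6):
    """Fit a logistic regression by Newton-Raphson.

    Returns the coefficients with the intercept first. The tiny ridge
    term is only there to keep the linear solve numerically stable.
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
        w = p * (1.0 - p)
        grad = X1.T @ (y - p) - ridge * beta
        hess = (X1 * w[:, None]).T @ X1 + ridge * np.eye(X1.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


# Simulated data (an assumption for illustration, not real ICU data).
rng = np.random.default_rng(0)
n_patients, n_days = 400, 3
# Continuous test values, kept continuous; each day drifts from the last.
tests = rng.normal(size=(n_patients, n_days)).cumsum(axis=1)
# Final diagnosis is the dependent variable, one per patient.
true_logit = -0.5 + 0.8 * tests[:, -1]
diagnosis = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

# One model per day: the day-d model uses the measurement from day d
# plus all prior measurements (days 0..d-1) as predictors.
models = {}
for day in range(n_days):
    X = tests[:, : day + 1]
    models[day] = fit_logistic(X, diagnosis)
```

Here `models[0]` has an intercept and one coefficient, while `models[2]` has an intercept and three coefficients, one for each measurement available by day 2. Each patient contributes a single row to each model, so no patient is counted more than once within a model.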

P.S. I get confused by the terminology “diagnostic prediction model”. If the event has already occurred, then it is a diagnostic model (i.e. giving a probability of being correct if we say that someone has already had the event). To me, “predictions” are for prognostic models. (However, there are some things like AKI, defined by the observation of a rise in creatinine that occurs sometime after the fall in GFR. Models for these predict the change in creatinine (the surrogate), but with the hope that they give a diagnostic probability of an event that has already happened.)

Also I’d like to get more clarity on time zero and on my observation that test outputs (usually continuous, sometimes ordinal, often abused into a dichotomy) are usually *independent* variables with a final diagnosis being the *dependent* variable.