I’m newly involved in a project where we have access to individual patient recruitment data from 300 trials (>100k patients), plus information on these trials. As most will know, many trials struggle with recruitment, so there’s a lot of interest in being able to predict recruitment duration.
There’s been some work on this, e.g., here, here, here, and here. The approaches seem suboptimal: they either assume a constant rate of accrual or discard a lot of information, e.g., by reducing the data to quarterly recruitment numbers.
The main question that interests us is: after how much information, e.g., 10%, 20%, …, 50%, does the prediction of the total recruitment time stop improving meaningfully? In other words, if a trial has recruited 10% of its target sample size, and one uses this information to make a prediction about how long the total recruitment is going to take, will this already be reasonably accurate, or does one have to wait until, e.g., 50%? I’m also aware that it might be “easy” to find such a threshold even though none truly exists, and I would like to avoid that.
Intuitively, I would approach this as follows:
- Put the data in long format, e.g., weekly data, where y is the number of participants recruited in that discrete time interval for that trial.
- Fit a flexible cumulative logit model, e.g.:
y ~ f(time) + n_centers + … + (1|trial_id)
- Use this model to predict the expected number of patients recruited over a dense time grid to create a curve → compare against the actually observed recruitment curve.
- Serial correlation / Markov structure might also be interesting to explore.
We could fit the same model (except perhaps for how time is modelled) once on the first 10% of the data, once on the first 20%, etc.
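The "refit at each landmark" idea can be sketched in miniature. The function below is a deliberately naive placeholder: instead of refitting the full model, it observes a trial until a given fraction of the target is recruited and extrapolates the average weekly rate so far (i.e., exactly the constant-accrual assumption criticized above), just to make the landmark mechanics explicit. The data are simulated:

```python
import numpy as np

def projected_total_weeks(weekly_counts, target_n, frac):
    """Naive landmark projection: observe recruitment until `frac` of
    `target_n` is reached, then extrapolate the mean weekly rate so far.
    A placeholder for refitting the full model at each landmark."""
    cum = np.cumsum(weekly_counts)
    landmark = np.searchsorted(cum, frac * target_n) + 1  # weeks observed
    rate = cum[landmark - 1] / landmark                   # mean recruits/week
    remaining = target_n - cum[landmark - 1]
    return landmark + remaining / rate

# Toy example: constant true rate of ~5/week, target 500 -> ~100 weeks total.
rng = np.random.default_rng(0)
counts = rng.poisson(5, size=200)
projections = {f: projected_total_weeks(counts, 500, f)
               for f in (0.1, 0.2, 0.5)}
```

In the real analysis, `projected_total_weeks` would be replaced by a prediction from the model refitted on the data available up to each landmark.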
I’ve never seen anyone do something similar, so there are surely many issues I’m missing, and this all seems rather inelegant…
I’m unsure how one should evaluate and compare the predictions from these models, especially as the underlying data are not the same. And of course the model predictions will get better with more data, so I’m not even sure how one would go about saying that they are no longer meaningfully better.
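One evaluation scheme that keeps the comparison paired is to compute, per trial, the error of the total-duration prediction at each landmark fraction, and then compare the error distributions across landmarks on the same set of trials. A simulated sketch (again using the naive constant-rate projection as a placeholder for model-based predictions, with trial-specific rates drawn from an assumed, arbitrary distribution):

```python
import numpy as np

rng = np.random.default_rng(2)
fracs = [0.1, 0.2, 0.3, 0.5]
errors = {f: [] for f in fracs}

for trial in range(200):                        # simulated trials
    rate = 2 + rng.gamma(4, 1.5)                # trial-specific true rate
    target = 400
    counts = rng.poisson(rate, size=1000)
    cum = np.cumsum(counts)
    true_total = np.searchsorted(cum, target) + 1
    for f in fracs:
        # naive landmark projection, paired across fractions per trial
        lm = np.searchsorted(cum, f * target) + 1
        pred = lm + (target - cum[lm - 1]) / (cum[lm - 1] / lm)
        errors[f].append(abs(pred - true_total))

mae = {f: float(np.mean(errors[f])) for f in fracs}
# One operational definition of "no meaningful improvement": the smallest
# landmark whose MAE is within some tolerance of the 50% landmark's MAE.
```

Because the errors are paired per trial, one can also look at within-trial differences between consecutive landmarks rather than marginal MAEs, which speaks more directly to the "does waiting longer still help?" question.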
Of course, there are various other issues. E.g., trials might have taken measures in response to slow recruitment, such as opening new centers, which complicates things further.
Would love to get some feedback on my thoughts. Thanks!