Assessing generalizability in biomarker studies

When developing prediction models, assessing the model's true external performance (predictive accuracy), e.g. in temporally or geographically distinct cohorts, is clearly very important.

Typically, in prognostic biomarker studies, discrimination is the key metric presented, often with a data-driven cut point (e.g. selecting the cut point that minimizes the log-rank p-value). In the rare cases where a true external validation cohort exists, only discrimination is tested in the validation set, and more often than not the cut point also separates the survival groups there. As often noted, this approach is flawed, and there is no mention of e.g. cut point re-calibration or of calibration itself. Although studies employing this pipeline are often (rightly so) called out for their mistakes on e.g. Twitter, there is a glaring lack of tips to put non-experts on the right track. Of course, this partly stems from the fact that there are no uniform solutions to these questions.
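One way to see why the minimum p-value cut point is a problem is a small null simulation. The sketch below is only an illustration under assumptions of my own: the marker is pure noise, the sample size, event fraction and grid of candidate cut points are hypothetical, and the helper name min_p_null is made up. It uses survdiff from the survival package for the log-rank test.

```r
library(survival)
set.seed(3)

# Hypothetical helper: smallest log-rank p-value over a grid of cut points
# for a marker that is, by construction, unrelated to survival.
min_p_null <- function(n = 200) {
  marker <- rnorm(n)                 # pure noise
  time   <- rexp(n, rate = 0.1)      # survival times independent of the marker
  status <- rbinom(n, 1, 0.7)        # assumed: about 70% observed events
  cuts   <- quantile(marker, probs = seq(0.2, 0.8, by = 0.05))
  p <- sapply(cuts, function(cp) {
    grp <- marker > cp               # dichotomize at this candidate cut point
    lr  <- survdiff(Surv(time, status) ~ grp)
    pchisq(lr$chisq, df = 1, lower.tail = FALSE)
  })
  min(p)                             # the "optimal" cut point's p-value
}

min_p <- replicate(200, min_p_null())

# Share of null data sets in which the selected cut point looks "significant"
false_positive_rate <- mean(min_p < 0.05)
false_positive_rate
```

If the selected cut point were an honest test, this proportion would sit near 0.05. Because the cut point is chosen to make the groups look as different as possible, it should come out well above that, which is also why the same cut point often fails to hold up in a new cohort.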

For (retrospective) biomarker development studies (let’s limit it to single-marker studies performed due to a previous discovery, with a pre-selection of clinical risk parameters, for simplicity) with time to event data, one will often want to “benchmark” the proposed marker against (more or less) established risk prediction models or sets of individual risk factors. In this sense, there is to my understanding no model “development” like in prediction model development studies, and I am therefore unsure whether assessing model calibration in an external cohort and presenting calibration plots makes sense in this setting. Perhaps I am wrong, and the distinction is merely subtle? My coworkers, who are less interested in statistics than I am, think it is overkill and feel it’s sufficient to present overlaying Kaplan-Meier curves using quartiles of the discovery cohort to categorize the validation cohort (Royston & Altman BMC 2013) (using only the single biomarker, not prognostic indexes).

I have thus far not found clear guidelines for this, other than Altman (or was it Steyerberg?) suggesting that these kinds of studies probably should also report on calibration (alongside discrimination measures). Other guidelines, such as REMARK (2012), do not mention calibration. Isn’t the absolute level of the marker at the individual participant’s level, either alone or in an additive model including e.g. a staging marker, relevant in prognostic biomarker studies? Is this left out at this stage due to e.g. problems with biomarker stability, differences in analytical platforms etc., and hence difficulties with external validation, or could this simply be solved by pooling the cohorts and performing resampling with cohort affiliation as a covariate? Or is calibration simply not being reported because it is poor, due to low sample sizes, few registered events/endpoints, or residual (“geographical”) confounding?

Although this comes across more as a discussion than a question, I would love to get some input on this.

Greetings limitededition (note: I think you are expected to use real names on this forum).

I think calibration is very important. My process for assessing the added value of a biomarker to a current model (or simply over and above clinical predictors A, B, C etc.) is:

Baseline model:
Event ~ A + B + C
New model:
Event ~ A + B + C + Biomarker

Where the output is a prediction of the event, I look at various performance measures, including the log-likelihood of the overall models, Brier skill, delta AUC, and the Integrated Discrimination Improvement (separately for those with and without the event), and then at various graphical techniques to compare models, including calibration curves, risk assessment plots, and net benefit decision curves.
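A minimal sketch of the numerical part of this comparison, in base R, might look as follows. Everything about the data is an assumption for illustration: the predictors are simulated, the effect sizes are made up, and the helper functions brier and auc are hypothetical names written for this example. The measures are computed on the same data the models were fitted to, so they are optimistic and only show the mechanics.

```r
set.seed(1)
n <- 500

# Hypothetical simulated data: A, B, C are clinical predictors,
# Biomarker is the new marker (all effect sizes are made up)
d <- data.frame(A = rnorm(n), B = rnorm(n),
                C = rbinom(n, 1, 0.4), Biomarker = rnorm(n))
lp <- -1 + 0.8 * d$A + 0.5 * d$B + 0.4 * d$C + 0.7 * d$Biomarker
d$Event <- rbinom(n, 1, plogis(lp))

base_fit <- glm(Event ~ A + B + C, family = binomial, data = d)
new_fit  <- glm(Event ~ A + B + C + Biomarker, family = binomial, data = d)

# Brier score: mean squared difference between predicted risk and outcome
brier <- function(p, y) mean((p - y)^2)

# AUC via the Mann-Whitney rank formula (ties get average ranks)
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y  <- d$Event
p0 <- fitted(base_fit)   # predicted risks, baseline model
p1 <- fitted(new_fit)    # predicted risks, model with biomarker

brier_ref <- brier(mean(y), y)   # reference: predict the prevalence for everyone

summary_tab <- data.frame(
  model       = c("baseline", "new"),
  logLik      = c(as.numeric(logLik(base_fit)), as.numeric(logLik(new_fit))),
  brier_skill = 1 - c(brier(p0, y), brier(p1, y)) / brier_ref,
  auc         = c(auc(p0, y), auc(p1, y))
)
summary_tab

# IDI components, reported separately by event status
idi_events    <- mean(p1[y == 1]) - mean(p0[y == 1])  # want risks to go up
idi_nonevents <- mean(p0[y == 0]) - mean(p1[y == 0])  # want risks to go down
c(events = idi_events, nonevents = idi_nonevents)
```

The graphical comparisons (calibration curves, risk assessment plots, decision curves) are not covered here; they would be drawn from the same two vectors of predicted risks.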

Only when I have a grasp of the overall added benefit (if any) of the biomarker do I look at cut-points that may be used to aid clinical decision making.


Well put. I suggest focusing on the gold standard (log likelihood) and on quantifying the new information added by the biomarker through the proportional increase in the model likelihood ratio $\chi^2$ that comes from the log likelihood. This is explored here.
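As I read it, one way to write this out is the following. The notation is my own and only a sketch: $\ell$ denotes a maximized log likelihood, and the subscripts null, base and new (no predictors, clinical predictors only, clinical predictors plus biomarker) are labels introduced for this illustration.

```latex
\[
\begin{aligned}
\mathrm{LR}_{\text{base}} &= 2\,(\ell_{\text{base}} - \ell_{\text{null}}),
\qquad
\mathrm{LR}_{\text{new}} = 2\,(\ell_{\text{new}} - \ell_{\text{null}}) \\[4pt]
\text{adequacy of the baseline model} &= \frac{\mathrm{LR}_{\text{base}}}{\mathrm{LR}_{\text{new}}} \\[4pt]
\text{fraction of new information from the biomarker} &= 1 - \frac{\mathrm{LR}_{\text{base}}}{\mathrm{LR}_{\text{new}}}
= \frac{\mathrm{LR}_{\text{new}} - \mathrm{LR}_{\text{base}}}{\mathrm{LR}_{\text{new}}}
\end{aligned}
\]
```

With made-up numbers, $\mathrm{LR}_{\text{base}} = 60$ and $\mathrm{LR}_{\text{new}} = 80$ would give an adequacy of 0.75, i.e. a quarter of the explainable log likelihood in the combined model is attributable to the biomarker.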


The big advantage I have yet to see explicitly stated is that reporting variables as inputs to regression models makes it possible to meaningfully aggregate research studies from different reports, even if it is just a team or lab synthesizing their own work. This is more consistent with a mathematically rigorous, information theoretic view of experimental design.

If we had tables (similar to those in Frank’s post) that reported the info added in by each variable in each study, it would be possible to see which variables are relatively consistent across studies, and which might be worth following up on if the reports seem to disagree.

It would also help the particular research community come to a consensus on the most informative study design protocol over time.

Gene Glass, a pioneer in the application of meta-analysis in psychology (that later spread to medicine like a virus), had the following to say about its fruitful application in a more modern context. Keep in mind this was written way back in 2000: Meta-Analysis at 25

Meta-analysis needs to be replaced by archives of raw data that permit the construction of complex data landscapes that depict the relationships among independent, dependent and mediating variables. We wish to be able to answer the question, “What is the response of males ages 5-8 to ritalin at these dosage levels on attention, acting out and academic achievement after one, three, six and twelve months of treatment?” … We can move toward this vision of useful synthesized archives of research now if we simply re-orient our ideas about what we are doing when we do research. We are not testing grand theories, rather we are charting dosage-response curves for technological interventions under a variety of circumstances. We are not informing colleagues that our straw-person null hypothesis has been rejected at the .01 level, rather we are sharing data collected and reported according to some commonly accepted protocols. We aren’t publishing “studies,” rather we are contributing to data archives.

Donald Rubin wrote the following on meta-analysis in 1992(!):

Rubin, D. B. (1992). Meta-Analysis: Literature Synthesis or Effect-Size Surface Estimation? Journal of Educational Statistics, 17(4), 363–374.

I am less happy, however, with more esoteric statistical techniques and their implied objects of estimation (i.e., their estimands) which are tied to the conceptualization of average effect sizes, weighted or otherwise, in a population of studies. In contrast to these average effect sizes of literature synthesis, I believe that the proper estimand is an effect-size surface, which is a function only of scientifically relevant factors, and which can only be estimated by extrapolating a response surface of observed effect sizes to a region of ideal studies.

Information fusion is not a specialized research technique, but a critical tool for every working scientist. The inconsistent hodge-podge of Neyman-Pearson decision theory with Fisher’s information synthesis has led to a scientific literature that has more noise than signal on most topics due to the way results are analyzed and reported.


Thank you all for the help, this is such a great platform.

If we simplify John’s example, but instead use a survival object S, and then fit a baseline and a nested model, assuming no missing values, where A is a continuous composite (standard) risk score (this is another discussion):

fit0 <- cph(S ~ A)

fit1 <- cph(S ~ A + biomarker)

I can calculate the likelihood ratio $\chi^2$ statistic for each model separately, and assess any improvement in model fit using lrtest,


where I presume H0 is that there is no added value/improved fit of the nested model. Is that right?
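For concreteness, here is a minimal sketch of that comparison. To keep it runnable with only the survival package, it swaps in coxph for the cph calls above; with rms loaded, lrtest(fit0, fit1) on the two cph fits should carry out the same test. The data are simulated and every number in the simulation (sample size, hazard rates, effect sizes) is an assumption for illustration.

```r
library(survival)
set.seed(2)
n <- 400

# Hypothetical simulated cohort: A is the composite risk score,
# biomarker is the new marker; both affect the hazard here
A         <- rnorm(n)
biomarker <- rnorm(n)
t_event   <- rexp(n, rate = 0.1 * exp(0.6 * A + 0.5 * biomarker))
t_cens    <- rexp(n, rate = 0.05)            # independent censoring
time      <- pmin(t_event, t_cens)
status    <- as.integer(t_event <= t_cens)
S         <- Surv(time, status)

fit0 <- coxph(S ~ A)
fit1 <- coxph(S ~ A + biomarker)

# Model LR chi-square of each fit against the model with no covariates;
# fit$loglik holds the initial (null) and final partial log likelihoods
lr0 <- 2 * diff(fit0$loglik)
lr1 <- 2 * diff(fit1$loglik)

# LR test for the added term: H0 is that the biomarker coefficient is zero,
# i.e. the biomarker adds nothing once A is in the model
lr_added <- lr1 - lr0
p_added  <- pchisq(lr_added, df = 1, lower.tail = FALSE)

# Fraction of the combined model's LR chi-square that is new with the biomarker
frac_new <- 1 - lr0 / lr1

c(lr0 = lr0, lr1 = lr1, lr_added = lr_added, p = p_added, frac_new = frac_new)

anova(fit0, fit1)   # the same nested-model LR test as reported by survival
```

Note that fit0 is the model nested within fit1, so the test asks whether the larger model fits better than the smaller one.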

Does this then mean that evidence of an improved fit to the observed data (not overfitted, and using biologically meaningful markers) should also add value in predicting future events (as long as the final model calibrates well in “unseen” patients)? And in situations where the discrimination (say the c-index) does not improve although the LR test indicates otherwise, might that be explained by the c-index’s low sensitivity for detecting incremental predictive value?

Do you by any chance have the link to an article that discusses how the c-index, being rank based, treats extreme predictions equally to “50/50” predictions?

This, by the way, is so well put:

Without knowing it, many translational researchers, by dichotomizing new biomarkers, are requiring additional biomarkers to be measured to make up for the lost information in dichotomized markers.

Good questions. When you have a pre-specified model, the likelihood ratio $\chi^2$ test and relative $\chi^2$ can be used without worrying about overfitting very much. It’s when you go to make actual predictions that overfitting is a big issue. And you don’t need to feel compelled to explain to someone why $\Delta c$ is small, since $c$ is too insensitive for looking at $\Delta$. For clinical utility look at the change in distribution of predicted risks. I don’t have a paper about specifically the failure of $c$ to reward extreme predictions but you can see that from its definition and by drawing points for “concordance connections” for binary Y.
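The point about the definition can be checked with a toy example in base R. This is only a sketch: the outcomes and the two sets of predicted risks are made up, and c_index and loglik are hypothetical helper names. The two prediction vectors rank the subjects identically, but one hovers around 50/50 while the other is far more extreme.

```r
# Concordance for binary Y: the fraction of (event, non-event) pairs in which
# the event has the higher predicted risk; tied pairs count one half
c_index <- function(p, y) {
  diffs <- outer(p[y == 1], p[y == 0], "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Bernoulli log likelihood of the predicted risks
loglik <- function(p, y) sum(y * log(p) + (1 - y) * log(1 - p))

# Made-up outcomes and two sets of predictions with the same ordering
y       <- c(0, 0, 0, 1, 0, 1, 1, 1)
p_timid <- c(0.42, 0.44, 0.46, 0.48, 0.52, 0.54, 0.56, 0.58)  # all near 50/50
p_bold  <- c(0.02, 0.05, 0.10, 0.30, 0.70, 0.90, 0.95, 0.98)  # far more extreme

c_index(p_timid, y)   # 0.9375 (15 of 16 pairs concordant)
c_index(p_bold, y)    # 0.9375, identical because only the ranks matter

loglik(p_timid, y)
loglik(p_bold, y)     # higher (less negative): the log likelihood rewards
                      # the more extreme predictions here
```

Since c depends only on the ordering of the predictions, any change that moves risks toward 0 or 1 without reordering subjects leaves it untouched, whereas the log likelihood picks it up.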
