In the development of prediction models, assessing true external performance (predictive accuracy) in, e.g., temporally or geographically distinct cohorts is clearly important.
Typically, in prognostic biomarker studies, discrimination is the key metric presented, often based on a data-driven cut point (e.g. selecting the cut point that minimizes the log-rank p-value). In the rare cases where a true external validation cohort exists, only discrimination is tested in the validation set, and the cut point more often than not also separates the survival groups in the validation set. As often noted, this approach is flawed, and there is usually no mention of, e.g., cut point re-calibration or of calibration itself. Although studies employing this pipeline are often (rightly) called out on, e.g., Twitter for these mistakes, there is a glaring lack of practical guidance to put non-experts on the right track. Of course, this partly stems from the fact that there are no uniform solutions to these questions.
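To make the criticized pipeline concrete, here is a minimal sketch of the minimum p-value cut-point search, in Python with lifelines and synthetic data (the column names `marker`, `time`, `event` are hypothetical stand-ins, not from any particular study); the point is how readily the search yields an "optimal" cut point that cannot be expected to generalize:

```python
import numpy as np
import pandas as pd
from lifelines.statistics import logrank_test

# Synthetic stand-in for a discovery cohort; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "marker": rng.lognormal(size=200),
    "time": rng.exponential(5.0, size=200),
    "event": rng.binomial(1, 0.7, size=200),
})

# Scan candidate cut points over the inner percentiles of the marker and
# keep the one with the smallest log-rank p-value -- the data-driven
# "minimum p-value" approach, which overfits and inflates type I error.
results = []
for cut in np.percentile(df["marker"], np.arange(10, 91, 5)):
    high = df["marker"] > cut
    res = logrank_test(
        df.loc[high, "time"], df.loc[~high, "time"],
        event_observed_A=df.loc[high, "event"],
        event_observed_B=df.loc[~high, "event"],
    )
    results.append((cut, res.p_value))

best_cut, best_p = min(results, key=lambda r: r[1])
print(f"'optimal' cut point {best_cut:.2f}, log-rank p = {best_p:.3g}")
```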
For (retrospective) biomarker development studies with time-to-event data (for simplicity, let's limit this to single-marker studies motivated by a previous discovery, with a pre-specified set of clinical risk parameters), one will often want to "benchmark" the proposed marker against (more or less) established risk prediction models or against sets of individual risk factors. In this sense, there is, to my understanding, no model "development" as in prediction model development studies, and I am therefore unsure whether assessing model calibration in an external cohort and presenting calibration plots makes sense in this setting. Perhaps I am wrong, and the distinction is merely subtle? My coworkers, who are less interested in statistics than I am, think it is overkill and feel it is sufficient to present overlaid Kaplan-Meier curves, categorizing the validation cohort by the marker quartiles of the discovery cohort (Royston & Altman, BMC Med Res Methodol 2013), using only the single biomarker rather than a prognostic index.
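For what it's worth, my coworkers' proposal is straightforward to implement; a minimal sketch under the same assumptions as above (Python/lifelines, synthetic cohorts, hypothetical column names), taking the quartile cut points from the discovery cohort and applying them unchanged to the validation cohort:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
def synth(n):  # hypothetical cohorts; replace with real data
    return pd.DataFrame({
        "marker": rng.lognormal(size=n),
        "time": rng.exponential(5.0, size=n),
        "event": rng.binomial(1, 0.7, size=n),
    })
discovery, validation = synth(300), synth(200)

# Quartile cut points come from the *discovery* cohort and are applied
# unchanged to the validation cohort (cf. Royston & Altman, 2013).
cuts = np.percentile(discovery["marker"], [25, 50, 75])
groups = np.digitize(validation["marker"], cuts)  # 0..3 = Q1..Q4

fig, ax = plt.subplots()
kmf = KaplanMeierFitter()
for g in range(4):
    mask = groups == g
    kmf.fit(validation.loc[mask, "time"],
            validation.loc[mask, "event"], label=f"Q{g + 1}")
    kmf.plot_survival_function(ax=ax)
ax.set_xlabel("time")
ax.set_ylabel("survival probability")
plt.show()
```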
I have thus far not found clear guidelines for this, other than Altman (or was it Steyerberg?) suggesting that these kinds of studies probably should also report on calibration, alongside discrimination measures. Other guidelines, such as REMARK (2012), do not mention calibration. Isn't the absolute level of the marker at the individual participant level, either alone or in an additive model including, e.g., a staging marker, relevant in prognostic biomarker studies? Is calibration left out at this stage because of, e.g., problems with biomarker stability, differences between analytical platforms, etc., and hence difficulties with external validation, or could this simply be solved by pooling the cohorts and performing resampling with cohort affiliation as a covariate? Or is calibration simply not reported because it is poor, owing to small sample sizes, few registered events/endpoints, or residual ("geographical") confounding?
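In case it helps focus the discussion, here is one way I imagine calibration could be checked in this setting. This is only a sketch under the same assumptions as before (Python/lifelines, synthetic data, hypothetical column names, a single fixed time horizon), not a recommendation: fit a Cox model on the discovery cohort, predict survival at a fixed horizon in the validation cohort, and compare mean predicted survival with the observed Kaplan-Meier estimate within predicted-risk quartiles:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

rng = np.random.default_rng(2)
def synth(n):  # hypothetical cohorts; replace with real data
    marker = rng.lognormal(size=n)
    return pd.DataFrame({
        "marker": marker,
        "time": rng.exponential(5.0 / (1.0 + marker)),  # marker is prognostic
        "event": rng.binomial(1, 0.7, size=n),
    })
discovery, validation = synth(400), synth(250)

t0 = 2.0  # fixed prediction horizon (same units as 'time')

# Fit on the discovery cohort only, then predict survival at t0 for
# each validation participant.
cph = CoxPHFitter().fit(discovery, duration_col="time", event_col="event")
pred = cph.predict_survival_function(validation, times=[t0]).iloc[0]

# Within quartiles of predicted survival, compare the mean predicted
# survival with the observed Kaplan-Meier estimate at t0.
validation = validation.assign(pred=pred.to_numpy())
validation["grp"] = pd.qcut(validation["pred"], 4, labels=False)
for g, sub in validation.groupby("grp"):
    km = KaplanMeierFitter().fit(sub["time"], sub["event"])
    obs = km.survival_function_at_times(t0).iloc[0]
    print(f"risk group {g}: predicted S({t0}) = {sub['pred'].mean():.2f}, "
          f"observed (KM) = {obs:.2f}")
```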
Although this reads more as a discussion than a question, I would love to get some input on it.