Split-sample discovery/replication in absence of other alternatives

scboone · November 17, 2019, 3:56pm

Although I think most members of this community agree that split-sample validation is not an optimal strategy for the internal validation of prediction models, I was wondering what this community thinks of the split-sample approach in discovery/replication settings such as the following. I came across a study that used a novel biomarker panel to measure around 100 biomarker concentrations. The authors wanted to evaluate whether two related but different measures (M1 and M2) differed in their associations with these biomarkers. They performed roughly the following analyses:

Split sample into a discovery and replication set in a 2:1 ratio (2/3 discovery, 1/3 replication).
Linear regressions with each biomarker as the dependent variable and M1 or M2 as the independent variable, adjusting for a series of confounding variables. (That is, the regressions were run for each measure separately, never with M1 and M2 together in one model, giving 100 regressions for M1 and 100 for M2.)
Apply a false-discovery rate correction (FDR of <0.05) and carry any significantly associated biomarkers over to the replication set.
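
For concreteness, the selection and carry-over steps above can be sketched as follows. This is only a rough illustration under assumptions of mine: the post says "FDR correction" without naming a procedure, so Benjamini-Hochberg is assumed; the replication criterion (nominal p < 0.05 in the held-out third) is also assumed; and the p-values are hypothetical inputs standing in for the per-biomarker regression results.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q (BH step-up)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q,
    # then reject every hypothesis ranked 1..k.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])


def discovery_replication(p_discovery, p_replication, q=0.05, alpha_rep=0.05):
    """Carry BH-significant biomarkers forward and re-test them in the held-out set.

    Assumption: "replicated" means nominal p < alpha_rep in the replication set.
    """
    carried = benjamini_hochberg(p_discovery, q)
    replicated = [i for i in carried if p_replication[i] < alpha_rep]
    return carried, replicated


if __name__ == "__main__":
    # Hypothetical p-values for 10 biomarkers (not from any real study).
    p_disc = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    p_rep = [0.03, 0.20, 0.01, 0.50, 0.04, 0.70, 0.30, 0.90, 0.10, 0.60]
    carried, replicated = discovery_replication(p_disc, p_rep)
    print("carried to replication:", carried)
    print("replicated:", replicated)
```

Note that a biomarker has to clear two hurdles here, one in each subsample, which is where the loss of power comes from.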

The approach described here is certainly not unique and is used frequently in fields using -omics measurements (such as metabolomics and proteomics). I encounter similar studies frequently, but I only just started thinking about it. I know that ideally we would replicate these findings in a second cohort with similar measurements, but given the background of this biomarker panel I know for a fact that such a cohort is not available (yet). I wanted to know what your opinion is on the use of the split-sample design in this context and what you think is the best approach in this situation.

My own thoughts so far were:

The study was already comparatively underpowered (the ratio of participants to regressions is roughly 1:1). The split-sample design only exacerbates this problem, and smaller effects/associations are more likely to be missed (similar to the problems during prediction model development/validation).
The replication set is a random sample from the same study population as the discovery set and therefore not all peculiarities of the population are going to be weeded out (so associations that arose due to peculiarities of the sampling scheme for example will probably not be removed in the replication step, similar to how this approach does not give a realistic picture of the degree of overfitting that occurs in a prediction model).

So in effect, the same issues as those of prediction modelling with a split-sample design apply. I am interested in what your recommendations would be in this setting. Does the split-sample design have any added value in your view, or would you recommend, for example, just performing the ‘discovery-analyses’ in the entire study sample and leaving the replication for when a second cohort with similar measurements becomes available?

f2harrell · November 17, 2019, 4:25pm

I missed the sample size. I suspect it’s inadequate even for the discovery phase alone, without holding back any data. The strategy you described is likely to suffer from a very high false-negative probability.

Think more about ridge regression to not waste any of the sample information in trying to “name names”.
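
A minimal sketch of what ridge regression looks like mechanically, under assumptions that are not from this thread: all candidate features enter one model jointly, the columns of X are centered and scaled, y is centered (so there is no intercept), and the penalty `lam` is a hypothetical value that would in practice be chosen by something like cross-validation. Every feature keeps a shrunken coefficient instead of being selected or dropped.

```python
import random


def ridge_fit(X, y, lam):
    """Ridge coefficients for centered X (list of rows) and centered y.

    Solves (X'X + lam * I) beta = X'y by Gaussian elimination
    with partial pivoting. lam = 0 gives ordinary least squares.
    """
    n, p = len(X), len(X[0])
    # Augmented system [X'X + lam * I | X'y].
    A = [
        [sum(X[i][j] * X[i][k] for i in range(n)) + (lam if j == k else 0.0)
         for k in range(p)]
        + [sum(X[i][j] * y[i] for i in range(n))]
        for j in range(p)
    ]
    # Forward elimination.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution.
    beta = [0.0] * p
    for j in reversed(range(p)):
        s = A[j][p] - sum(A[j][k] * beta[k] for k in range(j + 1, p))
        beta[j] = s / A[j][j]
    return beta


def l2_norm(v):
    """Euclidean length of a coefficient vector."""
    return sum(b * b for b in v) ** 0.5


if __name__ == "__main__":
    random.seed(1)
    n, p = 100, 10
    # Hypothetical data: 10 roughly standardized features, two with real effects.
    X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [0.5 * row[0] - 0.3 * row[1] + random.gauss(0, 1) for row in X]
    for lam in (0.0, 10.0, 100.0):
        beta = ridge_fit(X, y, lam)
        print(lam, round(l2_norm(beta), 3))
```

The coefficient vector shrinks toward zero as `lam` grows, and all of the sample is used for one fit rather than being divided between a selection step and a confirmation step.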

scboone · November 17, 2019, 4:53pm

I left out the sample size because I didn’t intend to point to one specific study. In this case the total sample size was around 300 (which was split 2:1 as described).
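
With these numbers, the false-negative concern raised above can be roughed out with a back-of-the-envelope power calculation. The sketch below rests on several simplifying assumptions of mine: a single true association expressed as a correlation of 0.2 (a hypothetical effect size), the Fisher z approximation, no confounder adjustment, a Bonferroni-style threshold of 0.05/100 as a conservative stand-in for the FDR step, and nominal p < 0.05 as the replication criterion.

```python
from math import atanh, sqrt
from statistics import NormalDist

_norm = NormalDist()


def correlation_power(r, n, alpha):
    """Approximate two-sided power to detect a true correlation r
    with n subjects at significance level alpha (Fisher z approximation)."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)


def split_sample_power(r, n_total, n_tests, alpha=0.05, discovery_frac=2 / 3):
    """Probability that a true association passes discovery AND replication.

    Assumptions: discovery uses alpha / n_tests (Bonferroni-style stand-in
    for FDR), replication uses nominal alpha, and the two subsamples are
    independent, so the two probabilities multiply.
    """
    n_disc = round(n_total * discovery_frac)
    n_rep = n_total - n_disc
    p_disc = correlation_power(r, n_disc, alpha / n_tests)
    p_rep = correlation_power(r, n_rep, alpha)
    return p_disc * p_rep


if __name__ == "__main__":
    r, n_total, n_tests = 0.2, 300, 100  # r is a hypothetical effect size
    full = correlation_power(r, n_total, 0.05 / n_tests)
    split = split_sample_power(r, n_total, n_tests)
    print(f"full-sample discovery power: {full:.2f}")
    print(f"split-sample discovery-and-replication power: {split:.2f}")
```

Under these assumptions the joint probability for the 2:1 split comes out well below the power of a single analysis on all 300 participants, because the association has to survive a smaller discovery sample and then a second test on only 100.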

This approach is just something I often encounter. There are so many similar studies with the same combination of low sample size, split-sample approaches, lack of an external replication cohort, etc. that I was interested in your views on these approaches.

f2harrell · November 18, 2019, 1:55am

Split-sample validation requires a sample size in excess of 15,000 to work, i.e., to result in reliable predictive accuracy estimates and a stable list of features.

With n=300 it’s hard to analyze more than 1-2 candidate features. See this.