Rigor and Reproducibility in a Research Network


The purpose of this topic is to generate discussion and develop guidelines for instilling rigorous and reproducible research in a collaborative network in which advanced bioinformatics and biostatistics data analyses will be conducted. Some of the analyses are done centrally and some are done by methodologists at the clinical sites, using either whole-study or site-generated data. For high-dimensional data it will be assumed that the bioinformatics tools used in pre-processing have been well thought out and validated, and that possible batch effects will be studied and, where possible, accounted for in later analyses.

Data to be analyzed include both one-time assessments and complex longitudinal trajectories, and span clinical, molecular, imaging, and patient-oriented data. All the data are observational, and there are no clinical trials included in the program. In what follows, I use the term unsupervised learning (UL) to be synonymous with data reduction, both implying that outcome variables being predicted are not revealed to the analysis. Supervised learning (SL) means that ultimate patient outcomes are utilized during the analysis.

Analytic Tasks

There are at least three classes of analytic tasks to be undertaken:

  1. Traditional SL clinical prediction modeling based on raw or pre-processed data, without UL
  2. UL to reduce higher-dimensional data to lower dimensions as an end in itself (to be used in the future for a variety of patient outcomes). This includes clustering of patient trajectories of multiple dynamic patient descriptors, latent class analysis, principal components analysis, variable clustering, etc.
  3. UL whose results are planned to be used soon afterward to predict patient outcomes and differential treatment effects on patient outcomes. This is a UL-SL mixture.

Many of the analyses to be done involve complex multi-stage procedures whose discrete analytic components have been rigorously studied, but whose overall procedure’s performance has never been studied. That makes rigorous validation even more important.

Types of Validations

The most frequently used statistical validation procedure is split-sample validation, which is also the most problematic. The main problems with the approach are that it does not expose volatility/uncertainty in feature selection and pattern detection, and that it is unstable. In other words, if one splits the data 50-50 or 75-25 into training/test samples, develops a model on the training sample, and attempts to validate it on the remaining data (the test sample), one is often surprised at how much both the trained model and the validated predictive accuracy change upon starting over with a new split, or just interchanging the roles of training/test in a 50-50 split. In one example with n=17,000 and a binary Y with 0.3 incidence (death in the ICU), repeating 50-50 split-sample validation revealed substantive changes in the model coefficients and in the predictive discrimination (c-index; AUROC) obtained.
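This instability is easy to demonstrate on simulated data. The following sketch (simulated data and a plain logistic model, not the ICU example) repeats a 50-50 split many times and shows how much the "validated" c-index moves from split to split:

```python
# Illustration only: repeated 50-50 splits on synthetic data show how the
# test-sample AUROC (c-index) varies with the luck of the split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
aucs = []
for seed in range(20):                       # 20 different random splits
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5,
                                          random_state=seed)
    m = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    aucs.append(roc_auc_score(yte, m.predict_proba(Xte)[:, 1]))
print(round(min(aucs), 3), round(max(aucs), 3))  # spread across splits
```

The spread between the smallest and largest "validated" AUROC over the 20 splits is the instability being criticized here; the fitted coefficients vary similarly.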

Because the research network will not have a large enough sample for split-sample validation to work properly (the maximum sample size in any one analysis will be n=4,000), validation will need to take place using either k repeats of 10-fold cross-validation (CV) or 300-400 bootstrap resamples. For larger n, k as small as 10 may be sufficient, but for smaller n, k may need to be 100. An advantage of repeated 10-fold CV is that it provides unbiased estimates of model performance even under extreme circumstances where n < p, where p is the effective number of parameters estimated using SL. An advantage of the bootstrap is that the instability it reveals mimics that found in future samples, whereas CV understates volatility because any two training samples have 8/10 of the patients in common. A disadvantage is that the bootstrap understates overfitting in extreme situations (n < p).
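As a minimal sketch of the repeated-CV alternative (again with synthetic data and an off-the-shelf logistic model standing in for whatever SL procedure a site actually uses):

```python
# Sketch: k repeats of 10-fold CV, summarizing the c-index (AUROC) over
# all fold-repeats. Data and model are placeholders, not the network's.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="roc_auc", cv=cv)
print(round(scores.mean(), 3), round(scores.std(), 3))
```

The standard deviation of the scores across fold-repeats is itself informative: with 5-10 repeats it already gives an early warning when a procedure is too unstable to trust, per the point about k below.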

Resampling-based validation is a type of strong internal validation, and split-sample validation may be called a weaker form of internal validation. For resampling-based validation to be strong, all analytical steps that in any way benefitted the analysis must be repeated afresh at each resample. For example, if any outcome-data-based feature selection was done, the feature selection must be repeated from scratch for each resample. Comparing the set of features selected across resamples then directly assesses the stability/trustworthiness of the feature selection process.
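The "compare features across resamples" idea can be sketched as follows, assuming a simple univariate screening rule stands in for whatever outcome-based selection a site performs:

```python
# Sketch: repeat outcome-based feature screening inside every bootstrap
# resample and tally how often each feature is selected. The screening
# rule (top 5 by F-statistic) is illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)
B = 200                                   # bootstrap resamples
counts = np.zeros(X.shape[1])
for _ in range(B):
    idx = rng.integers(0, len(y), len(y))     # sample rows with replacement
    sel = SelectKBest(f_classif, k=5).fit(X[idx], y[idx])
    counts[sel.get_support()] += 1
print((counts / B).round(2))              # selection frequency per feature
```

Features selected in nearly every resample are trustworthy; features that flicker in and out reveal exactly the volatility that a single split-sample analysis hides.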

Besides choosing statistical validation procedures, there are vastly different forms the validations need to take:

  1. Standard validation of SL clinical prediction models involves using resampling to smoothly estimate overfitting-corrected calibration curves to gauge absolute predictive accuracy, and various indexes to evaluate predictive discrimination. The most clinically relevant measures of the latter are described here.
  2. Validation of structure, stability, and information capture in UL such as clustering patients. Validation of the meaning of derived patterns requires clinical thinking, so this post considers only statistical aspects of validation of found patterns.

For 2., statistical validations need to include the following.

  • Sturdiness/stability/volatility: Do almost the same patterns emerge over repeated resampling, starting the pattern discovery process over at each resample (over a series of random tenths of left-out data, or over repeated bootstrap samples with replacement)?
  • Homogeneity/information capture: When the pattern discovery process involves the creation of discrete patient classes, verify that the patterns are adequate summaries of the underlying measurements they represent. To qualify as a cluster, a cluster needs to be compact, so that there is no meaningful heterogeneity within it. This can be checked several ways when doing SL, including
    • Use the found classes to predict an important clinical outcome, then for each patient and each cluster compute a distance measure for the patient with respect to her cluster center. Validate that the predictive information in the discrete classes does not improve when adding distances from cluster centers.
    • Use the constituent raw data used to determine the classes to predict an important clinical outcome, and test to what degree they predict the outcome after adjusting for the found categories (clusters). A “chunk test” likelihood ratio \chi^2 test is one way to assess the adequacy of the classes, by testing the combined predictive contribution of the component variables.
    • Show that patients at the boundary of a cluster are more like the patients at the center of the cluster than they are like all the patients in another cluster.
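The first of these checks can be sketched concretely. In the toy code below (entirely simulated data; KMeans and a near-unpenalized logistic model are illustrative stand-ins for whatever clustering and outcome model are actually used), a likelihood ratio test asks whether distance from the cluster center adds outcome information beyond class membership:

```python
# Sketch: do distances from cluster centers add predictive information
# beyond the discrete classes? If yes, the classes are inadequate
# summaries. All data are simulated placeholders.
import numpy as np
from scipy.stats import chi2
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                 # patient descriptors
y = rng.binomial(1, 0.3, size=400)            # binary clinical outcome

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
classes = np.eye(3)[km.labels_][:, 1:]        # dummy-coded cluster membership
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

def loglik(Z):
    # near-unpenalized logistic fit (large C approximates the MLE)
    m = LogisticRegression(C=1e6, max_iter=1000).fit(Z, y)
    p = m.predict_proba(Z)[:, 1]
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

lr = 2 * (loglik(np.column_stack([classes, dist])) - loglik(classes))
pval = chi2.sf(max(lr, 0.0), df=1)
print("LR chi-square:", round(lr, 2), " p =", round(pval, 3))
```

A small, nonsignificant LR statistic supports the claim that the discrete classes capture the predictive information; a large one flags within-cluster heterogeneity. The chunk test in the second bullet works the same way, with the component raw variables replacing the single distance term.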


  • SL in developing and validating clinical prediction models is well worked out. When UL is involved there are fewer papers on validation and reproducibility. When UL is fed into SL there has been even less research.
  • When UL is undertaken such that the final result is fed into clinical outcome prediction, but the process of UL is nowhere informed by patient outcomes, is it necessary to take the UL steps into account when doing CV or bootstrap validation of SL clinical prediction? I.e. do UL and SL need to be validated simultaneously, or can this be staged?
  • Bootstrapping and repeated 10-fold CV are labor-intensive strong internal validation procedures. As stated above they require all analytic steps, especially SL ones, to be repeated afresh at each resample. For some problems, 100 repeats of 10-fold CV are required to obtain adequate precision in accuracy measures (even though 5-10 repeats may reveal that a procedure is too unstable to use). For sites doing complex computer-intensive analysis, the personnel and computer-time burden should be noted. Is a hybrid approach possible and will it be rigorous enough?

One hybrid validation approach is for a participating site to “learn as they go” using simple split-sample procedures, but to commit to recording every data-based decision they made along the way. These decisions must be programmable. Then at the end of development, a grand repeated 10-fold CV or bootstrap study is undertaken to document performance and stability of the site’s approach using strong internal validation. For each resample, all steps attempted along the way, especially when involving SL, would be repeated and the original split-sample analyses superseded by the rigorous internal validation.
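One way to make "every data-based decision must be programmable" concrete is to record each decision as a function, so the whole development sequence can be replayed from scratch inside every resample of the final grand validation. The sketch below is purely schematic; the step names and rules are invented for illustration:

```python
# Sketch of "programmable decisions": each data-based choice made during
# development is captured as a function and replayed afresh per resample.
# The screening rule and model here are hypothetical placeholders.
import numpy as np

def screen_features(X, y):
    # decision 1 (recorded during development): keep features whose
    # absolute correlation with the outcome exceeds 0.1
    keep = np.abs(np.corrcoef(X.T, y)[-1, :-1]) > 0.1
    return X[:, keep]

def fit_model(X, y):
    # decision 2: final model choice (ordinary least squares here)
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

recorded_pipeline = [screen_features]     # the site's logged decisions

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 10)), rng.normal(size=100)
for _ in range(5):                        # stand-in for the grand resampling
    idx = rng.integers(0, 100, 100)       # bootstrap with replacement
    Xb, yb = X[idx], y[idx]
    for step in recorded_pipeline:        # replay every decision from scratch
        Xb = step(Xb, yb)
    beta = fit_model(Xb, yb)
```

The point is structural: because each decision is a function of the data handed to it, the grand repeated-CV or bootstrap study can rerun the entire development path per resample, as strong internal validation requires.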

Is this the best hybrid approach and is it rigorous and reproducible? Are there strategies that will save some time at clinical sites (and at the coordinating center for centrally done analyses)? Are there comments and suggestions related to anything above?


Addressing just one very narrow issue in the above, specifically in strict SL:

Even strong internal validation based on resampling can be subject to the Picard-Cook Optimism Bias. Chatfield (1995) noted that if “a dataset from model A happens to have features which suggest model B, then the resampled data are also likely to indicate model B rather than the true model A”. Obtaining “more than one set of data, whenever possible, is a potentially more convincing way of overcoming model uncertainty and is needed anyway to determine the range of conditions under which a model is valid”.

A similar point was made in:

I am not saying that internal validation shouldn’t be done (for I have often done it myself, and it certainly makes the work more defensible), just that at some point, shoe leather will still be needed. My starting point for thinking about observational data is (Freedman, 1999).

I realize that these comments are not actionable with respect to your purposes. I offer them simply for perspective and to keep expectations modest.


Thank you, Christopher, for stating this so well — ‘actionable’ or not. In a similar vein, I would advise considering that this proposal looks almost diametrically opposed to Platt’s Strong Inference — as if everybody will just sit around all day pondering the internal consistency of their hypotheses (or worse, ‘validating’ them!), and never get around to going to the lab to disprove any of them.


Thanks for the excellent input, Chris. The quote above pertains to a different situation, at first glance. It implicitly assumes that “dataset A suggests model B” is used in the analysis, whereas with strong internal validation dataset A is repeatedly resampled in such a way that each resample can suggest a different model, and no action is taken from an implied overall analysis on A. If each resample suggests model B, there is neither instability nor model uncertainty, and the choice of model B to explain the data is solid.


It is only stable with respect to repeated resampling of the data in your possession, i.e., under constant re-use of the same data. What I think Chatfield was trying to say is that such repeated re-use of the same data will be influenced by the idiosyncrasies of that particular sample. Even with 4000 records, there could be certain combinations of covariates with each other and with the outcome that are sparsely represented, and these may not be obviously faithful to model A.

To operationalize my comment, perhaps the study you describe could be embedded within a learn-and-confirm framework, where “Phase II” is based on the strong internal validation you describe, while “Phase III” is an attempt at external validation, including the use of multiple lines of evidence. For instance, if the clinical prediction machine seems to claim that the combination of exposures M, N, and O at certain thresholds results in elevated risk of outcome P, a physiological study with animal models could be pursued to seek a plausible mechanism of action.

I would like to acknowledge my debt to Frank: I first learned of Chatfield (1995) because it was cited in the first edition of Regression Modeling Strategies :slight_smile: It’s not his fault that I push ideas >10% further than he may think is reasonable :wink:


Thanks Chris. You make an excellent point. It’s related to the fact that if you use a standard bootstrap to get a confidence interval on the mean from a small sample, the bootstrap inherits any weirdness of the small sample. When a truly external validation happens late in the research cycle, the external validation will properly penalize for out of sample variability. A useful analogy: R^{2}_\text{adj} penalizes for overfitting by subtracting p from the likelihood ratio \chi^2 for the model whereas AIC penalizes for both overfitting and out of sample variability (e.g., new design matrix) by subtracting 2p. For this discussion topic I mainly want to get across the point that split-sample validation doesn’t handle this problem in any way, and is a non-competitive form of internal validation. The bootstrap and repeated CV provide the lowest mean squared error estimates of likely future model performance when future patients can be thought of as coming from the same “stream” of patients as used to develop and internally validate the model. This is despite strong internal validation not handling the final step of validation that penalizes for study to study variability in patient samples.


Would this be the appropriate thread to discuss and post links to papers exploring the design of experiments (and observational studies) from both the Bayesian and Frequentist point of view? I think there is insight to be gained from looking at this from a formal and quantitative perspective.

This thread doesn’t deal with design of experiments per se; the only indirect connection I can think of is that small N is one of the biggest causes of non-reproducibility. So perhaps not.

Upon re-reading your original post, I will agree that design of experiments is beyond the scope of this thread. I would like to merely add 2 references that are worth study, which discuss the important statistical distinction between the activities of “explanation” vs. “prediction”, with the caveat that much of the educational materials are devoted to the former, which are confused with the latter.

  1. Galit Shmueli, “To Explain or to Predict?”, Statistical Science 25(3), 289-310 (August 2010)

My only quibble is the author’s appending the term “causal” to conventional use of associational statistical methods in various social science disciplines. This leads to the paradoxical notion of “causal models” failing in out-of-sample predictive tests. Regardless, the paper makes the important point that prediction modelling requires more information (i.e., effective sample size) to adequately handle various sources of uncertainty.

  2. Geisser, S. (1975). Predictivism and sample reuse. University of Minnesota. (PDF)

The fundamental thesis of this paper is that the inferential emphasis of Statistics, theory and concomitant methodology, has been misplaced. By this is meant that the preponderance of statistical analyses deals with problems which involve inferential statements concerning parameters. The view proposed here is that this stress should be diverted to statements about observables. …

This stress on parametric inference made fashionable by mathematical statisticians has been not only a comfortable posture but also a secure buttress for the preservation of the high esteem enjoyed by applied statisticians because exposure by actual observation in parametric estimation is rendered virtually impossible. Of course those who opt for predictive inference i.e. predicting observables or potential observables are at risk in that their predictions can be evaluated to a large extent by either further observation or by a sly client withholding a random portion of the data and privately assessing a statistician’s prediction procedures and perhaps concurrently his reputation. Therefore much may be at stake for those who adopt the predictivistic or observabilistic or aparametric view.

It should be noted that the popular method of cross-validation was initially studied by Seymour Geisser in the context of Bayesian predictive methods. Geisser was at least 20 years ahead of his time.


I agree that sample-splitting likely will result in biased estimation of model performance compared to repeated k-fold cross validation. When combined with UL over a sample of 4000, this may require cluster computing to be reasonable, depending on the number of repeats.

Going back to the original post, I find stability much more difficult to think about for UL than for SL. For a simple SL scenario, say one focused on a regression model coefficient, it’s straightforward to think about stability of parameter estimation. But for something like a clustering algorithm, is there a concrete definition of stability? The results will not be identical in each fold, so the quantification of stability will be subjective, although maybe one could propose some objective measures. One thing I’ve seen is to look at agreement statistics for comparing cluster assignment across algorithms. But I’m not sure how this handles the fact that the clusters are defined differently in each fold. Do you need to manually label clusters based on similarity first?

As a side question, is stability evaluated before or after (or both) determining hyperparameters? E.g., for clustering methods where the user must specify the number of clusters, is k-fold cross validation used to pick the number of clusters, and then a different k-fold cross validation used to evaluate stability when setting that number throughout all folds?

Ignoring the last (difficult) question, there may be a way to quantify the stability of cluster topologies, but the easier approach, as you alluded to, is to compute the frequency distribution of all the clusters that patient i was assigned to, for each i, then summarize these frequency distributions into an overall measure.
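One label-free version of this, sketched below on toy data, sidesteps the relabeling problem entirely: for each pair of patients, compute the fraction of resamples in which they are assigned to the same cluster, then summarize how consistent those pairwise decisions are. (KMeans with k=2 and the separation of the simulated groups are illustrative choices only.)

```python
# Sketch: pairwise co-assignment consistency as a cluster stability
# measure; no matching of cluster labels across resamples is needed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated simulated patient groups
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
n, B = len(X), 30
same = np.zeros((n, n))                      # times a pair clustered together
seen = np.zeros((n, n))                      # times a pair co-appeared

for _ in range(B):
    idx = np.unique(rng.integers(0, n, n))   # distinct ids from a bootstrap draw
    labels = KMeans(n_clusters=2, n_init=10,
                    random_state=0).fit_predict(X[idx])
    eq = labels[:, None] == labels[None, :]
    same[np.ix_(idx, idx)] += eq
    seen[np.ix_(idx, idx)] += 1

p = same[seen > 0] / seen[seen > 0]
consistency = np.maximum(p, 1 - p)           # 1 = always together or always apart
print(round(consistency.mean(), 3))
```

A mean consistency near 1 says the cluster topology is stable: every pair of patients is either (almost) always together or (almost) always apart, regardless of how the clusters happen to be labeled in each resample.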

When UL is undertaken such that the final result is fed into clinical outcome prediction, but the process of UL is nowhere informed by patient outcomes, is it necessary to take the UL steps into account when doing CV or bootstrap validation of SL clinical prediction? I.e. do UL and SL need to be validated simultaneously, or can this be staged?

Isn’t failure to do this step akin to ignoring measurement error? For clustering-based UL, most procedures like LCA provide probabilities of class membership, in addition to predicted classes. This makes what we’re considering sound like a measurement error problem. Ignoring the uncertainty in class membership inherent in UL could cause problems, especially if the goal is to understand which latent classes differ in terms of the outcome. See, e.g., Methods to Account for Uncertainty in Latent Class Assignments When Using Latent Classes as Predictors in Regression Models, with Application to Acculturation Strategy Measures - PMC. They suggest “a multiple imputation approach based on repeated imputations of the latent class based on the vector of their posterior probabilities of class assignment” could be a reasonable approach in many circumstances, though it is not computationally easier than resampling-based validation. Or perhaps something like the described EM approach is general enough to work in many cases?

Then again, if we are only interested in a valid measure of the accuracy of the model on new data, I think this issue is less pressing.


Fantastic points and I did not know that paper existed. Good work.

For this discussion topic I mainly want to get across the point that split-sample validation doesn’t handle this problem in any way, and is a non-competitive form of internal validation.

I certainly agree with this point, and realized this evening that Frank’s concern about making this point is not hypothetical. I just encountered the following paper that claims to “validate” a biomarker test for multiple sclerosis activity based on a single training/test set split (70%/30%). It was published this year.


I thought that the deprecation of such split-sample “validation”, in favor of resampling methods, was textbook material (e.g., James et al. chapter 5):


I found an interesting paper by Julian Faraway that describes a more nuanced perspective on the issue of split-sample vs. resampling validation, which I offer up for discussion.

From reading the paper, it seems as if there are more areas of agreement than disagreement with the opinions expressed in this thread, but I’m curious if the preferences for resampling might be modified slightly after reading this.

Faraway, J. J. (2016). Does data splitting improve prediction? Statistics and Computing, 26, 49-60. (link)

From the abstract:

We focus on the problem of constructing reliable predictive distributions for future observed values. We judge the predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high.

Faraway states that the use of resample methods is preferable when:

  1. The set of models to be considered is able to be fully specified.
  2. The set of models is small, relative to the amount of data.

He states that split sample methods are preferable when:

  1. The set of models to be considered cannot be fully specified ahead of seeing the data.
  2. The set of models is large, relative to the amount of data.
  3. Human judgement is needed for the modelling process.

He elaborates on various modelling alternatives:

A Bayesian approach that assigns priors to models as well as parameters is possible as in [10] but the approach becomes unworkable unless the space of models is small. In many cases, the space of models cannot be reasonably specified before analysing the data. Another idea is to use resampling methods to account for model uncertainty as in [11]. However, this method requires the model selection process to be pre-specified and automated. It also requires that these processes be implementable completely in software which excludes the possibility of human judgement in model selection.


These are very valuable thoughts. What is not mentioned is the minimum sample size for split sample validation to be stable. That will rule out many hoped-for applications of the split-sample approach. Also note that nowadays it is possible to specify more and more complex Bayesian models.
