## Background

The purpose of this topic is to generate discussion and develop guidelines for instilling rigorous and reproducible research in a collaborative network in which advanced bioinformatics and biostatistics data analyses will be conducted. Some of the analyses are done centrally and some are done by methodologists at the clinical sites, using either whole-study or site-generated data. For high-dimensional data it will be assumed that the bioinformatics tools used in pre-processing have been well thought out and validated, and that possible batch effects will be studied and, where possible, accounted for in later analyses.

Data to be analyzed include both one-time assessments and complex longitudinal trajectories, and span clinical, molecular, imaging, and patient-oriented data. All the data are observational, and there are no clinical trials included in the program. In what follows, I use the term *unsupervised learning* (UL) to be synonymous with *data reduction*, both implying that outcome variables being predicted are not revealed to the analysis. *Supervised learning* (SL) means that ultimate patient outcomes are utilized during the analysis.

## Analytic Tasks

There are at least three classes of analytic tasks to be undertaken:

- Traditional SL clinical prediction modeling based on raw or pre-processed data, without UL
- UL to reduce higher-dimensional data to lower dimensions as an end in itself (to be used in the future for a variety of patient outcomes). This includes clustering of patient trajectories of multiple dynamic patient descriptors, latent class analysis, principal components analysis, variable clustering, etc.
- UL that is planned to be used soon afterwards to predict patient outcomes and differential treatment effects on patient outcomes. This is a UL-SL mixture.

Many of the analyses to be done involve complex multi-stage procedures whose discrete analytic components have been rigorously studied, but whose performance as a combined procedure has never been studied. That makes rigorous validation even more important.

## Types of Validations

The most frequently used statistical validation procedure is split-sample validation, which is also the most problematic. The main problems with the approach are that it does not expose volatility/uncertainty in feature selection and pattern detection, and that it is unstable. In other words, if one splits the data 50-50 or 75-25 into training/test samples, develops a model on the training sample, and attempts to validate it on the remaining data (test sample), one is often surprised by how much both the trained model and the validated predictive accuracy change upon starting over with a new split, or upon just interchanging the roles of training/test in a 50-50 split. In one example with n=17,000 and a binary Y with 0.3 incidence (death in the ICU), repeating 50-50 split-sample validation revealed substantive changes in the model coefficients and in the predictive discrimination (c-index; AUROC) obtained.
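The instability described above can be illustrated with a small simulation. The following is a minimal sketch using only the Python standard library. The simulated data, the toy mean-difference "model", and all function names are assumptions made for illustration; this is not the ICU example mentioned in the text.

```python
import random

def auroc(scores, labels):
    """c-index (AUROC): chance that a random event outscores a random
    non-event, counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def fit_mean_difference(X, y):
    """Toy stand-in for model fitting: one coefficient per feature,
    equal to the difference in class means."""
    n1 = sum(y)
    n0 = len(y) - n1
    coefs = []
    for j in range(len(X[0])):
        m1 = sum(row[j] for row, yi in zip(X, y) if yi == 1) / n1
        m0 = sum(row[j] for row, yi in zip(X, y) if yi == 0) / n0
        coefs.append(m1 - m0)
    return coefs

def predict(coefs, X):
    return [sum(c * v for c, v in zip(coefs, row)) for row in X]

def split_sample_aurocs(X, y, n_splits=20, seed=1):
    """Repeat a 50-50 split, also interchanging training/test roles."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    results = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        half = len(idx) // 2
        a, b = idx[:half], idx[half:]
        for train, test in ((a, b), (b, a)):
            coefs = fit_mean_difference([X[i] for i in train],
                                        [y[i] for i in train])
            scores = predict(coefs, [X[i] for i in test])
            results.append(auroc(scores, [y[i] for i in test]))
    return results

# Simulated data (an assumption): 0.3 incidence, one weak true predictor.
sim = random.Random(0)
n, p = 300, 5
y = [1 if sim.random() < 0.3 else 0 for _ in range(n)]
X = [[sim.gauss(0.4 * yi if j == 0 else 0.0, 1.0) for j in range(p)]
     for yi in y]

aucs = split_sample_aurocs(X, y)
print(f"test AUROC over {len(aucs)} fits: "
      f"min={min(aucs):.3f}, max={max(aucs):.3f}")
```

The spread between the minimum and maximum test AUROC is a direct view of how much a single split-sample result depends on the luck of the split.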

Because the research network will not have a large enough sample for split-sample validation to work properly (the maximum sample size in any one analysis will be n=4,000), validation will need to take place using either k repeats of 10-fold cross-validation (CV) or 300-400 bootstrap resamples. For larger n, k as small as 10 may be sufficient, but for smaller n, k may need to be 100. An advantage of repeated 10-fold CV is that it provides unbiased estimates of model performance even under extreme circumstances such as n < p, where p is the effective number of parameters estimated using SL. An advantage of the bootstrap is that the instability it reveals mimics that found in future samples, whereas CV understates volatility because two training samples have 8/10 of the patients in common. A disadvantage is that the bootstrap understates overfitting in extreme situations (n < p).
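To make the resampling schemes concrete, here is a minimal sketch of index generation for repeated 10-fold CV and for a bootstrap resample, using only the Python standard library. The function names and the choice of n are hypothetical. The sketch also checks the statement above that two training samples within one 10-fold CV share 8/10 of the patients.

```python
import random

def repeated_kfold(n, n_folds=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx); the data are reshuffled
    before each repeat so that folds differ across repeats."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f, test in enumerate(folds):
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield r, f, train, test

def bootstrap_sample(n, rng):
    """Draw n patient indexes with replacement."""
    return [rng.randrange(n) for _ in range(n)]

n = 1000
splits = list(repeated_kfold(n, n_folds=10, repeats=2))

# Two training sets from the same repeat each omit a different tenth,
# so they have 8/10 of all patients in common.
train_a, train_b = set(splits[0][2]), set(splits[1][2])
overlap = len(train_a & train_b) / n
print(overlap)  # 0.8

# A bootstrap resample contains duplicates; roughly 63% of patients
# appear at least once.
boot = bootstrap_sample(n, random.Random(1))
unique_fraction = len(set(boot)) / n
print(round(unique_fraction, 2))
```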

Resampling-based validation is a type of *strong internal validation*, and split-sample validation may be called a weaker form of internal validation. For resampling-based validation to be strong, all analytical steps that in any way benefitted the analysis must be repeated afresh at each resample. For example, if any outcome data-based feature selection was done, the feature selection must be repeated *from scratch* for each resample. Comparing the sets of features selected across resamples also directly assesses the stability/trustworthiness of the feature selection process.
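As an illustration of repeating feature selection from scratch and comparing the selected sets, the sketch below reruns a simple outcome-based selection rule on each bootstrap resample and tabulates how often each feature is chosen. The selection rule (largest absolute difference in class means), the simulated data, and the function names are assumptions chosen for brevity, not a recommended selection method.

```python
import random
from collections import Counter

def select_features(X, y, n_keep=2):
    """Outcome-based selection: keep the n_keep features with the largest
    absolute difference in class means. Assumes both classes are present."""
    n1 = sum(y)
    n0 = len(y) - n1

    def gap(j):
        m1 = sum(row[j] for row, yi in zip(X, y) if yi == 1) / n1
        m0 = sum(row[j] for row, yi in zip(X, y) if yi == 0) / n0
        return abs(m1 - m0)

    ranked = sorted(range(len(X[0])), key=gap, reverse=True)
    return frozenset(ranked[:n_keep])

def selection_frequencies(X, y, n_boot=200, seed=0):
    """Redo the selection on every bootstrap resample and return the
    fraction of resamples in which each feature was selected."""
    rng = random.Random(seed)
    n = len(y)
    counts = Counter()
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in rows]
        yb = [y[i] for i in rows]
        counts.update(select_features(Xb, yb))
    return {j: counts[j] / n_boot for j in range(len(X[0]))}

# Simulated data (an assumption): feature 0 is truly predictive,
# features 1-7 are pure noise.
sim = random.Random(42)
n, p = 150, 8
y = [1 if sim.random() < 0.5 else 0 for _ in range(n)]
X = [[sim.gauss(1.5 * yi if j == 0 else 0.0, 1.0) for j in range(p)]
     for yi in y]

freq = selection_frequencies(X, y)
for j, f in freq.items():
    print(f"feature {j}: selected in {f:.0%} of resamples")
```

A feature that is selected in nearly every resample is a stable finding; a second "winner" whose identity changes from resample to resample signals that the selection is not trustworthy.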

Besides choosing statistical validation procedures, there are vastly different forms the validations need to take:

- Standard validation of SL clinical prediction models involves using resampling to smoothly estimate overfitting-corrected calibration curves to gauge absolute predictive accuracy, and various indexes to evaluate predictive discrimination. The most clinically relevant measures of the latter are described here.
- Validation of structure, stability, and information capture in UL such as clustering patients. Validation of the *meaning* of derived patterns requires clinical thinking, so this post considers only statistical aspects of validation of found patterns.

For the second form (validation of UL), statistical validations need to include the following.

- Sturdiness/stability/volatility: Do almost the same patterns emerge upon repeated resampling, starting the pattern discovery process over at each sample (over a series of random tenths of left-out data, or over repeated bootstrap samples with replacement)?
- Homogeneity/information capture: When the pattern discovery process involves creation of discrete patient classes, verify that the patterns are adequate summaries of the underlying measurements they represent. To be a cluster, a cluster needs to be *compact* so that there is no meaningful heterogeneity in the cluster. This can be checked several ways when doing SL, including:
  - Use the found classes to predict an important clinical outcome, then for each patient and each cluster compute a distance measure for the patient with respect to her cluster center. Validate that the predictive information in the discrete classes does not improve when adding distances from cluster centers.
  - Use the constituent raw data used to determine the classes to predict an important clinical outcome, and test to what degree they predict the outcome after adjusting for the found categories (clusters). A “chunk test” likelihood ratio $\chi^2$ test is one way to test for adequacy of the classes by testing the combined predictive contribution of the component variables.
- Show that patients at the boundary of a cluster are more like the patients at the center of the cluster than they are like all the patients in another cluster.
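The chunk test mentioned in the list above can be written out as a comparison of two nested outcome models. The notation here is an assumption introduced for illustration: $C$ is the found class with $K$ levels, $X_1, \dots, X_q$ are the component variables used to form the classes (entered linearly for simplicity), and $g$ is the link function of the outcome model.

$$
\begin{aligned}
\text{Reduced model:}\quad & g\bigl(E[Y \mid C]\bigr) = \alpha + \sum_{k=2}^{K} \beta_k \,[C = k] \\
\text{Full model:}\quad & g\bigl(E[Y \mid C, X]\bigr) = \alpha + \sum_{k=2}^{K} \beta_k \,[C = k] + \sum_{j=1}^{q} \gamma_j X_j \\
H_0:\quad & \gamma_1 = \gamma_2 = \dots = \gamma_q = 0 \\
\text{LR } \chi^2 \;=\; & 2\left(\log L_{\text{full}} - \log L_{\text{reduced}}\right), \qquad \text{d.f.} = q
\end{aligned}
$$

A large $\chi^2$ relative to its $q$ degrees of freedom suggests that the component variables carry predictive information that the classes fail to capture. If the components are allowed to act nonlinearly (for example through spline terms), the degrees of freedom grow with the number of added parameters.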

## Challenges

- SL in developing and validating clinical prediction models is well worked out. When UL is involved there are fewer papers on validation and reproducibility. When UL is fed into SL there has been even less research.
- When UL is undertaken such that the final result is fed into clinical outcome prediction, but the process of UL is nowhere informed by patient outcomes, is it necessary to take the UL steps into account when doing CV or bootstrap validation of SL clinical prediction? I.e. do UL and SL need to be validated simultaneously, or can this be staged?
- Bootstrapping and repeated 10-fold CV are labor-intensive strong internal validation procedures. As stated above they require all analytic steps, especially SL ones, to be repeated afresh at each resample. For some problems, 100 repeats of 10-fold CV are required to obtain adequate precision in accuracy measures (even though 5-10 repeats may reveal that a procedure is too unstable to use). For sites doing complex computer-intensive analysis, the personnel and computer-time burden should be noted. Is a hybrid approach possible and will it be rigorous enough?

One hybrid validation approach is for a participating site to “learn as they go” using simple split-sample procedures, but to commit to recording every data-based decision they make along the way. These decisions must be programmable. Then at the end of development, a grand repeated 10-fold CV or bootstrap study is undertaken to document performance and stability of the site’s approach using strong internal validation. For each resample, all steps *attempted* along the way, especially those involving SL, would be repeated, and the original split-sample analyses would be superseded by the rigorous internal validation.
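One way to picture the final "grand" validation is to wrap every recorded decision in a single function that can be rerun on any resample. The sketch below does this with an optimism-corrected bootstrap, which is one common bootstrap variant and an assumption about how the bootstrap study would be organized. The `pipeline` function is a hypothetical stand-in for a site's programmed decisions, and the data are pure noise so that any apparent discrimination is overfitting.

```python
import random

def auroc(scores, labels):
    """c-index (AUROC), counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def pipeline(X, y):
    """Hypothetical record of all data-based decisions: pick the single
    feature and direction with the best training AUROC, and return a
    scoring function for new rows."""
    best = None
    for j in range(len(X[0])):
        a = auroc([row[j] for row in X], y)
        for sign, val in ((1, a), (-1, 1 - a)):
            if best is None or val > best[0]:
                best = (val, j, sign)
    _, j, sign = best
    return lambda rows: [sign * row[j] for row in rows]

def optimism_corrected(X, y, build, metric, n_boot=100, seed=0):
    """Rerun the whole pipeline on each bootstrap resample; optimism is the
    average of (performance on the resample) - (performance on the
    original data) for the model built on the resample."""
    rng = random.Random(seed)
    n = len(y)
    apparent = metric(build(X, y)(X), y)
    diffs = []
    while len(diffs) < n_boot:
        rows = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in rows], [y[i] for i in rows]
        if len(set(yb)) < 2:  # skip degenerate resamples with one class
            continue
        model = build(Xb, yb)
        diffs.append(metric(model(Xb), yb) - metric(model(X), y))
    optimism = sum(diffs) / n_boot
    return apparent, optimism, apparent - optimism

# Pure-noise data (an assumption): no feature is related to the outcome.
sim = random.Random(7)
n, p = 60, 20
y = [i % 2 for i in range(n)]
X = [[sim.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

apparent, optimism, corrected = optimism_corrected(X, y, pipeline, auroc)
print(f"apparent={apparent:.3f} optimism={optimism:.3f} "
      f"corrected={corrected:.3f}")
```

The key design point is that `build` receives only the resampled data, so any step that looked at outcomes during development is repeated inside each resample rather than carried over from the original analysis.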

Is this the best hybrid approach and is it rigorous and reproducible? Are there strategies that will save some time at clinical sites (and at the coordinating center for centrally done analyses)? Are there comments and suggestions related to anything above?