Best method to derive preliminary gene expression signature from a published set of prognostic genes (with Cox scores)

(1) Generate a scoring system for a set of published genes (Cox scores available) from gene expression data (fold changes) from a cohort of patients.
(2) Test is the scores associate with some defined clinical variables (e.g. disease severity, medication use, etc.)

Checking for insights on best method/strategies, especially to get feedback on my initial thoughts/plan outlined below:

  1. Use respective Cox scores to select/narrow gene list from the published literature, or from weighted scores for each gene (Cox.scores_geneA)x(Fold.changes_geneA) per patient?
  2. Determine putative signature based on statistical significance (? paired t-tests healthy vs diseased subjects’ gene expression data)
  3. Test whether the defined signature (or its composite score; Cox or weighted) correlates with specific clinical variables like medication use.

Questions: Are there established approaches in this case? Have you done similar analyses before?

P.S.- For each gene in the published set, we have the Cox score (from published literature) and fold changes (gene expression data), but we do not know how best to use this data along with the to derive and test plausible signatures in our cohort.

Thank you very much!

1 Like

You’ll find lots of studies that build prognostic gene signatures for diseases like cancer in the literature. Many of the studies I’ve seen suffer from issues like stepwise variable selection or inefficient validation strategies. Be on the lookout for these issues as you look through the literature.

From the description of your data, it sounds like you already have the Cox model since you have the \beta for each gene. You can use this model to compute the genetic risk score for each patient, and then use this risk score as an additional covariate in a Cox regression to assess the risk score’s contribution to the model’s predictive power.

In regards to narrowing down the signature, why are you trying to narrow down the number of genes in the risk score? Depending on the expression assay you plan on using in clinic, you may be able to capture all of the genes in your risk score - this would most likely yield better predictive power because you remove information by removing genes from the risk score.

If you still need to reduce the number of genes in the risk score (maybe you can only measure N genes in the clinic), I would first plot the fold changes vs log hazard ratios and see if a clear relationship exists. I would suspect the relationship would be relatively monotonic - as fold change increases, you’d expect the log hazard ratio to grow larger in magnitude. Before filtering by fold change, I would make sure you’re only considering genes with p-values/CIs that make you very certain they are differentially expressed. This should reduce the number of genes substantially and from there you can select the N most differentially expressed genes for inclusion.

A few warnings to think about:

  1. How were the hazard ratios estimated for each gene? If they were estimated from univariable models without adjusting for any covariates, I would be concerned that the estimates may be biased and not actually be as large (or as small) as they appear. It may be more accurate to identify genes with the largest fold change values and then compute hazard ratios for each gene together using a multivariable model estimated with penalized MLE (LASSO or Ridge regression).
  2. If you are using RNASeq data to measure expression, remember that read counts must be normalized to account for technical bias before they can be modeled with Cox regression. DESeq2 provides functions to normalize read counts.

Thank you very much @tomasbencomo. Even though I am novice at data analyses and biostats, your explanations and caveats make sense and are high-yield. To answer your questions,

  1. The harzard ratios were estimated by log-rank test and Cox regression modeling (Hoshida et al., NEJM 2008; doi 10.1056/NEJMoa0804525). The resulting signature was validated; I note the kind caution you make about inefficient validations in the literature. I would welcome your insights on the validation method used in the above cited study.
  2. To clarify, we have the β (which I called ‘Cox score’) for each gene as reported in the above study. Since we are applying these putative signatures on expression data from non-cancerous/pre-cancer tissue samples, could we still use the specific model described in the study to compute genetic risk scores as you suggest? If so, I presume we would collect and apply the specific covariates described in their model to obtain such scores. Is that so?
  3. I agree with you that it is best to use the entire gene set if possible. However, if I understood you about narrowing the list to say N genes, I could plot the fold changes vs log of β (from literature) per gene. Then, I would look for meaningful relationships before I decide to select the N most deferentially expressed genes for further inclusion in a Cox regression model. Is this right?
  4. We have both RNA-seq and micro array data from two different cohorts of the same disease/condition. We are hoping to get a preliminary signature from the published genes (biologically plausible) for a grant while our biostatisticians delve deeper to generate potentially new/independent signatures. Therefore, I will note your advise about ensuring we use appropriately normalized data.

Again, I cannot thank you enough for sharing your expertise and the resources along. This will help me move our work forward!


The study you referenced uses leave one out cross validation to internally validate their model. This validation strategy suffers from high bias and I’d recommend using bootstrap validation as described in chapter 5 of Frank Harrell’s Regression Modeling Strategies course notes. Bootstrap validation requires less samples and yields a less biased estimate of model performance. The R rms package provides a function, validate, that computes bootstrapped optimism estimates to evaluate internal model performance. Hoshida et al only seem to show p-values and Kaplan Meier plots to evaluate model performance – you should make sure to measure the discrimination (c-statistic etc) and calibration (calibration curve) of your model as well. Finally, remember that internal validation is no substitute for external validation. There are plenty of papers online that describe how to externally validate prediction models.

To answer your second point, yes as long as you think the \beta's apply to your prediction scenario you should be able to repurpose them for your own uses. You would use the gene signature \beta's to compute a risk score for each patient (simple Cox regression). This risk score can then be used as an additional covariate in your overall prediction model. Some people may recommend dichotomizing the genetic risk score into “low” and “high” risk groups – don’t do this, as dichotomizing the variable will throw away information. Leave the risk score as a continuous variable. See this post about quantifying the new information added by the genetic risk score.

You’re correct about identifying which genes to include in the model. In regards to the data sources, be careful about combining RNAseq with microarray data. The two are not directly comparable as microarray data has a limited dynamic range while RNASeq data does not. I have never tried combining these two types of data before – you should do a literature search to see if anyone has. If you can’t find anything about combining these data types, I would pick one and focus on that type of data. If you are planning on deploying the model in clinic, I’d make sure to pick the assay that will be used in the clinic (if you fit your model to RNASeq data but microarrays are used in clinic, your model may not be directly applicable or could suffer performance issues).


Can you expand on this? I’m surprised to hear loo-cv referred to as suffering from “high bias” given a fast approximation of it is the cornerstone of so much (seemingly successful) model selection work from Aki Vehtari et al/is really the default model model comparison method in brms and rstanarm. The linked course notes don’t seem to speak to loo-cv but seem generally supportive of cv methods (preferring bootstrap but I couldn’t find anything suggesting bias). Flipping through RMS, CV and BS seem to spoken of generally together and the Efron paper ( mostly seems to focus critique on variance (same as RMS).

1 Like

Thanks @timdisher for catching my error - I confused bias with variance. It’s my understanding that leave one out cross validation (LOOCV) is unbiased but suffers from high variance. The book version of RMS references the same paper and says that Efron finds k-fold cross validation to be more accurate than LOOCV, I believe due to LOOCV’s higher variance.

The main idea I was trying to convey is that bootstrapped optimism estimates like Efron’s .632 method perform better for smaller N – LOOCV and K-Fold CV will require more samples to achieve similar performance so it’s better to use bootstrap, especially when we are dealing with small sample sizes.

1 Like