That’s a reasonable interpretation though I think you might sometimes reach the opposite conclusion. Better would be to use 4.1.2 after doing chunk tests of groups of collinear predictors. (But what to do with the results? Perhaps use sparse PCA instead of all this?).
Thank you for the reply Professor, much appreciated!
Hi Prof Harrell,
I want to confirm my understanding about RMS 4.12.1 Developing Predictive Models.
In this section, you said that we should use a single p-value to test the entire group of predictors, the nonlinear terms, or the interaction terms.
For example, if a single predictor has a p-value > 0.05, we should NOT delete it; we need to check the global p-value. For an interaction term, we should look at the global interaction p-value: if the global p < 0.05, we should NOT delete any interaction term, even if an individual interaction term has p > 0.05.
(Prerequisite: variables and interactions are pre-specified; that is, if there is a physiological mechanism for an interaction between two variables, that interaction is included.)
I have marked my understanding in bold at the end of your notes. In summary, we should not exclude any interaction or predictor just because its p > 0.05; we should focus on the global p-value.
Can do highly structured testing to simplify “initial” model:
- Test entire group of predictors with a single p-value (⇒ TOTAL p-value in ANOVA test)
- Make each continuous predictor have the same number of knots, and select the number that optimizes AIC
- Test the combined effects of all nonlinear terms with a single p-value (⇒ TOTAL NONLINEAR p-value in ANOVA test)
- Check additivity assumptions by testing pre-specified interaction terms. Use a global test and either keep all or delete all interactions (⇒ TOTAL INTERACTION p-value in ANOVA test)
Is my understanding correct? Thank you very much.
Your understanding is correct up to a point. There is a serious question of whether you should ever remove part of the model just because it’s “non-significant”, which actually means very little in any context. What is particularly bad though is removing part of a variable that spans multiple parameters, i.e., removing a single “insignificant” parameter that is connected to other parameters in the model.
With regard to interactions, since they are so hard to estimate we need to move to a Bayesian approach where priors are carefully specified so that interaction terms are “half in and half out” of the model.
Thank you very much. Do you have an example of the Bayesian approach you mentioned for dealing with interactions?
See this.
Prof. Harrell,
Is there an option to fit a negative binomial model in rms? I could not find the option in ‘family’. The Stack Overflow forum suggests using:
library(rms)
library(MASS)  # provides the negative.binomial() family
## generalized linear model with a fixed theta
Glm(counts ~ outcome + treatment, family = negative.binomial(theta = 1))
## equivalent call with the namespace made explicit
rms::Glm(formula = counts ~ outcome + treatment, family = negative.binomial(theta = 1))
I think the answers on stats.stackexchange.com are pretty complete. I would have to implement something special for θ for rms to give you a general solution. Code contributions welcomed. In the meantime, use glm or one of the other functions along with the effects or marginaleffects packages.
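As a hedged sketch of that interim route: MASS::glm.nb estimates θ by maximum likelihood rather than requiring a fixed value. The data below are simulated, and the variable names simply mirror the Stack Overflow example.

```r
# Sketch of the suggested workaround (simulated data).
# glm.nb estimates the dispersion parameter theta by ML.
library(MASS)
set.seed(1)
d <- data.frame(outcome   = gl(3, 6),
                treatment = gl(3, 2, 18),
                counts    = rnbinom(18, mu = 10, size = 2))
fit <- glm.nb(counts ~ outcome + treatment, data = d)
fit$theta    # estimated dispersion parameter
summary(fit)
```

The fitted object inherits from `glm`, so it can be passed to the effects or marginaleffects packages for interpretation.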
I’ve let the lectures of the past two days sink in and have a probably very basic / general question relating to this take home message:
Assemble accurate, pertinent data and lots of it, with wide distributions for X.
I assume it can often be better to incorporate data from a setting different from that of our question, as long as we can “borrow” some information from it for the question of interest?
For example, let’s assume we are interested in whether a biomarker, which could be used for trt guidance (not diagnosis), can be detected in a patient, who is known to have disease X.
The diagnostic test is very expensive, so we’d like to estimate a probability for a person of being positive for the biomarker before they start treatment. In our sample we have data on patients before they started treatment and under treatment, including the biomarker at regular intervals.
- If we are in a data-limited situation, would it be a good idea to include these follow-up observations and add covariates describing where the patient is on his/her treatment journey (while of course accounting for intra-person correlation)?
- Would you assume that this would in general improve our estimate of the baseline probability?
- Could we test whether this is a better approach for our data without double-dipping?
I also assume it’s probably a good idea to model the biomarker as ordinal instead of binary to have a more informative Y, even though we are clinically interested in the dichotomization?
Sorry, many questions.
These are good questions. When discussing adding extra information from outside the current dataset, we usually speak in terms of specifying informative priors in a Bayesian analysis or using Bayesian joint modeling of two datasets. I don’t think you’re referring to that.
This is a very useful way of thinking that is often ignored. Often the prediction of a test result from standard baseline variables is worth a paper in itself, to quantify the redundancy of the new test. That is for the case where the test is only done at baseline. But it stands to reason that if you can predict the baseline value you’ll be able to predict the value at other time points. The general question is whether the biomarker is providing new information about potential treatment response, and it’s usually the case in biomarker research that the early research is afraid of adjusting for too many patient variables. Instead the bar is set low for finding biomarkers that are predictive, without ever verifying that the biomarker provides new information.
I’m not clear about whether this is a good idea. The one setting where you do this is multiple imputation, where it’s OK to use future responses to help impute past measurements. What would be useful in general is to try to predict biomarker trajectories and not just the first measurement. If you had enough data with everyone followed the same duration, you could summarize each person’s biomarker trajectory with things like AUC, time of maximum marker value, etc., and try to predict those from baseline non-biomarker data.
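A minimal sketch of those per-person trajectory summaries, assuming a long-format data frame with hypothetical columns `id`, `time`, and `marker` (all simulated here):

```r
# Summarize each person's biomarker trajectory with AUC (trapezoidal rule)
# and the time of the maximum value; one row per person results.
set.seed(1)
d <- data.frame(id     = rep(1:3, each = 5),
                time   = rep(0:4, times = 3),
                marker = rnorm(15, mean = 10))
trap_auc <- function(t, y) sum(diff(t) * (head(y, -1) + tail(y, -1)) / 2)
summaries <- do.call(rbind, lapply(split(d, d$id), function(s) {
  s <- s[order(s$time), ]                      # ensure chronological order
  data.frame(id   = s$id[1],
             auc  = trap_auc(s$time, s$marker),
             tmax = s$time[which.max(s$marker)])
}))
summaries
```

These summaries then become the outcomes to predict from baseline non-biomarker variables.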
Definitely. Once you predict the biomarker as ordinal you can instantly estimate the probability that it exceeds any chosen cutoff. We try to use full information at every step in the analysis to increase reliability, stability, precision, and power and decrease overfitting.
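One way this plays out in rms, sketched on simulated data with made-up variable names: fit the marker as ordinal with `orm`, then obtain exceedance probabilities for any cutoff from `ExProb` (my reading of the `ExProb` interface; check `?ExProb`).

```r
# Model an ordinal marker with rms::orm, then estimate P(marker >= cutoff)
# for any chosen cutoff without ever dichotomizing Y before modeling.
library(rms)
set.seed(1)
d <- data.frame(age = rnorm(200, 50, 10))
d$marker <- round(d$age / 10 + rnorm(200))     # marker treated as ordinal Y
dd <- datadist(d); options(datadist = "dd")
f  <- orm(marker ~ age, data = d)
ex <- ExProb(f)                                # exceedance-probability function
lp <- predict(f, data.frame(age = 60))         # linear predictor at age 60
ex(lp, y = 6)                                  # estimated P(marker >= 6) at age 60
```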
Yes, I didn’t mean that.
Yes, I agree. The case I’m thinking about is slightly different as it’s actually cell free tumour DNA / a cancer’s mutations found in the blood stream which could give us more information about response to a specific treatment. But the allelic frequency of these mutations is sometimes just used as a biomarker to gauge progression / response, something which might not give us more information than existing clinical variables.
I was aware of this for multiple imputation, but is it really the same? For MI I must include the outcomes to impute baseline variables. But in this case I have the outcomes for T1 and covariates measured at or before T1, the outcomes for T2 and the covariates measured at or before T2, and so on. Are we really in the same situation?
Got it. Thanks!
If you are interested in how we did something similar, here we published a multivariable prediction model of a costly biomarker for progression, to aid in deciding whether to obtain this biomarker. In short, monoclonal gammopathy of undetermined significance (MGUS) is an obligate asymptomatic precursor to multiple myeloma (MM, a cancer we care about). All who develop MM, first have MGUS. Most who have MGUS never progress to MM. MGUS is easily detectable from blood laboratory measurements. There exists an intermediate stage between MGUS and MM, so called smoldering multiple myeloma (SMM). This is defined by a plasma cell percentage of 10% or higher in bone marrow. The risk of progressing from SMM to MM is roughly ten-fold higher than MGUS to MM and is therefore of interest for more intensive monitoring. Bone marrow sampling is a procedure done at specialized centers, can be painful and anxiety provoking, and is orders of magnitude more expensive than testing or following MGUS. Our model predicts the risk of SMM given the commonly available (and cheaper) laboratory parameters. We show that you could very safely not perform a bone marrow sample in the vast majority of individuals with MGUS, depending on your risk tolerance.
Hello Dr. Harrell,
I’m currently an undergraduate student in statistics and working as an intern at a quantitative investment fund. As part of my work and studies, I’ve been exploring some of the statistical challenges that arise in modeling financial time series, particularly in settings with low signal-to-noise ratios and limited data. Your emphasis on principled modeling and robust inference has been very influential in shaping how I approach these problems.
I would greatly appreciate your perspective on a few questions below.
Context: Analyzing daily financial data with low R², low signal-to-noise ratio, and relatively few data points.
- Modeling under low signal-to-noise ratio and limited data: In scenarios with low signal-to-noise ratio and relatively small datasets (e.g., ~5 years of daily financial data), what are the best practices to avoid overfitting while still leveraging most of the available data? Are there any guidelines for splitting the data into training, validation, and test sets in a time series context?
- Choosing a time windowing strategy: When modeling financial time series that may exhibit structural changes, how would you compare the merits of rolling windows, expanding windows, or a combination of both for estimating model parameters?
- Single models vs. ensembles: I understand you generally favor interpretable, parsimonious models. Are there any circumstances under which ensemble methods (e.g., bagging, boosting) might be justified from a rigorous statistical standpoint?
- Skewed payoff distributions and signal evaluation: In finance, average directional accuracy may be misleading. For instance, a signal that correctly predicts the direction of returns only 30% of the time might still be profitable if the magnitude of correct predictions outweighs the losses from incorrect ones (i.e., there’s a positive skew). In such cases, metrics like correlation or MSE seem inadequate. How would you evaluate predictive signals when returns are asymmetric? Would bootstrap methods be appropriate for comparing signal performance in this context?
- Volatility clustering and residual stability: Financial time series often exhibit volatility clustering, which affects both volatility estimation and model residuals. In a simple regression context, does it make sense to fit the model within windows that adapt to these clusters to ensure residual stability? Or would you recommend applying time-varying shrinkage to the coefficients instead?
- Shrinkage under temporal dependence: When dealing with limited data that exhibit temporal dependence (even without strong autocorrelation), how would you approach coefficient shrinkage? Are there shrinkage methods particularly suited to preserving temporal structure?
- Intraday data, autocorrelation, and valid inference: Intraday financial data provide many more observations, but often come with strong autocorrelation. How should one adjust inferential statistics, such as the t-statistic, to avoid spurious significance? Are there more robust methods to assess the relationship between signals and returns in this high-frequency, autocorrelated setting?
Thank you very much for your time and for sharing your deep knowledge so generously through your work and writing.
My experience does not extend to time series, so I hope that others who are versed in time series will respond. I’ll just give some general observations.
- I would not do any inefficient data splitting and would instead either fit an extremely flexible model or use something like the moving block bootstrap for studying model performance and overfitting.
- You’ve suggested a multi-step approach with different windows etc. This becomes fairly ad hoc and does not represent fully reproducible research. Instead I would spend most of the time with model formulation.
- One basis for a flexible model would be a semiparametric model with a long-term time trend, short-term periodic trends, and various Markov components; e.g., a second-order Markov process could have lag-1 and lag-2 values as predictors, after splining them.
- You can make the correlation pattern more general than a Markov process by also adding as predictors various summaries of past performance, such as Gini’s mean difference to capture volatility, and 9 deciles or at least the median of past performance.
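A hedged sketch of the moving block bootstrap mentioned in the first point, using `boot::tsboot` on a simulated stand-in for a daily return series (the statistic here is just the mean; in practice it would be a model-performance metric):

```r
# Moving block bootstrap: resampling contiguous blocks (length l) preserves
# short-range dependence that an ordinary bootstrap would destroy.
library(boot)
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.5), n = 250))  # autocorrelated series
stat <- function(x) mean(x)        # stand-in for a performance/overfitting metric
b <- tsboot(y, statistic = stat, R = 200, l = 20, sim = "fixed")
b$t0      # statistic on the original series
sd(b$t)   # bootstrap standard error that respects the dependence structure
```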
Julian Faraway (whose work has had a strong influence on Dr. Harrell’s regression modelling strategies) wrote a 2016 paper that discusses the contexts in which data splitting might be worth considering:
Faraway, J. J. (2016). Does data splitting improve prediction? Statistics and Computing, 26, 49–60.
A link to the paper is available in this thread.
Because of the lack of data and non-stationarity of time series, you are going to need to work backward from what is taught in financial theory textbooks to the actual market context in which you work.
I recommend a study of the book Expectations Investing, where you use financial theory to infer what factors investors are discounting. You can then apply your stats background to ferret out contrary information that suggests the current consensus is wrong, and how to implement it.
Recommended Reading
Johnstone, D. (2018). Accounting theory as a Bayesian discipline. Foundations and Trends® in Accounting, 13(1-2), 1-266.
Dr. Harrell
I am in the early stages of data exploration.
I have already performed a chunk test to examine whether time has a non-linear relationship with the outcome. The Chi-square statistic supported the inclusion of time as a non-linear variable, so I’m modeling time as a spline with 4 knots.
Given what I learnt about the time–outcome relationship from the chunk test, does it make sense to further compare the models:
y ~ exposure * time
y ~ exposure * ns(time)
when we already know that time has a non-linear relationship with the outcome?
-Clif.
That analysis would be redundant with what you have already done.
Think hard about the analysis strategy though. Except in special cases we don’t assess the importance of nonlinear terms except to satisfy our curiosity. We don’t do it to decide to keep or exclude the terms. The best strategies either pre-specify the model on the basis of subject matter knowledge, or fit the most flexible model that the effective sample size of Y will support, then call it a day.
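To illustrate that strategy on simulated data (all variable names here are made up): fit the flexible model once and read the pooled chunk tests from `rms::anova`, rather than comparing linear vs. spline fits.

```r
# Fit the most flexible model the sample size supports, then inspect
# rms::anova, which pools the nonlinear and interaction terms into
# single chunk tests rather than testing parameters one at a time.
library(rms)
set.seed(1)
n <- 200
d <- data.frame(exposure = rbinom(n, 1, 0.5), time = runif(n, 0, 10))
d$y <- with(d, 1 + exposure + sin(time / 2) + rnorm(n))
dd <- datadist(d); options(datadist = "dd")
f <- ols(y ~ exposure * rcs(time, 4), data = d)   # 4-knot spline for time
anova(f)   # pooled tests: overall time effect, its nonlinear components,
           # and the exposure x time interaction, each as one chunk
```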