I’m still catching up on the 2020 videos. In the Day 1 video, you made brief mention of generalized additive models. I have dabbled in the mgcv package a bit, and read a few books, trying to understand them. How do GAMs, as implemented in, say, mgcv, differ from regression with natural cubic splines, as discussed in the RMS course and your book? What are the relative merits of each? When would you use one method versus the other? Thanks.
Toward the end of the 2020 Day 1 video, you said something along the lines of “if an interaction emerges from the analysis, you should be able to explain it right away” (paraphrasing only). Meaning you should not be surprised by interactions, and you really should specify them in advance. I had to chuckle. I worked with a very bright PA in the Air Force, who used to say, about skin rashes and skin biopsies: if you don’t know what it is before you send the biopsy specimen to the pathologist, you won’t know what it is when you get the report back. Meaning a shot-in-the-dark biopsy, with no preceding theory or hypothesis, is unlikely to be helpful.
Wonder if you could comment on something I see in the environmental health literature about air pollution. A very common thing is to use air concentrations of various pollutants as predictors of various health outcomes. One could specifically deploy air monitoring machines for the purpose of a study. But often such studies use pre-existing data collected from routine, ongoing sampling by fixed-location monitors (like those the EPA and other governmental agencies operate). The trouble is, these machines are few and far between. So to get an input variable to represent the exposure experienced by any particular person or neighborhood or city, the researchers use some pre-existing deterministic physical-chemical-meteorological model to predict an exposure surface for a region, based on the measurements at sparse and distant location(s). This is often a non-stochastic model. Then they use this exposure as a predictor in the health outcome model. It seems to me that they are using the output of a model as a predictor in another model, as if it were a measurement when really it was not. I think there must be some error around that exposure “measurement”, but I rarely see that error being included or accounted for in the health outcome model. Do you have any thoughts on this? Thanks.
Hi Dr. Harrell.
I am working with a dataset which has a lot of missing data and will also require dimension reduction (102 predictors with a sample size of 550 and a binary outcome with a limiting sample size of 215). Naturally, I would tend to perform MI via `aregImpute`. However, I have also learned from your course (and a case study in the RMS book) that a good way to perform dimension reduction is to first transform variables to maximize correlations between them via `transcan`, then cluster them (e.g. via `varclus`), and then model the clusters. But it also seems to me that `transcan` automatically imputes the data (using a method that does not perform as well as `aregImpute`). Is there a way to use `transcan` to transform data that was already imputed via `aregImpute`? What would you normally recommend in such a situation?
Thanks,
In my experience, GAM results do not differ from parametric splines enough to be worth the extra computation time or to make up for not having a prediction equation at the end. The main advantage of a GAM is that it can save a bit of time thinking about how many knots to put in a spline, because it can use cross-validation to determine the amount of smoothing to use.
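To make the comparison concrete, here is a minimal sketch (with a hypothetical data frame `d` containing outcome `y` and predictor `x`) contrasting an `mgcv` smooth, where cross-validation or REML picks the amount of smoothing, with an `rms` restricted cubic spline, where the knots are chosen in advance and a closed-form prediction equation is available afterwards:

```r
# Minimal sketch; d, y, and x are hypothetical
library(mgcv)
library(rms)

f_gam <- gam(y ~ s(x), data = d)        # smoothness chosen automatically (GCV/REML)
f_rcs <- ols(y ~ rcs(x, 5), data = d)   # 5 knots specified in advance

Function(f_rcs)   # closed-form R function giving the prediction equation
```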
It depends on how well thought out or how well validated that model is. If it’s based purely on physics, then it depends on the physical laws being true and the measurements not having bad errors.
An empirical approach would be to use distances from measurement devices as decay factors (interactions) with exposure volumes.
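A rough sketch of what such an empirical specification might look like (all variable names are hypothetical, and this functional form is only one of many possibilities):

```r
# Hypothetical sketch: exposure is the nearest monitor's reading and dist is
# the distance to that monitor; the exposure x distance interaction lets the
# estimated exposure effect decay with distance from the measurement device
library(rms)
f <- lrm(outcome ~ rcs(exposure, 4) * rcs(dist, 3), data = dat)
```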
Good question. I mainly use `transcan` for single imputation, when the fraction of records with any `NA`s is very small. For multiple imputation it doesn’t account for some sources of uncertainty that `aregImpute` accounts for, but it will work reasonably well as long as the fraction of `NA`s is not very large.
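For illustration, a minimal sketch of the two approaches in `Hmisc` (variable names hypothetical):

```r
library(Hmisc)
library(rms)

# Single imputation (reasonable when the fraction of NAs is small)
t <- transcan(~ x1 + x2 + x3, data = d, imputed = TRUE, transformed = TRUE)

# Multiple imputation, accounting for more sources of uncertainty
a <- aregImpute(~ y + x1 + x2 + x3, data = d, n.impute = 30)
f <- fit.mult.impute(y ~ x1 + x2 + x3, lrm, a, data = d)
```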
Thanks.
If I were to use `transcan` for multiple imputation (and transforming of variables), could I then include the outcome in the imputations? I was under the impression that one shouldn’t include the outcome for single imputation, but that for multiple imputation it would be OK. Although I think I recall reading in your book not to include the outcome when using `transcan`. Was that only if I’d be applying single imputation?
I am 0.84 certain that exclusion of the outcome variable pertained only to single imputation.
To clarify, I plan on using `transcan` to simultaneously impute and provide optimal transformations of my predictors in order to cluster them via `varclus`. Would it be OK to include the outcome in my call to `transcan`, or is it a problem here anyway, since I’d also be using the outcome in optimizing the transformations (even if I specify not to transform the outcome itself)? If this is a problem, is there a way to use the outcome in the imputation process, yet use only the predictors in making the optimal transformations?
Thanks.
For that purpose you’ll have to drop the outcome variable from the analysis. You are using single imputation, it appears.
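A sketch of that workflow, assuming hypothetical predictors `x1`–`x4` (the outcome is deliberately left out of the `transcan` call):

```r
library(Hmisc)
t <- transcan(~ x1 + x2 + x3 + x4, data = d,
              imputed = TRUE, transformed = TRUE)
X <- t$transformed      # singly imputed, optimally transformed predictors
v <- varclus(X)         # hierarchical clustering of the transformed predictors
plot(v)
```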
Thanks! I didn’t realize it, but yes I am.
Putting transformations aside, is there a way to perform dimension reduction (e.g. PCA, sparse PCA) with multiple imputation? In your book you have a case study where you apply PCA and sparse PCA with single imputation. Do you think it’s safe to apply single imputation with, let’s say, 30% of rows containing at least one missing value (perhaps, since each imputed variable is only part of a cluster anyway, the method of imputation isn’t as important, since its impact is “watered down”)? How would you combine information across multiple imputations for PCA (wouldn’t the components/clusters be different in each imputed dataset? intuitively I’d think you can’t simply average the coefficients, since the clusters wouldn’t match)? Are any of these methods compatible with `fit.mult.impute`?
Thanks.
I don’t think single imputation is a good idea with more than 0.05 of observations having missing values. I don’t think there are many tools that simultaneously handle imputation and data reduction. I hope that someone responds with a pleasant surprise about a method I’ve missed.
Thanks.
Is there any alternative approach you’d recommend in the context of a low n:p ratio (e.g. a limiting sample size of 215 with 102 predictors) with 0.3 of observations having missing values? Are you better off using ridge regression with multiple imputation, or single imputation with PCA? Is there a third option you’d recommend?
Until someone comes up with a more efficient approach, think about doing 30 imputations and 30 data reduction analyses, then code what `Hmisc::fit.mult.impute` does to combine the 30 data reduction results into an average of 30 analyses.
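A rough sketch of that manual loop (hypothetical variable names; note that averaging across imputations is only sensible after aligning component signs and order, which PCA does not guarantee):

```r
library(Hmisc); library(rms)

a <- aregImpute(~ y + x1 + x2 + x3, data = d, n.impute = 30)
fits <- vector('list', 30)
for (i in 1:30) {
  # i-th completed dataset
  di <- as.data.frame(impute.transcan(a, imputation = i, data = d,
                                      list.out = TRUE, pr = FALSE,
                                      check = FALSE))
  # data reduction on this completed dataset
  pc <- princomp(scale(di[, c('x1', 'x2', 'x3')]))
  di$pc1 <- pc$scores[, 1]
  di$pc2 <- pc$scores[, 2]
  fits[[i]] <- lrm(y ~ pc1 + pc2, data = di)
}
# then combine the 30 fits along the lines of what fit.mult.impute does,
# after checking that pc1/pc2 are comparable across imputations
```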
I’m running a Bayesian proportional odds regression using `blrm`. One of my covariates is an ordered factor. The treatment contrast for this covariate is using the mode as the reference level. How can I force it to use another level (the base level)?
Run the `contrast()` function (`contrast.rms`) on the `blrm` fit object. This allows you to take control, e.g.
```r
f <- blrm(...)
contrast(f, list(treatment = c('A', 'B')), list(treatment = 'C'))  # compares A and B with C
```
`contrast` also has many options to take advantage of Bayes, e.g., to get posterior distributions of differences in predicted means, quantiles, or exceedance probabilities.
Thanks. That works well. Is there an easy way of taking the output from `contrast` and plotting a chart of the odds ratios with limits, such as

```r
f <- blrm(...); plot(summary(f))
```
With `summary(f)` you can specify reference values, i.e., to compare with `treatment='A'` use `summary(f, treatment='A')`. But this won’t allow you to compare derived quantities or to have more general contrasts. Yes, you can take `contrast` results and pass them to `ggplot2` and other plotting systems, as shown in the help file. Type `?contrast.rms`.
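For example, a sketch along the lines of the help file (assuming the contrast object carries `Contrast`, `Lower`, and `Upper` elements on the log-odds scale, as `contrast.rms` documents):

```r
library(ggplot2)
k <- contrast(f, list(treatment = c('A', 'B')), list(treatment = 'C'))
w <- as.data.frame(k[c('Contrast', 'Lower', 'Upper')])
w$comparison <- c('A vs. C', 'B vs. C')
ggplot(w, aes(x = exp(Contrast), y = comparison)) +
  geom_point() +
  geom_errorbarh(aes(xmin = exp(Lower), xmax = exp(Upper)), height = 0.2) +
  xlab('Odds ratio (with 0.95 interval)')
```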
What is the recommended approach for multiple imputation when the outcome (e.g. survival time) is censored? Typically we include the outcome in our imputation model, but I am not sure how to incorporate the censoring indicator. Should I include an interaction between the censoring indicator and survival time? Is there a different recommendation?
Thanks.