I understood from today’s lecture that one cannot look at the distribution of Y prior to analysis in order to determine an appropriate transformation for the model (e.g., log, sqrt). Can you clarify why? Does this use up “phantom degrees of freedom” even if it is not informing my choice of X?
In practice, how would you recommend implementing Gls without knowing the distribution of Y? If in reality Y were heavily skewed, wouldn’t the results be inaccurate? Am I misunderstanding something?
Thanks,
I have a question about items 3 & 4 in 4.12.1 Developing Predictive Models.
Assume we collect the data on X and Y at different times, or we collect several waves of X. Do we need to consider a potential calendar-time issue in the multiple imputation process?
Actually, I did not understand the spirit of using X and Y to impute X and Y, because we avoid using information from Y in the data reduction of X. Following this spirit, shouldn’t we also avoid using information from Y to impute X?
Thank you so much, and I hope you have a great night!
We have talked about how categorizing/dichotomizing a continuous variable is not a good idea. I’m wondering what would be a good way to handle an age variable when we have de-identified data where we don’t know the exact ages of people older than 90 years. Using age as an ordinal variable in this case does not seem like a good idea, as we would lose information. Is there a good way to handle this situation? Would using multiple imputation for the ages we don’t exactly know be a good idea in this case?
Thanks!
Actually, my team is thinking about integrating model approximation into calibrate and validate by redoing model selection for each bootstrap, so any information about the problems you or your student ran into (or speculated about) would be valuable! Thanks!
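For reference, the closest built-in machinery I know of is the bw=TRUE option of validate and calibrate, which repeats fast backward stepdown in every bootstrap resample; that is variable selection rather than model approximation, but a minimal sketch on simulated data may still be a useful starting point:

```r
# Minimal sketch: redo backward stepdown inside every bootstrap resample so
# the validation penalizes the selection process itself (simulated data).
library(rms)

set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 + rnorm(n)

f <- ols(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)

validate(f, B = 200, bw = TRUE)        # bw = TRUE reruns fastbw() on each resample
cal <- calibrate(f, B = 200, bw = TRUE)
plot(cal)
```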
Anything’s possible! As things stand now you would just impute the outcome, then impute the X’s. Or use one of the other methods here.
At first glance the paper looks useful.
I assume that if you are working in high dimensions that your number of subjects must be large. Then you may not need propensity scores but could directly adjust for covariates.
It’s a tough question. You can’t look “too much”. This relates to model uncertainty, e.g., your final confidence bands don’t know about the earlier abandoned transformations. The Faraway paper covers this. This relates to my quote from the initial part of the course: using the data to guide model specification is almost as dangerous as not doing so.
It is required to use Y to impute X. Regarding calendar time, this would be an ideal variable to include in the imputation model, and is a necessary one in all likelihood.
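For concreteness, a minimal sketch of an imputation model that puts both Y and calendar time on the right-hand side of the Hmisc::aregImpute formula (data and variable names are invented):

```r
# Minimal sketch: include the outcome y and calendar time in the imputation
# model when imputing missing X's (all names and data are made up).
library(Hmisc)

set.seed(1)
n <- 300
d <- data.frame(
  y     = rnorm(n),
  x1    = rnorm(n),
  x2    = rnorm(n),
  ctime = as.numeric(sample(2015:2020, n, replace = TRUE))
)
d$x1[sample(n, 40)] <- NA     # some missing covariate values

# y and calendar time both appear in the imputation formula,
# so they inform the imputed x1 values
a <- aregImpute(~ y + x1 + x2 + ctime, data = d, n.impute = 5)
a
```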
A very good question. I’ve taken the easy way out and just leave them at age 90. It would be a little better to impute to ages above 90 but you would then have to make a parametric distributional assumption because you would be extrapolating.
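To make the parametric idea concrete, here is one possible sketch; the exponential tail and the assumed mean excess of 4 years beyond 90 are illustrative assumptions only, and in practice the draw would be repeated as part of multiple imputation:

```r
# Minimal sketch of the parametric idea: for ages recorded as "90+", draw an
# imputed age from an assumed exponential tail beyond 90. The tail shape and
# the mean excess of 4 years are assumptions for illustration only.
set.seed(1)
age      <- c(72, 85, 90, 68, 90, 90)   # 90 codes "90 or older"
topcoded <- age >= 90

mean.excess <- 4                        # assumed mean years lived beyond 90
age.imputed <- age
age.imputed[topcoded] <- 90 + rexp(sum(topcoded), rate = 1 / mean.excess)
age.imputed
```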
There is a nice discussion of colliders, mediators, and the Table 2 fallacy
here: https://discourse.datamethods.org/t/does-the-table-2-fallacy-apply-in-this-nejm-paper
Hi @f2harrell,
I couldn’t find the reference about propensity scores as a penalised method in your list of papers. You said you have a list of papers for observational studies. Can you please point me to the link you mentioned? Thank you!
Thanks, @f2harrell, for referring others to this discussion. I appreciate @PerPersvensson’s comments, suggesting that maybe there’s something else going on.
In the course, you mentioned that in a repeated-measures study, baseline Y should be a covariate. How would you proceed if there is daily dosing and samples are taken several times within a day on different days, e.g., you have Day 1 predose, 4 hr, 10 hr, and 24 hr and Day 20 predose, 4 hr, 10 hr, and 24 hr? Intuitively I would have put predose on the RHS and allowed an interaction between hr (continuous spline) and Day (categorical). But if the Day 1 predose is an adjustment variable, how will we treat the Day 20 predose?
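For concreteness, here is roughly what I had in mind (simulated data and made-up names; how to handle the Day 20 predose is exactly the part I am unsure about):

```r
# Rough sketch (simulated data, made-up names). predose1 is the Day 1 predose
# carried as a baseline covariate; the Day 20 predose is left as a response
# row at hr = 0, which is the part in question.
library(rms)

set.seed(1)
d <- expand.grid(id = factor(1:20), day = c(1, 20), hr = c(0, 4, 10, 24))
d <- subset(d, !(day == 1 & hr == 0))      # Day 1 predose enters only as a covariate
d$time     <- (d$day - 1) * 24 + d$hr      # hours since first dose
d$predose1 <- ave(rnorm(nrow(d)), d$id)    # per-subject Day 1 predose value
d$y        <- d$predose1 + 0.05 * d$hr + rnorm(nrow(d))
d$dayf     <- factor(d$day)

dd <- datadist(d); options(datadist = 'dd')

f <- Gls(y ~ predose1 + rcs(hr, 3) * dayf,
         correlation = nlme::corCAR1(form = ~ time | id),
         data = d)
f
```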
I don’t know of a paper that has laid this out. This is akin to the fact that you can write principal component analysis as a kind of penalized estimation. I wish I had more.
A great question. I don’t have experience with that “new baseline” design. It may be covered in @Stephen’s amazing Statistical Issues in Drug Development book or in Steve Piantadosi’s clinical trials book.
Age is right-censored at 90 years. You could use a model that deals with censoring.
If you are interested in the Propensity Score, you might find our paper (by Erika Graf, Angelika Caputo and me) of interest: https://pubmed.ncbi.nlm.nih.gov/18058851/
It is also treated in Section 7.2.13 of my book Statistical Issues in Drug Development and is the subject of this slideshare: https://www.slideshare.net/StephenSenn1/confounding-politics-frustration-and-knavish-tricks
We don’t have as many models for censoring of independent variables.
I was hoping you would chime in Stephen.
If I am modeling data across multiple clinical trials, would it make sense to validate using cross-validation, leaving out one trial at a time as the test set (in this context it wouldn’t be repeated CV), or would the bootstrap still be preferred? If CV is the preferred approach in this context, is there still a way to adjust for trial in this model? Should I assume a compound symmetry correlation matrix for each trial?
Trial may be too large of a unit for the bootstrap to work well. If you can treat trials as exchangeable (as done in most meta-analysis; same as compound symmetric) you can use random effects for trials.
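As far as I know rms itself does not have a random-effects fitter, so one option outside rms is nlme::lme with a random intercept for trial, which corresponds to the exchangeable/compound-symmetric assumption just mentioned. A minimal sketch on simulated data (trial count and effect sizes invented):

```r
# Minimal sketch (simulated data, made-up names): random intercept for trial
# via nlme::lme, which treats trials as exchangeable.
library(nlme)

set.seed(1)
d <- data.frame(trial = factor(rep(1:6, each = 50)), x = rnorm(300))
d$y <- 1 + 0.5 * d$x + rnorm(6, sd = 0.5)[as.integer(d$trial)] + rnorm(300)

f <- lme(y ~ x, random = ~ 1 | trial, data = d)
summary(f)
```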
Thank you very much for your response. Is there a reason you suggest treating trial as a random effect instead of imposing a compound symmetry correlation matrix for trial via Gls? I was under the impression from the course that Gls is the preferred approach. Also, is there a function in the rms package which allows for random effects?
GLS (generalized least squares) is a good approach for continuous longitudinal outcomes. It provides nice ways to handle within-subject serial correlation as well as compound symmetry (which typically fits less well than say an AR(1) serial correlation pattern). The correlations we’re talking about are continuous in time, which is much different from how one models a categorical variable such as clinical site.
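To make that concrete, here is a minimal sketch on simulated data (the dataset, knot count, and time points are invented) comparing a continuous-time AR(1) pattern with compound symmetry in Gls:

```r
# Minimal sketch (simulated data): Gls with a continuous-time AR(1) structure
# versus compound symmetry; the two correlation patterns can be compared by AIC.
library(rms)

set.seed(1)
d   <- expand.grid(id = factor(1:30), time = c(0, 1, 3, 6, 12))
d$y <- 0.2 * d$time + rnorm(nrow(d))

f.ar1 <- Gls(y ~ rcs(time, 4),
             correlation = nlme::corCAR1(form = ~ time | id), data = d)
f.cs  <- Gls(y ~ rcs(time, 4),
             correlation = nlme::corCompSymm(form = ~ 1 | id), data = d)
AIC(f.ar1); AIC(f.cs)
```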
Hi Dr. Harrell. Sorry for the very basic question. I was wondering how to run model validation and calibration after using multiple imputation (using Hmisc and rms).
A good question with no easy answer. There are some notes from Ewout Steyerberg and possibly some references on the website of papers I pointed class participants to.
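In the meantime, one pragmatic sketch, which is my assumption rather than an established recipe: run the bootstrap validation separately on each completed dataset from aregImpute and average the optimism-corrected indices (simulated data, made-up variable names):

```r
# One pragmatic sketch (not a definitive recipe): validate on each completed
# (imputed) dataset and average the optimism-corrected indices.
library(rms); library(Hmisc)

set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 + 0.5 * d$x2 + rnorm(n)
d$x2[sample(n, 30)] <- NA              # some missing covariate values

a <- aregImpute(~ y + x1 + x2, data = d, n.impute = 5)

vals <- lapply(1:5, function(i) {
  di  <- d
  imp <- impute.transcan(a, imputation = i, data = d, list.out = TRUE,
                         pr = FALSE, check = FALSE)
  di[names(imp)] <- imp                # fill in the i-th set of imputations
  f <- ols(y ~ x1 + x2, data = di, x = TRUE, y = TRUE)
  validate(f, B = 100)
})

# Average the optimism-corrected indices across imputations
Reduce('+', lapply(vals, function(v) v[, 'index.corrected'])) / length(vals)
```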