RMS Course and Uncategorized Questions

f2harrell · September 19, 2021, 12:48pm

Regression Modeling Strategies: Course Design and General Questions

Course Design

In the introduction, concerning prediction vs classification, you state: “Far better is to use the full information in the data to develop a probability model, then develop classification rules on the basis of estimated probabilities”. Would you be able to show us how to do this? FH: See the Diagnosis chapter in BBR.
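As a rough illustration of the quoted idea (not taken from the BBR Diagnosis chapter): once a model supplies an estimated probability, a decision rule can be derived separately from the costs of the two kinds of error. The Python sketch below assumes hypothetical costs `cost_fp` and `cost_fn`; the function names are invented for illustration only.

```python
def decision_threshold(cost_fp, cost_fn):
    """Risk above which acting has lower expected cost than not acting."""
    return cost_fp / (cost_fp + cost_fn)

def decide(prob, cost_fp, cost_fn):
    """Act when the expected cost of a miss exceeds that of a false alarm."""
    # Expected cost of not acting: prob * cost_fn
    # Expected cost of acting:     (1 - prob) * cost_fp
    return prob * cost_fn > (1 - prob) * cost_fp

# Estimated probabilities from some already-fitted model (made-up values)
probs = [0.05, 0.30, 0.60]

# A missed event is assumed to be 4 times as costly as a false alarm
decisions = [decide(p, cost_fp=1.0, cost_fn=4.0) for p in probs]

# Same probability, different costs, different decision
same_prob_equal_costs = decide(0.30, cost_fp=1.0, cost_fn=1.0)
```

The point of the sketch is that the probability model is fitted once, while the classification rule changes with the costs, which differ from one decision maker to the next.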

Would it be possible to have some kind of “simple” overview flowchart that works through these principles? Essentially a simple guide from beginning to end (with emphasis on the principles discussed in this course) so one can have a framework to work with. (I know I am asking for an oversimplification of more complex approaches, and there will be contextual variations.)

Sample exercises with code and interpretation of the results and decision making would be helpful.

To echo the above, I would also appreciate some sample exercises with code etc. to work through the material in my own time. I would love a reference standard FH/DL example…
Would greatly appreciate that too. FH: Our long case studies will try to do this in a sense.

Very minor thing, but I like how Drew responded to queries above in a different colour. This makes the Q&A more readable in my opinion. FH: Great point - I’ll try to start doing that.

It would be helpful to put each idea into practice with immediate examples, not waiting for the case studies. A little practice with feedback would reinforce the theory. FH: This is always a challenge for me. I tend to wait for long case studies but need to figure out a way to bring pieces of the case studies earlier into the discussion. Any ideas welcomed.

If we want to continue the conversation about a question above (e.g. ask another clarifying question), it would be good to have a way to flag that question as needing another response. At the moment (totally understandably due to there being many questions) I don’t think Frank and Drew would notice when another response is needed on an earlier topic. How about we add a highlighted ? at the beginning of the question? Yes or at the end.

Might be helpful to port the case studies to bookdown or R Markdown notebooks that could be provided to everyone. At least for me, copy-pasting the LaTeX-generated code into R doesn’t work quite right (inappropriate spaces are inserted and the arrows don’t transfer properly). +1, I had the same issue. +1 +1: I think it might be more useful to have the original code in a .Rmd file than a .r file, for better organization and ease of viewing results from each section. The RMS web site has a link to get all the code, separated by chapter. Available here for anyone who didn’t find it: RMS

General Questions

Not prediction related, but since you mentioned p-value adjustment when dealing with multiple primary outcomes: what would you suggest doing in such a case? (Assuming that it is not possible to choose only one primary outcome.) FH: See http://hbiostat.org/papers/RCTs/multEndpoints/coo96mul.pdf

Imagine we would like to build a prognostic model and we have a variable that we know from the literature has a great predictive effect (e.g. compliance with therapy) but that we cannot measure a priori. Would it be appropriate/possible to build another predictive model assessing the risk (for example, of non-compliance) and then include this risk variable as a predictor in the prognostic model? I don’t see anything wrong with including logit(risk) or log(-log(risk)) in the model (depending on whether the model was logistic or Cox).
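The two transformations named in the answer can be written out as follows. This is only a sketch in Python of the formulas as stated above; the function names are hypothetical, and the predicted risks are made-up values that must lie strictly between 0 and 1.

```python
import math

def logit(risk):
    """log(risk / (1 - risk)), for use in a logistic model."""
    return math.log(risk / (1.0 - risk))

def log_minus_log(risk):
    """log(-log(risk)), the form mentioned above for a Cox model."""
    return math.log(-math.log(risk))

# Predicted risks from the first (e.g. non-compliance) model; made-up values
predicted_risk = [0.10, 0.50, 0.90]

# Transformed values that would enter the second model as a covariate
logit_covariate = [logit(r) for r in predicted_risk]
loglog_covariate = [log_minus_log(r) for r in predicted_risk]
```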

Can you allow for a different hazard function from event to event (if the intensity of the second event can depend on the outcome of the first…as occurs with pulmonary exacerbation)? There are many methods for handling gaps between events; see for example the vignettes that come with the R survival package. Or this may be a good place for a Markov model.

Circling back to the idea that software matters, I’m beginning to be worried by the impact that RStudio and those associated with the tidyverse are having on the direction of R, and that packages like yours are not being as widely adopted as those from that group. Do you share this concern? (And does tidymodels actually “encourage good methodology and statistical practice”? See https://www.tmwr.org/.) Do you have plans for how your software will be carried on if you were to (as the saying goes) get hit by a bus? Would you like your software adapted into the tidyverse framework and would it help for fans to advocate for that? I share your concerns. I don’t worry so much about “tidy models” because they are restrictive. I worry more about the tidyverse as an inefficient way to program for data manipulation. RStudio has done wonderful things for the community but the tidyverse has actually set the community back and has instilled bad programming practices. My stuff is all on GitHub so anyone could take over maintenance of it.

Unclear Topics in Need of More Work / Information

How to use simulation to validate data? How do we select parameter coefficients and type of distribution? I don’t know how to use simulation to validate data, but I do use it to validate estimation procedures. Re-simulation is very handy: fit a flexible model (full regression model, random forest, anything), freeze the model parameters, use them to simulate new Y values, refit to the new Ys, repeat.
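The re-simulation loop described in the answer might look roughly like this. It is a minimal Python sketch using a simple linear regression as the "flexible model"; the dataset is made up so that the example is self-contained, and `fit_line` is a hypothetical helper, not a function from any package.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual SD)."""
    n = len(x)
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (n - 2)) ** 0.5

rng = random.Random(1)

# Stand-in for the real dataset (made up here)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 2) for xi in x]

# Fit once and freeze the estimates; they now play the role of the truth
a0, b0, s0 = fit_line(x, y)

# Simulate new Y values from the frozen model, refit, and repeat
slopes = []
for _ in range(500):
    y_new = [a0 + b0 * xi + rng.gauss(0, s0) for xi in x]
    slopes.append(fit_line(x, y_new)[1])

# How well does the estimation procedure recover the frozen slope?
bias = statistics.fmean(slopes) - b0
spread = statistics.stdev(slopes)
```

Because the frozen parameters are known, `bias` and `spread` describe the performance of the estimation procedure itself, on data that resemble the real dataset in its X distribution and noise level.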

In Section 3.5, the introduction to strategies for missing data, it wasn’t clear to me if the bullet points were different ideas (it seemed to start out that way) or ideas that built on each other (predictive mean matching is built on earlier ideas, I think??) I think they are separate ideas.

Clear Topics You Especially Resonated With

I liked the beginning section that went over the ‘philosophy’ - it seems very valuable to understand where the instructor is coming from generally.

I think the ability to ask questions and the willingness of both instructors to engage with everyone (which can be tricky over the internet) has been awesome. Ditto - Charles Reynard

Not a single topic but you could draw an acyclic graph with passion for material as one node leading to joy of teaching as the other. Both instructors epitomize the old aphorism “What a man does for himself alone, dies with him, what he does for others and the world lives on and becomes eternal.” You made my day!

Thoroughly enjoyed the whole course. It’s given me so much immediately usable information as well as tonnes of motivation to go away and learn more. Thanks so much.

Is the logic behind the benefits of having a similar study using Bayes related to knowing the effect size of the phenomenon of interest? The main benefit is getting direct evidence, but use of prior information is another major benefit. If you have trustworthy, relevant studies, their results can help build an informative prior for the new study, effectively boosting the sample size.
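To make the "boosting the sample size" remark concrete, here is a small Python sketch of the simplest textbook case: a normal mean with known standard deviation, where the prior is expressed as being worth `prior_n` observations. All numbers are made up, and the earlier study is taken at face value, which is an assumption of this sketch rather than a recommendation.

```python
import math

def posterior_normal_mean(prior_mean, prior_n, data_mean, data_n, sigma):
    """Conjugate update for a normal mean with known SD sigma.

    The prior is treated as if it were prior_n earlier observations
    with mean prior_mean.
    """
    total_n = prior_n + data_n
    post_mean = (prior_n * prior_mean + data_n * data_mean) / total_n
    post_sd = sigma / math.sqrt(total_n)
    return post_mean, post_sd, total_n

# Earlier study worth 50 observations; new study has 100 (made-up numbers)
post_mean, post_sd, effective_n = posterior_normal_mean(
    prior_mean=2.0, prior_n=50, data_mean=3.0, data_n=100, sigma=10.0
)

# Precision available from the new data alone, for comparison
data_only_sd = 10.0 / math.sqrt(100)
```

With these inputs the posterior standard deviation is the one a study of 150 observations would give, which is the sense in which an informative prior adds to the sample size.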

+1 to Nathan’s request. I see good examples (like FH’s examples) of simulation, but it would be nice to have a strong comprehensive treatment of how to do it in R. +1

What are the best books on simulation, regression, Bayesian analysis, and longitudinal data? +1

Can you give us references to clinical papers that you have been involved in that present partial effects plots? Also clinical papers that use spline models, ordinal regression, etc.? I just wanted to see how what we learned is applied in publications. +1
http://hbiostat.org/papers/feh/

Can the R package coin give exact Wilcoxon test statistics in the presence of ties? The description for the wilcox_test and other non-parametric tests states: “oneway_test, wilcox_test, kruskal_test, normal_test, median_test and savage_test provide the Fisher-Pitman permutation test, the Wilcoxon-Mann-Whitney test, the Kruskal-Wallis test, the van der Waerden test, the Brown-Mood median test and the Savage test. A general description of these methods is given by Hollander and Wolfe (1999). For the adjustment of scores for tied values see Hájek, Šidák and Sen (1999, pp. 133–135).” I recall seeing the claim in an introductory R book that this method does give exact statistics. Did the author oversell the package? I am ⅔ confident that coin can give exact p-values in the presence of ties. It will be slow for large datasets.
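For intuition about why an exact p-value remains well defined with ties, the permutation distribution of the rank sum can be enumerated directly using midranks. The Python sketch below is a brute-force illustration only; it is not coin's algorithm, its two-sided definition (distance from the permutation mean) is an assumption that may differ from coin's, and the function names are hypothetical.

```python
from itertools import combinations

def midranks(values):
    """Ranks with tied values given the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def exact_rank_sum_pvalue(x, y):
    """Two-sided exact permutation p-value for the rank sum of group x."""
    ranks = midranks(list(x) + list(y))
    n, m = len(ranks), len(x)
    expected = m * (n + 1) / 2  # permutation mean of the rank sum
    obs_dev = abs(sum(ranks[:m]) - expected)
    count = total = 0
    # Enumerate every way of assigning m of the n midranks to group x
    for idx in combinations(range(n), m):
        total += 1
        if abs(sum(ranks[i] for i in idx) - expected) >= obs_dev - 1e-9:
            count += 1
    return count / total

# Small made-up example with ties both within and across groups
p_value = exact_rank_sum_pvalue([1, 2, 2], [3, 3, 4, 5])
```

Full enumeration grows combinatorially with sample size, which is consistent with the remark above that exact computation will be slow for large datasets.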