What is the right model of statistics training for physician researchers?


Reading through a discussion about Bayesian methods in scientific training programs, I’m prompted to ask a more general question: What is the right model of statistics training for physician researchers?

To give a bit of context, I co-instruct with Frank Harrell (@f2harrell) the regression course in the Master of Science in Clinical Investigation (MSCI) program at Vanderbilt. I also co-instruct Biostatistics I within the MPH program. Frank and I (and anyone we can rope into the discussion) often ask ourselves:

  1. How much of the course should be hands-on, experiential learning?

  2. How much of the course time, homework, and evaluation should be devoted to conceptual understanding, and how much to technical details?

  3. What degree of facility with statistical software should we target for the median student?

This question bears on that earlier discussion because time is a scarce resource: extra time spent with statistical software, graphics, and technical details is time not spent teaching a wider variety of topics.

For the current iteration of our MSCI course:

  1. About 40% of class time is hands-on and 60% discussion/lecture.

  2. Evaluation is based primarily on a final project in which students develop a research goal (usually predictive modeling but sometimes inference), generate a research plan, and then execute it and present the results to the class. Most students will use their own data, but some will use publicly available datasets. We do assign homework, but the TAs and I work with the students to rework and resubmit homework as many times as needed. Because of this, homework is mostly a learning tool and not an evaluation tool.

  3. We demand a high degree of facility with statistical software. Students are required to generate a number of statistical graphics, including partial effect plots, calibration curves, etc.

Frank and I constantly rethink, revise, and rework this course. We are already talking about this course again even though it is 6 months away! What is the right way to do this?

NOTE: I did not ask the question about which statistical software to use in this course. That is a topic for another day and another post. I’d like to generate discussion about how to teach regression irrespective of the statistical software. Similarly, if you want to comment on what should be taught, the link above is a better place to have that conversation. Here we are focusing on how regression should be taught.


Former student from the program here with $0.02, maybe $0.01, really (an initial thought).

My initial thought is to teach methods you feel comfortable supporting in the students who go forth & research–and only those methods. It’s much less desirable to teach approach A and then criticize those who implement approach A. I think one should treat this as one would treat a bench research method: it’s an acceptable approach, or don’t teach it. There are two ways to achieve this harmony between what is taught & what is admired or supported, I think.

One approach would be to decide that the currently taught methods are more acceptable than you now believe; the other would be to change what is taught to something you feel comfortable supporting.


Tom, I’m so glad you opened this as a general topic. Input from a diverse group of clinicians and statisticians would help us a great deal. Regarding just the Bayesian piece of the puzzle, I think that handing out a couple of annotated case studies, showing the code, results, and interpretation, would set us on the right path, even if we don’t ask students to do this themselves (for this next course offering). At present, we do teach a Bayesian slant to everything in the course (though not any computations) because we emphasize the need for pre-specification and designing analyses based on clinical knowledge rather than purely empirical data analysis.

For those unfamiliar with Tom’s and my approach to the 2nd biostatistics course for rising clinical researchers, we aim very high and cover regression splines to allow for nonlinear effects, multiple imputation for missing data, and the bootstrap for internal model validation. We show them how to derive calibration curves to check predictive accuracy. I think that each year the students rise to the challenge, and very effectively use these advanced concepts in their projects. I think the projects are an important part of the course.
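To make the internal-validation piece concrete, here is a minimal sketch of the bootstrap optimism correction. It uses Python with simulated data and a toy one-predictor linear model (all hypothetical; real course projects would use full-featured regression software), but the resampling logic is the same idea: refit on each bootstrap sample, and compare performance on that sample to performance back on the original data.

```python
# Sketch of bootstrap optimism correction for internal validation,
# using a toy one-predictor linear model and R^2 as the performance
# index. Data are simulated for illustration only.
import random

random.seed(42)

def fit_ols(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def r2(model, xs, ys):
    """Coefficient of determination of a fitted model on (xs, ys)."""
    a, b = model
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Simulate a small, noisy dataset (small n exaggerates optimism).
n = 30
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]

apparent = r2(fit_ols(xs, ys), xs, ys)

B = 200
optimism = 0.0
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]
    bx = [xs[i] for i in idx]
    by = [ys[i] for i in idx]
    m = fit_ols(bx, by)
    # Performance on the bootstrap sample minus performance of the same
    # model on the original data estimates how inflated the apparent
    # index is; averaging over B replicates gives the optimism.
    optimism += r2(m, bx, by) - r2(m, xs, ys)
optimism /= B

corrected = apparent - optimism
print(f"apparent R^2  = {apparent:.3f}")
print(f"optimism      = {optimism:.3f}")
print(f"corrected R^2 = {corrected:.3f}")
```

The same pattern extends directly to logistic or Cox models with the c-index or calibration slope as the performance measure.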

I do remain uncertain about the best mix of understanding methods and interpreting statistical results vs. hands-on computing. But before we revamped the course we did an extensive survey to gauge current and past students’ attitudes. They voted strongly for spending a lot of time with software.


Despite the wide range of graduates from the program, I’m surprised how often I get feedback that the hands-on portion is important NOT because the students want to learn how to code, but rather because it helps them understand the concepts in the discussions and lectures.


Hi Thomas- I took this course long before Frank took it over with you.
The layout of your course sounds great to me. However, I will agree with your prior students. Being able to immediately use the software and run analyses helps reinforce the concepts, and gives a basis for tackling real world problems. Simply learning the theory without the hands-on computing would not give them any basis for doing their own analysis.

Even though you aren’t discussing the software, I’ll weigh in because it is closely related, especially for those with zero coding experience. I learned using SPSS (gasp), which did allow me to immediately perform analyses while learning the basics. I actually tried very hard to learn R (yes, when Frank taught a few classes) but it just didn’t click, since any programming was foreign to me (as in zero experience). Over time I have learned enough programming to transition over to R. So I think the need for early hands-on use matters more than the choice of software.

Beyond the introductory courses, I think that pursuing more advanced courses (Regression Modeling Strategies, or a hands-on Bayes course, for example) would be a nice option for interested students.


That’s well said, Matt. It sounds like the biostats courses have changed quite a bit since we took them.


I appreciate the comments, James. I think your point about offering self-selected students special help with R up front and continuing through the month of the course is a good one, which I’ll discuss with Tom. Stata does seem to be the choice for the majority of students. But I am a bit concerned that you have to do almost as much programming with Stata as you do with R to get it to do the more advanced things we need.

Dynamic “notebooks” seem to have a lot of potential here: under RStudio, students can modify example code and quickly have the new output inserted into an HTML report.

It’s possible that using software is not so much about enabling the students to do things by themselves in the future, but rather that it brings out what is going on with the statistical methods.


I have always learned the most when applying methods to my own data, or to a mock dataset with the same structure as one that I’ll use. Solving problems burns the knowledge into my brain much better than any didactic lecture. I think this is pretty universal, but others may have different learning styles.

There will always be a small subset of motivated students willing to learn R. As a non-programmer, the RStudio suite and the ability to output HTML results finally allowed me to make the switch. I have plenty of opinions about how non-programmers can be introduced to R and am happy to talk with you about it. I am not sure that I could have used R initially for Biostats I (circa 2004). Maybe if there were an intensive introductory programming/R course, it could be done.

Another topic that needs plenty of coverage: data management, databases, file naming, variable naming, validation, etc. I know that Dan Byrne covered this up front in our course, but there can never be too much attention paid to this. I know plenty of basic scientists who don’t have a firm grasp on this and could use it. It seems important enough that it could be expanded into a separate course.

A Self-Paced Curriculum for Learning Data/Research Methods Well?

Frank’s question about whether the goal is to be able to do the analyses, or to understand them without necessarily performing them, is pivotal in this discussion. It would be interesting to know what the original architects of these MS degrees envisioned in terms of competencies vs. understanding/appreciation of ideas. Was it intended to address the under-resourcing of clinical investigators, as I suspect? https://jamanetwork.com/journals/jama/fullarticle/2441254


For historical perspective for what it’s worth, I believe this is the original RFA that launched the MS degrees in clinical research, the Clinical Research Curriculum Award (K30): https://archives.nih.gov/asites/grants/05-21-2015/grants/guide/rfa-files/RFA-HL-04-004.html

My thought is that the key bottleneck to getting work done as a clinical investigator is financial resources. To get grants, one needs to show data, and to analyze data, one needs either to know how to do that or to hire someone. It’s the hiring of someone that is financially difficult for early-career clinical investigators. I like the idea of the MS programs being about how to think about problems, rather than how to do a particular analytical procedure. But short of giving early-career clinical investigators the money they need for excellent access to people who can also think in this way and do the analyses, the MS student must learn to do. Sensible?


Wow, what a great topic!

I used to think software should be a minimal part of the first statistics course taught to medical students (or anyone, really). However, my beliefs have shifted as I wade through this career, as they should for any continual learner.

Early in my career I worked with a lot of trainees (medical students, residents, and fellows) from several different specialties (obstetrics, gynecology, cardiology, and surgery, to name a few). Many of them are capable of opening a software package and blundering about; unfortunately, most of them seem to have retained a handful of commands better than the underlying knowledge. I had enough interactions with people who would show up, Stata open, attempting to run an ANOVA before they could even explain their study question or say whether they had a continuous or categorical outcome variable. That’s what led to my original belief that coding and statistical programming should not be a large part of that first course: I felt that it encouraged doing before people had fully developed understanding. Introducing programming too early, especially with tools that allow point-and-click analyses with minimal user input, seemed to promote an attitude of “just open up Stata & start running regressions for your primary endpoint” rather than starting with careful consideration of the study question, checking whether the data at hand were appropriate to test the hypothesis, crafting the proper approach to answer said question, etc.

However, I reflect further on my own learning and realize how much of it has happened by doing.

As Tom said…

YMMV, but I just don’t understand a statistical concept that well until I have actually executed it on a real dataset, and the only way to do that efficiently is through software and coding, so it’s kind of inescapable. Furthermore, without getting those reps, I don’t think any amount of theoretical teaching really sinks in well enough for meaningful understanding and retention, IMO.

This is a great point, presumably stemming from your thread yesterday, and I admit it’s difficult to reconcile: how does one construct a curriculum for a good biostatistics course series geared toward physicians when so much of the medical literature is built around a handful of constructs (like p-values) that many statisticians are now poking holes in? A part of me would feel it was a significant omission to teach introductory statistics to a medical audience without thorough coverage of the common frequentist methods (t-tests, ANOVA, chi-squared tests, logistic regression, Cox regression, Kaplan-Meier curves), if only because those account for 75 percent or more of the outcomes research in medical journals. Tough question, @byrdjb.

Sorry for basically just repeating this, James, but nicely said.

This is interesting, because I always feel like some of this should be common sense to anyone who’s worked with data and/or lots of projects, but then again real-life experience says that no, these people absolutely need help defining this. It seems self-evident that naming a file “Althouse_Blood_Project_Dataset_01.15.2018” is much more informative than “Blood_dataset_final_Final_FINAL_january_updates”, but after you’ve seen a few dozen smart people (you all graduated medical school, for heaven’s sake!) show up with datasets named like the latter, you do wonder.

It feels strange to start a statistics course lecturing students about how to name files, but good principles of data management definitely need to be part of that discussion from the beginning.
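The sortable-file-name idea is easy to operationalize, too. A tiny sketch (Python, with invented project and file names, not anyone’s real data) shows why putting an ISO-8601 date up front makes alphabetical and chronological order coincide:

```python
# Toy illustration of file naming that is machine-readable,
# human-readable, and sorts chronologically. All names here are
# hypothetical examples, not real project files.
from datetime import date

def project_filename(project, description, when, ext="csv"):
    """Build an ISO-8601-dated, underscore-delimited file name."""
    slug = description.lower().replace(" ", "-")
    return f"{when.isoformat()}_{project}_{slug}.{ext}"

names = [
    project_filename("blood", "followup labs", date(2018, 3, 2)),
    project_filename("blood", "baseline labs", date(2018, 1, 15)),
]
# ISO dates sort lexicographically, so sorted() gives chronological order.
print(sorted(names))
# → ['2018-01-15_blood_baseline-labs.csv', '2018-03-02_blood_followup-labs.csv']
```

Names built this way never need a human to remember whether “final_FINAL” came before or after “january_updates”.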

Great, great point.

Very sensible, yes. All great points, and leaning towards the argument that implementation does need to be taught in the courses if the objective is to empower junior clinician-researchers to get their research program running.


I failed to mention one other thing. If, as a matter of course or pragmatism, one must learn to do at the same time one is learning to understand, there will be a certain joy in understanding + doing that is unlikely to go away (for better or worse).


Excellent points @ADAlthousePhD (and thanks for showing me that I can quote something)

I know where you’re coming from with this comment. However, it seems to be a universal failing early on. It is so much of an issue, even among beginning biostatisticians and data scientists, that Jenny Bryan has a popular talk on “Naming Things”, and I think she was also involved in a group that wrote a nice paper on the topic. So I don’t think anyone should be surprised that physician scientists with limited coding background have not already learned it. Even though I was taught it, my practice patterns are still evolving. I am continually re-educating people in the lab, as there is always turnover, and I can’t personally do everything myself.

In the end: yes, the course should teach the investigator to do it. I feel strongly about this. Students should also be left with an education on why maybe they shouldn’t always “just do it” and why their biostatistician is still needed. If they are taught well, the data they hand over will be well organized and clean, will require less data wrangling, and their biostatistician will like them more.


I’m Matt, by the way. Can’t figure out how to put that on my name… as James “Brian” Byrd knows, going by your middle name can be tricky.


Good points, Andrew & Matt. “Naming Things” is a huge issue. “Red Blood Cell Count (5E+12 cells/liter)” can seem like a sensible variable name to someone uninitiated in computation. I like your points, Matt, about knowing your limitations.

However, I would also point out that, in a sense, there is an ongoing ‘permanent attack on the (biostatistics) present,’ to use Richard Horton’s neat turn of phrase, and literally any and all analyses you and I or most biostatisticians might perform could be immediately critiqued?


@JamesLuther Very much appreciate the “Naming Things” link. I will definitely be sharing it with the next MSCI course.

This discussion highlights one of the universal challenges of teaching: the students come from a wide variety of backgrounds with a wide variety of skills (and shortcomings). The challenge is to structure the class in a way that engages and challenges the whole spectrum of students. On day one, we will have students who are very computer savvy and students who will stare blankly when I ask “In which folder did you store the file?”, not really familiar with the idea of folders, let alone file management.

Thankfully, the advice from Dan Byrne has proven to be true. He told me when I started teaching the course that MSCI students are some of the most talented folks I’ll ever work with, and that they are fast learners. Students that struggle at the start generally get up to speed very quickly.


It’s not just this, and not just clinical investigators (though I understand this thread is about physician researchers specifically).

The entirety of biomedical science is becoming ‘big-data driven’, and for any researcher, you’re doing your group, your project, and your students a disservice by not training and mentoring them into some amount of analytical capacity. Aside from people coming out of dedicated stats/biostats programs, many PIs (early, mid, and senior) have never been formally trained in how to approach these kinds of data conceptually or practically. Effective partnering with well-trained statisticians is good, but not a generally scalable solution, especially if those statisticians are also faculty with their own scholarly goals rather than service core/staff.

So, it also needs to be considered that these students going on to academic research careers need to be able to guide, troubleshoot, and teach others who are (also?) untrained. I would argue the practical experience is incredibly helpful to build that ability.


Hi Joanne! Great points you’ve raised.


I really appreciate the initiation of this thread, because it is a topic that is near and dear to me. Many of the responses so far have been excellent. I recently finished my epi/biostat-heavy MPH at a non-Vandy program, and my education was almost entirely frequentist-based (except for decision analysis coursework). I didn’t give much thought to Bayesian statistics, because I wasn’t exposed to it beyond Bayes’ rule, which isn’t really exposure at all. Since that time I discovered the rabid statisticians–Bayesian statisticians–through Frank on Twitter. Since then I have had to work on my own to learn Bayesian stats, mostly through books by John Kruschke and Richard McElreath.

I have found that working through these concepts has not only improved my understanding of statistical inference but has also improved my intuition with regard to probability. I have only taken Calc 1 and 2 up to this point, and despite the typically rigorous mathematical background required in most Bayesian courses, I have been able to at least understand the underlying concepts. Additionally, while Stan and even JAGS are popular, there are R packages, like rstanarm, that allow frequentist programming knowledge to be used for Bayesian analysis.

My point is that if I can do it, I think we can expect industrious physician students to do so as well. The point of the training should be an introduction to the concepts, basic analysis, and comparison to frequentist statistics, so that students leave the course with a good understanding of why and when the Bayesian approach is clearly superior.
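For readers new to the contrast, a minimal conjugate-update sketch (Python, with invented counts; rstanarm and Stan handle the realistic cases) shows how even the simplest Bayesian posterior differs from the frequentist point estimate:

```python
# Minimal Beta-Binomial sketch contrasting a frequentist point estimate
# with a Bayesian posterior, using the conjugate update rule
# posterior = Beta(a + successes, b + failures). Counts are invented.
successes, failures = 7, 3

# Frequentist: maximum-likelihood estimate of the event probability.
mle = successes / (successes + failures)

# Bayesian: uniform Beta(1, 1) prior updated by the same data.
a, b = 1 + successes, 1 + failures
posterior_mean = a / (a + b)

print(f"MLE            = {mle:.3f}")             # 0.700
print(f"posterior mean = {posterior_mean:.3f}")  # 0.667
```

The posterior mean is pulled slightly toward the prior, and with more data the two estimates converge; that shrinkage behavior is exactly the kind of intuition a comparative course module can build.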


Excellent thoughts. To focus on one part: there are special needs of future clinician researchers who ultimately arrive at smaller medical centers without adequate biostatistics support. But for the moderate-to-large academic medical centers, we work, and need to continue to work, on scalability by having adequate numbers of biostatistician collaborators, and by incentivizing even those who get their own R01 methods grants to collaborate closely with biomedical scientists. Having great, well-trained investigators to work with is the most important incentive, along with having a bit of protected time for breathing room between grants. When I work with our MSCI grads at Vanderbilt, it is a pure joy.