A Self-Paced Curriculum for Learning Data/Research Methods Well?

tl;dr: What do you think biomedical researchers and clinician-scientists should know about modern data methods? And, what specific resources would you point them to learn those concepts and skills? [Preferably open-source or free, but ultimately whatever resource is best].

Hi all. First time post here. I’m an MD-PhD student interested in improving my ability to design experiments and derive insight from my data.

A recurring theme I see on this forum is how many biomedical scientists and clinicians do not have a good understanding of the concepts and methods they need to properly analyze their data and derive accurate insights.

Given the experience and wisdom of the people on this site, I thought it would be worthwhile to start a discussion around what a remedy to that situation would look like. Indeed, I was inspired by this thread and want to follow up on the specific details.

What do you think biomedical researchers and clinician-scientists should know about modern data methods? And, what specific resources would you point them to learn those concepts and skills? [Preferably open-source or free, but ultimately whatever resource is best].

Finding the answer is a big challenge for people like me who want to learn and use best practices. There is a glut of seemingly excellent resources on Coursera, edX, and around the web (e.g. R for Data Science), but knowing which ones to use and in what sequence is difficult. Prioritizing is hard too: which things are high-yield and absolutely critical, and which are nice to have but more niche?

So, if you had to come up with a sequence of concepts and skills (with accompanying resources and their applications), what would that look like? I know there is no one right answer, but hopefully with everyone’s proposals, we might converge on a core set of stuff that people can learn from.

I look forward to hearing your thoughts.


I’m an MD, not a PhD, but I have also wanted to learn more about statistics over the past few years. I ended up really liking both statistics and epidemiology, so my self-paced curriculum may be too in-depth for your needs, but I hope you get some useful ideas from it.

I started with the Coursera Data Science Specialization from Johns Hopkins, along with the Statistical Reasoning for Public Health I and II courses there. (Nowadays I would probably point you towards R for Data Science or a DataCamp subscription to quickly learn how to use R for clinical research.)

For Statistics 101 I used both the OpenIntro book and An Introduction to Medical Statistics by Martin Bland.

For applied regression methods, I used John Fox’s Applied Regression Analysis and Generalized Linear Models along with An R Companion to Applied Regression. If you want more applications of regression in R, both Linear Models with R and Extending the Linear Model with R, by Julian Faraway, are great as well.

When you want to dive deeper into strategies for making regression models behave when you encounter uncooperative data, there isn’t a better book than Regression Modeling Strategies by Harrell himself.

If you want to turn to the dark side and start doing Bayesian Statistics, there is no better introductory book in my opinion than the brilliant Statistical Rethinking by McElreath.

Finally, I think all physicians should have epidemiology knowledge at the level of Gordis Epidemiology or Rothman’s Epidemiology: An Introduction (this one is brilliant, concise and hits many good points on bias and errors).

If you want to dive a bit deeper into the basis of statistical theory itself, I would recommend Introduction to Probability by Bertsekas and Tsitsiklis for probability theory. So far, I haven’t encountered a Mathematical Statistics book I truly recommend, so I’m open to suggestions on that side as well :slight_smile:

Hope I have been helpful. There are so many interesting statistics books nowadays that it’s hard to find suggestions that please everyone :stuck_out_tongue:


Also, more important than reading all those books: find a place where you can learn from the best, apply the new methods you learn, and get feedback on how to use them correctly.

During medical school I volunteered in the biostatistics laboratory of my hospital and ended up learning a lot from the statisticians there, as well as having the pleasure of working on many interesting projects. So keep an eye open for simple and interesting clinical questions that arise during your practice and that you can solve with your knowledge. It is the best way to learn, in my opinion.


So far, I haven’t yet encountered a Mathematical Statistics book I truly recommend so I’m all up for suggestions on that side as well

I have found Jay Devore & Kenneth Berk’s “Modern Mathematical Statistics with Applications” (2nd ed.) to be quite readable, even for non-mathematicians. A good reference in my opinion, albeit a more technical one, is Keith Knight’s “Mathematical Statistics”.

Another very good Statistics 101 book, especially for epidemiologists and clinicians, is “Essential Medical Statistics” by Betty Kirkwood & Jonathan Sterne.

I can heartily support all your other suggestions.


I will have to check out Devore’s book then. My problem with Mathematical Statistics books is that I think they get too lost in the theory itself and don’t make enough connections to applications the way probability books do, so judging by the title it already looks promising :slight_smile:

I also heard from other clinical colleagues that Kirkwood’s book really made the connection between Statistics and Epidemiology visible so I guess I will have to check it as well.


Thanks for the recommendations @jafp and @COOLSerdash!

Any thoughts on machine learning resources? I’ve heard An Introduction to Statistical Learning is very good.



I don’t have much experience in ML, and I honestly think it is a bit overhyped in medicine, but the two books I always hear about are ISL for theory and Applied Predictive Modeling by Kuhn and Johnson for practice :slight_smile:
Other people probably have better suggestions.

Hi @achamess and everyone (1st time poster)
The Statistical Learning course is now available in a self-paced format here: https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about
It is a good course on regression in general.


Here’s a compilation of the best resources I’ve used over the years. (I’d put more links below, but my ‘new user’ status permits only two…)


Machine learning:

To highlight a few specifics, I refer most fellows and residents who want to understand statistical methods without a lot of math to Kirkwood and Sterne’s Essential Medical Statistics. It provides an overview of epidemiology in addition to a high-level contrast among different approaches.

The best Bayesian statistics book (maybe the best resource I’ve found overall) is Kruschke’s Doing Bayesian Data Analysis. The second edition, which contains high-level code to run rstan and rjags, is really useful. It helped me understand not only Bayesian stats but also probability distributions in general.

Finally, An Introduction to Statistical Learning, by James, Witten, Hastie, and Tibshirani, is a must-read for anyone who wants to delve into machine learning. The explanation of bias-variance trade-off changed my approach to modeling, and this book started me down the road to learning about a lot more ML and deep learning methods.
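
To make the bias-variance trade-off concrete, here is a minimal simulation sketch in plain Python. It is an illustration rather than code from the book: the sine-shaped true function, the noise level, the sample sizes, and the helper names `knn_predict` and `bias_variance` are all assumptions chosen for the example. It fits a k-nearest-neighbour average at a single point over many simulated training sets and reports the squared bias and variance of the prediction for a few values of k.

```python
import math
import random
import statistics


def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return statistics.fmean([y for _, y in nearest])


def bias_variance(k, x0=math.pi / 2, n_train=50, n_reps=500,
                  noise_sd=0.3, seed=1):
    """Simulate many training sets; return (bias^2, variance) of k-NN at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_reps):
        # Assumed data-generating process: y = sin(x) + Gaussian noise.
        xs = [rng.uniform(0, 2 * math.pi) for _ in range(n_train)]
        ys = [math.sin(x) + rng.gauss(0, noise_sd) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    bias_sq = (statistics.fmean(preds) - math.sin(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


for k in (1, 5, 25):
    b2, var = bias_variance(k)
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}")
```

With these settings, a small k should track the data closely (low bias, high variance), while a large k should smooth over the peak of the sine curve (higher bias, lower variance).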

I’ll also toss in that finding a user interface/programming environment is critical if you’re going to go it on your own. I started out using Stata IC, which is great because it has dropdown menus and an outstanding help manual that covers both practical and theoretical issues. Along with this, the UCLA website (stats.idre.ucla.edu/stata/) contains tons of code examples, which I often used when reading about various methods in textbooks, such as survival, longitudinal, and time series analysis. If you want to use R, then RStudio is a must, and for Python, Jupyter notebook or Spyder via the Anaconda platform are really useful. Finally, if you want to delve into deep learning, the keras package (via Python) is almost too easy to use, and has great support pages and online examples.