R vs. Python in academia

jarbet · October 13, 2024, 6:45pm

Hi everyone,

I’m curious to hear your thoughts on the importance of learning Python in academic biostatistics or closely related fields (e.g., statistics, bioinformatics, data science).

Background:

I’ve been using R for over 10 years and haven’t encountered a problem I couldn’t solve with it. I’m highly proficient and comfortable with R.
However, I’m wondering how essential it is to learn Python for career advancement in academia, especially in biostatistics.

Questions:

For someone like me with extensive R experience, how important do you think learning Python is for an academic career in biostatistics or similar fields?
Do you see a future where Python (or even Julia) might replace R in academia? Should I be concerned about R becoming less relevant over time?

Personally, I’d prefer to stick with R since it fits all my needs so far. But I’m open to learning new tools if it’s crucial for staying competitive in the field. I’m also curious to hear your perspectives on whether being fluent in more than one statistical programming language (like Python or Julia) provides a significant advantage.

R_cubed · October 13, 2024, 8:34pm

This video by a quantitative finance professional discusses some of the problems with Python. Keep in mind that R is the language developed by statisticians for statisticians. If you know how to do things correctly, you can port that knowledge to Python, but don’t depend on packages to do the thinking for you.

f2harrell · October 13, 2024, 9:48pm

Oversimplifying a bit, here are two important points:

Python is way behind R in statistical methods (for example, try fitting a partial proportional odds ordinal logistic model with random effects in Python).
If you are analyzing high-dimensional data matrices, as occur in imaging data, genomics, metabolomics, RNA-seq, and single-cell data, Python is going to have more capabilities and handle larger datasets.

EpiMD5 · October 13, 2024, 10:28pm

I wonder if someone might comment on R versus Python for Natural Language Processing.

Not me.

f2harrell · October 14, 2024, 12:12pm

A good area to consider. For one of the R options, see this, especially the example for the corpus package.

robinblythe · October 15, 2024, 2:16am

I had this exact discussion with a senior quantitative medicine professor at my institution. He is fairly certain that many of the capabilities of Python with regard to n-dimensional matrices will be available and optimised for R within the next few years.

I suppose a more pertinent question might be, how often have you run into problems you couldn’t solve with R?

jarbet · October 15, 2024, 3:50am

I haven’t run into any problems I haven’t been able to solve in R yet.

Perhaps this is a reason I should learn some basic Python, since I work in cancer genomics with the data you mentioned. Although I’ve been fine with R thus far, I have noticed that a lot of the bioinformatics journals I follow have been publishing more Python packages with their papers lately.

Long term, I suppose learning some Python will increase opportunities for me to collaborate with others in the field, although thankfully the lab I work at mostly uses R, so I haven’t had much of an issue so far.

Regarding the issue of working with big data that you and @robinblythe brought up, R has some infrastructure for working with datasets outside of RAM (e.g. the bigmemory suite of packages for working with large datasets on your hard drive, or various options for working with databases, like sparklyr or duckplyr). My understanding, though, is that these options support a pretty limited set of statistical models and tools, and that you will often run into a scenario where the model you need to run requires holding the data in RAM.
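
To make the out-of-RAM point concrete, here is a minimal Python sketch (standard library only) of the kind of statistic that is easy to compute without loading everything: a mean and standard deviation updated one record at a time with Welford's algorithm. The function name `streaming_mean_sd`, the column name `expr`, and the small in-memory stand-in file are all made up for illustration, and this is not meant to show how bigmemory, sparklyr, or duckplyr work internally.

```python
import csv
import io
import math


def streaming_mean_sd(rows, column):
    """One-pass mean and sample SD (Welford's algorithm).

    Only three numbers are kept in memory, however many rows are read.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for row in rows:
        x = float(row[column])
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    sd = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, sd


# Hypothetical stand-in for a file too large for RAM; in practice this could
# be an open file handle or a database cursor that yields rows lazily.
fake_file = io.StringIO(
    "gene,expr\nA,2.0\nB,4.0\nC,4.0\nD,4.0\nE,5.0\nF,5.0\nG,7.0\nH,9.0\n"
)
n, mean, sd = streaming_mean_sd(csv.DictReader(fake_file), "expr")
print(n, mean, sd)
```

Summaries like this stream well because each row is touched once. Models that are fitted iteratively over the full data matrix do not fit this pattern as directly, which is roughly the limitation I mean.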

f2harrell · October 15, 2024, 12:03pm

And by the way be sure to study Bioconductor before doing any genomics analysis outside of R.

js592 · October 15, 2024, 3:38pm

Andrew Gelman’s blog has recently had some interesting discussion on the future of cutting-edge Bayesian computation:
https://statmodeling.stat.columbia.edu/2024/10/08/defining-statistical-models-in-jax/
https://statmodeling.stat.columbia.edu/2024/09/25/stan-faster-than-jax-on-cpu/

I think if you are working with models and/or data that are pushing the limits of what R can handle, the natural next place to look would be Python. There are plenty of modern ML methods that are really only implementable outside of R (though the applicability of those to the datasets and topics discussed here is left as an exercise to the reader).

Anecdotally, I have had people tell me that many industry positions (at least more on the data science rather than biostats side) are using Python. But since I’m not actively interviewing, I couldn’t tell you if my personal experience backs that up.

Nikolai · October 22, 2024, 1:39pm

Thanks so much for raising this topic!
One issue at my medical university is that students are taught Python exclusively.
It’s really difficult for me to make the opposite case to many of my colleagues. For epidemiology and statistical analysis, R can solve any problem.
By contrast, for technical and engineering schools, OK, perhaps use Python…