I am a graduate of a Master of Science in Clinical Investigation program (Vanderbilt, around 2008), in which, in my opinion, talented teachers explained traditional frequentist statistics. Yet my current understanding is that frequentist statistics is considered inferior, or perhaps brutish, when compared to other approaches, and that this view is held even, or especially, in the biostats department in which I trained. Truth be told, it pains me to read comments denigrating the education I received relatively recently. And it worries me that this pattern of teaching one approach and then denigrating it as inferior to a different method will continue without resolution.
I find myself wondering: from whence resistance to teaching enlightened statistics? If it’s being taught incorrectly, what’s the barrier to teaching correctly?
I’ve been guilty of making comments that sound like denigration of the statistical education that clinicians have received. I need to clarify that we statisticians can be primary root causes of this, with secondary blame on clinicians for believing that their literature sets the trend for which statistical methods should be used in medical research for all time. Journal editors and reviewers share a lot of the blame. But it’s best for all of us not to think of methods used in the past as disasters but rather to concentrate on continual quality improvement.
We won’t make enough progress until teachers teach new approaches on at least an equal footing with traditional approaches. A huge issue is that I keep hearing that we need to teach so that students will understand what they see in the literature. That will never achieve the quality improvement we need.
What I do in co-teaching the Biostatistics II course in the MSCI program from which you graduated is to reflect, at every opportunity, on what would be different with a Bayesian approach. But we don’t have time to teach the Bayes how-to. I hope we can find a way to do that. Software development is certainly helping.
Another hindrance is the way we teach probabilistic diagnosis. We teach it from the p-value/type I/type II error approach - there is an exact analogy with specificity and sensitivity, which IMHO just divert the student away from understanding the real conditional (patient-specific) probabilities that are relevant to the clinical decision. We are still teaching that post-test probabilities need to use sensitivity and specificity when the data come from a prospective cohort!
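To make the point concrete, here is a minimal Python sketch using a made-up 2×2 table from a hypothetical prospective cohort (all counts are invented for illustration). The post-test probability can be read straight off the test-positive row; the sensitivity/specificity route lands on exactly the same number via Bayes' rule, but only after a detour through column proportions and a prevalence.

```python
from fractions import Fraction

# Hypothetical prospective cohort, cross-classified by test result and disease status.
tp, fp = 90, 60    # test positive: diseased, not diseased
fn, tn = 10, 840   # test negative: diseased, not diseased

# Direct route: in a cohort, P(disease | test result) is simply a row proportion.
ppv_direct = Fraction(tp, tp + fp)
npv_direct = Fraction(tn, tn + fn)

# Indirect route: sensitivity and specificity are column proportions;
# Bayes' rule with the cohort prevalence reassembles P(disease | T+).
sens = Fraction(tp, tp + fn)
spec = Fraction(tn, tn + fp)
prev = Fraction(tp + fn, tp + fp + fn + tn)
ppv_bayes = sens * prev / (sens * prev + (1 - spec) * (1 - prev))

print(ppv_direct, ppv_bayes)  # 3/5 3/5
```

Because the two routes agree algebraically, sensitivity and specificity add nothing here; with patient covariates one would instead model P(disease | test result, covariates) directly.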
I would love to get a variety of responses to your excellent and all-too-timely question.
I would be cautious about describing statistical methods as “inferior” or “superior”. I am a huge fan of the Bayesian approach and firmly believe that most human-oriented research would benefit from Bayes, but there are IMHO valid uses for null hypothesis significance testing (NHST). E.g., physics has a lot of situations where a theory predicts an exact value, and it makes sense to use NHST to subject the theory to a severe test (as Popper would have it). In this spirit, I think the framework advocated by Deborah Mayo and others, which recasts many ideas of NHST as putting theories to the test, is useful, interesting, and a good way to interpret published results that use NHST. (I liked this introduction; she has a new book out which is likely better, but I have not read it.)
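A minimal sketch of the severe-test idea, assuming a normally distributed measurement with a known standard error (the numbers are hypothetical): the theory's exact predicted value is the null, and a small p-value means the measurement would be very surprising if the theory were true.

```python
import math

def two_sided_z_test(estimate, predicted, std_error):
    """Return (z, p) for H0: true value == predicted value,
    assuming a normal sampling distribution with known standard error."""
    z = (estimate - predicted) / std_error
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

# Hypothetical: theory predicts exactly 1.000; experiment measures 1.012 with SE 0.004.
z, p = two_sided_z_test(estimate=1.012, predicted=1.000, std_error=0.004)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 3.00, p = 0.0027
```

The test is "severe" only because the prediction is sharp: a theory that predicted merely "some nonzero effect" could not fail this way.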
In my reading, Mayo’s approach is surprisingly compatible with what Andrew Gelman (from the Bayesian camp) advocates for: test and criticize your statistical models, be aware that your model answers a specific question, care about effect sizes, stay firmly rooted in philosophy of science. They differ on the mathematical tools, but those are IMHO important ideas in general.
However, neither Mayo nor Gelman is taught in schools. I think part of the problem (from my experience in the Czech Republic) is that a lot of people teaching statistics like scolding non-mathematicians, and they are expected to cover a lot of ground in a short time, so the statistics class becomes just a rite of passage you have to endure, with no incentive to make people actually understand anything. This suits frequentist statistics well, as you can teach it as procedures without understanding, and it gives you a lot of opportunities to “gotcha” anyone who is not an expert.
Thanks, Martin, for raising these interesting points. Re: inferior/superior, I am trying to describe a certain zeitgeist, but I don’t mean to say one method is better than another (surely, that’s application-specific).
Oh, Dr. Byrd, my friend, I am not lost for words on this particular subject…
It pains me, too, given that 1) I am even more “statistically” trained (and lack the ability to fall back on being a gifted clinician, as I’m sure you are); 2) my education was exclusively using frequentist statistics; and 3) every paper I’ve published has used frequentist statistics.
And yet, I have been participating in such discussions as well, because I have been frustrated by the degree to which frequentist statistical interpretations are misused or tortured (most recently, watching people describe the CABANA trial primary results with phrases like “ablation was no better than…” or the even more tortured “ablation was not superior to…” when the data were not especially consistent with those statements), and see the potential for Bayesian statistics to offer what I believe to be meaningful improvement in how we design, conduct, and report trials.
Since Frank already covered the former phrase (statisticians as root cause), I’ll hit the latter phrase (clinicians believing that their literature sets the trend for which statistical methods should be used). I have been directly engaged in projects where I was unceremoniously ordered (in some cases quite nastily) by clinicians to carry out suboptimal statistical approaches, even after attempting to explain why Approach A was flawed and Approach B would be better, because Approach A is “what people are used to” or “what clinicians expect” or whatever doc-splanation they had for me that day. It was quite clear that they believed exactly what Frank so eloquently described just now: that their literature dictates what statistical methods should be used, rather than careful consideration of their question and data to formulate a proper analytic approach.
That’s admittedly some of my personal frustration boiling over, but if you wonder why some statisticians take a rather acerbic tone towards clinicians/medical literature, you can probably start with anecdotes like that sowing frustration in the statistical community. It gets tough for statisticians to play nice after the first 99 incidents of a physician telling them (not asking - telling) that they just need to create a multivariable model with all of the p<0.05 variables in it because that’s just how statistics are done and that’s what the journals expect.
But I suppose where they go low, I should go high, or something, so let’s just move on.
There is hope, based on my discovery yesterday of very good statistical guidelines for the Annals of the American Thoracic Society and the Annals of Emergency Medicine, that some journals might be starting to come around. It’s going to take time, of course, but I was encouraged by that, and I may advocate writing something similar for one or more journals that I am involved with.
Coming back to the original question, this is a huge part of the problem: medical trainees are being taught statistics informed by the philosophy above, i.e., what current program directors think trainees need in order to understand what they will see in the literature, based on historical norms (@martinmodrak also has a great point that I’ll hit later). I was asked to give a statistics lecture to help cardiology fellows prepare for the boards. I asked what statistical content was on the boards and was told that it was basically just sensitivity, specificity, PPV, NPV, and some basic stuff about significance testing and regression models…so that’s what I spoke about. Teaching to the test. Sigh.
Other than that, whenever I have had the opportunity to speak to trainees through journal clubs or seminars, I have always intentionally chosen something outside the standard frequentist toolbox (EOLIA trial, DAWN trial) in an effort to expand their knowledge base a little bit. Next year, I will be teaching in Pitt’s ICRE, and I will be interested to see how my opinion on this changes over the next several years.
That’s the key, with an addendum. We’re caught in this weird self-perpetuating cycle: the majority of clinician-researchers have been trained using frequentist statistics (and many cling to a handful of guiding principles VERY rigidly); they also tend to be the people reviewing papers and mentoring junior clinician-researchers; so when I, a junior faculty statistician, work with a cardiology fellow, I have to convince the mentor that Approach B is better than Approach A, and heaven forbid our paper get returned with reviewer comments asking why we didn’t just use Approach A since that’s what everyone else does…and that trainee witnesses this process and basically remembers that Approach A is what they should do next time to minimize the fuss.
That cycle needs to be broken somehow, but we run into that problem @martinmodrak brought up - limited time to teach, most people are only going to take one statistics class, you have to understand several key frequentist stats principles before you even get to Bayesian approaches (at least, I think so - would be curious if @f2harrell agrees)…so it’s really hard to get away from that standard approach of teaching trainees the regular combo platter of mean, SD, t-test, ANOVA, chi-squared test, p-values, linear regression, logistic regression, maybe a Kaplan-Meier curve, and we’re out of time, everyone have a good summer!
This is a great discussion topic, I will be curious to see other opinions.
I also think there are some simple reasons for not adopting modern stuff:
No clear alternative
While I guess most people within statistics departments would agree that the current state is not good, there is much less consensus on what should replace it; the discussions are still raging (and that’s why I find it unhelpful to try to find a “best” method, as it fuels the flame wars). There are many flavors of Bayes, some attempts to make frequentist methods better, the likelihood school… Universities don’t want to fall for a short-lived fad, so it makes sense for them to be conservative and keep teaching what has always been taught until the dust settles. Not to mention that many applied researchers don’t see any problem with the current state of affairs.
While the ideas behind the alternatives are old, practical implementations are a (relatively) new thing. Taking Bayes as an example, the BUGS project started in 1989, but it has a lot of limitations and was (AFAIK) not really approachable for non-techies. And modern Hamiltonian Monte Carlo (especially Stan) is basically cutting-edge. The well-thought-out methodology papers from the Stan community and the approachable R wrappers for Stan are products of the past few years. It would actually be surprising if they were already widely adopted in basic stats courses.
In my view we need to separate existence hypotheses from the much more common setting where the estimate of a quantity of interest (e.g., drug efficacy) is the goal. The physics example is the former, and that style of research is not very present in biomedicine.
Alternatives are getting clearer by the day. Some of that is due to rapid advances in Bayesian software systems (e.g., Stan, brms, rstanarm). Some is due to more and better Bayesian teaching, plus advances in the likelihoodist school. We’ll make more progress when we really emphasize optimum decision making, which uses Bayesian posterior probabilities as direct inputs. The main reason our progress is not faster is that statisticians and non-statisticians who were first taught frequentist methods are reluctant to admit that education does not end with graduate school.
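One way to see posterior probabilities acting as direct inputs to a decision is the textbook two-action, two-state case. The sketch below assumes, purely for illustration, a binary "benefit / no benefit" state and invented loss values; it picks the action with the smaller posterior expected loss, which is the same as treating whenever the posterior probability of benefit exceeds loss_treat_if_no_benefit / (loss_treat_if_no_benefit + loss_withhold_if_benefit).

```python
def best_action(p_benefit, loss_treat_if_no_benefit, loss_withhold_if_benefit):
    """Return the action with the smaller posterior expected loss.

    p_benefit: posterior probability that the treatment truly helps.
    loss_treat_if_no_benefit: loss from treating when there is no benefit.
    loss_withhold_if_benefit: loss from withholding when there is a benefit.
    """
    expected_loss_treat = (1.0 - p_benefit) * loss_treat_if_no_benefit
    expected_loss_withhold = p_benefit * loss_withhold_if_benefit
    return "treat" if expected_loss_treat < expected_loss_withhold else "withhold"

def treatment_threshold(loss_treat_if_no_benefit, loss_withhold_if_benefit):
    """Posterior probability of benefit above which treating has lower expected loss."""
    return loss_treat_if_no_benefit / (loss_treat_if_no_benefit + loss_withhold_if_benefit)

# The same posterior probability under two different (made-up) loss structures.
p = 0.94
print(best_action(p, 1.0, 3.0), treatment_threshold(1.0, 3.0))    # cheap, safe treatment
print(best_action(p, 20.0, 1.0), treatment_threshold(20.0, 1.0))  # costly or harmful treatment
```

The same probability of 0.94 leads to treating when an unnecessary treatment is cheap relative to a missed benefit, but not when it is 20 times worse; the cutoff comes from the losses, not from a conventional significance level.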
There are many Bayesian resources, only a few of which I mentioned here. I’m working on a longer list. So far the most accessible software model seems to be the R brms package.
More importantly, the time needed to explain type I and type II errors, null hypothesis testing, effect sizes used in power calculations, and what sampling means and how the study design needs to factor into the computation of p-values is greater than the time needed to explain the entire Bayesian machine. Bayes is much more straightforward (but not the computations for the “final answer”). This can be addressed by creating a large number of template analyses in dynamic notebooks. We’re working on that for the two-sample t-test. Teaser: the Bayesian solution never needs to assume equal variances.
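As a rough sketch of why equal variances never need to be assumed, here is a Python version of the simplest textbook Bayesian two-sample comparison: each group gets its own mean and its own variance under the standard noninformative prior p(mu, sigma^2) proportional to 1/sigma^2, so each group mean has a shifted, scaled Student t posterior with n - 1 degrees of freedom. This is an assumed minimal version, not the notebook template mentioned above (which would more likely use Stan or brms with proper priors), and the blood pressure data are invented.

```python
import math
import random
import statistics

def posterior_mean_draws(data, n_draws, rng):
    """Draws from the marginal posterior of one group's mean under the
    noninformative prior p(mu, sigma^2) proportional to 1/sigma^2:
    mu | data ~ ybar + (s / sqrt(n)) * t with n - 1 degrees of freedom."""
    n = len(data)
    ybar = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)
    df = n - 1
    draws = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-squared draw with df degrees of freedom
        draws.append(ybar + se * z / math.sqrt(chi2 / df))  # shifted, scaled Student t
    return draws

def prob_b_better(reduction_a, reduction_b, n_draws=20_000, seed=1):
    """Posterior probability that the mean reduction under B exceeds that under A.
    Each group has its own variance, so equal variances are never assumed."""
    rng = random.Random(seed)
    mu_a = posterior_mean_draws(reduction_a, n_draws, rng)
    mu_b = posterior_mean_draws(reduction_b, n_draws, rng)
    return sum(b > a for a, b in zip(mu_a, mu_b)) / n_draws

# Invented systolic blood pressure reductions (mmHg) for two drugs.
drug_a = [4, 7, 2, 9, 5, 6, 3, 8]
drug_b = [10, 14, 6, 18, 9, 12, 15, 11]
print(f"P(drug B lowers BP more than drug A) = {prob_b_better(drug_a, drug_b):.3f}")
```

The output is a direct probability statement about the drugs rather than a p-value, and the Monte Carlo answer varies slightly with the seed and the number of draws.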
So it really comes down to your priorities in designing a course and whether you want to train students for the present or for the future.
Just to add a bit of moral support - as I alluded above, I am a card-carrying statistician by trade and I am also trying to learn this on the fly. Frank has provided some valuable resources already. This forum has proven crazy good in its infancy, too - I was able to grab some R code for something I had never done before (the example from the PARAMEDIC2 trial), run it, scrutinize it, and figure out what it did and why. Without the starting point here it would have been much harder to pick up. I know it may sound very haphazard, but sometimes the best way to learn is just finding some worked out examples from a resource such as this one, dumping the code straight into R, then looking up the individual commands one by one to teach yourself.
Have you ever read Karl Claxton’s “The irrelevance of inference”? Health economics in Canada has moved towards an (almost) fully Bayesian approach to health technology evaluation on the back of this paper and related work from teams at Bristol and Leicester (Nikki Welton, Alex Sutton, Tony Ades).
I’m curious what you think about the challenge of a loss function being inherently location specific (slight exaggeration) whereas p-values (and 95% Bayesian credible intervals) have somehow managed to establish themselves as a sort of universal decision-making tool?
Also curious what your thoughts are regarding Professor Claxton’s argument that uncertainty around a decision is relevant to decision making only if you’re planning to gather new evidence to reduce it; otherwise, you should make your decision today, based on the point estimate.
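Claxton’s argument fits in a few lines of code. The sketch below assumes you already have posterior (or probabilistic sensitivity analysis) draws of incremental net benefit (INB: the monetary value of the health gain minus the extra cost) for a new versus current technology; here the draws are simulated from an arbitrary normal distribution purely for illustration. The adopt/reject choice uses only the expected INB, and uncertainty enters only through the expected value of perfect information (EVPI), an upper bound on what further research could be worth per patient. This is the generic EVPI calculation, not code from Claxton’s paper.

```python
import random
import statistics

def decision_and_evpi(inb_draws):
    """Two-option decision (new vs current) from draws of incremental net benefit.

    Returns (adopt_new, expected_inb, evpi_per_patient).
    """
    draws = [float(x) for x in inb_draws]
    expected_inb = statistics.fmean(draws)
    # Decide today on the expected value alone: adopt if expected INB > 0.
    adopt_new = expected_inb > 0
    # With perfect information we could pick the better option in every draw.
    value_with_perfect_info = statistics.fmean([max(0.0, x) for x in draws])
    value_with_current_info = max(0.0, expected_inb)
    evpi = value_with_perfect_info - value_with_current_info
    return adopt_new, expected_inb, evpi

# Hypothetical INB draws (currency units per patient), for illustration only.
rng = random.Random(42)
inb = [rng.gauss(500.0, 2000.0) for _ in range(50_000)]
adopt, mean_inb, evpi = decision_and_evpi(inb)
print(f"adopt new: {adopt}, expected INB: {mean_inb:.0f}, EVPI per patient: {evpi:.0f}")
```

Multiplying the per-patient EVPI by the number of patients affected gives a population EVPI; if that is smaller than the cost of further research, the framework says to decide now on the expected value.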
Thanks for the good recommendation. I just printed it to read soon. I can identify with that approach. In fact, I think that inference is overrated even when not doing formal decision making. For example, thinking probabilistically offers many advantages: “drug B probably (0.94) lowers blood pressure better than drug A”. Such thinking doesn’t have to deal with hypothesis testing or uncertainty intervals.