What would get Stan more use in FDA IND applications?

Lately, I’ve been thinking about how to get Bayesian methods and software used more for FDA approvals. Right now, we have been working with customers on internal decision-making, but we would love to do the community a service and lower the barriers to using Stan in INDs.

There is a lot of nuance and different things that can be done. Right now, I’m considering:

  • software: is there a layer that needs to be added to make it easier to use?
  • software: would it help the industry as a whole if there was validation of Stan?
  • software: if there was a service around this that guaranteed reproducibility (both for a submitter and for the FDA), does that help? Or even standard operating procedures?
  • education: is there a lack of information on how to use Stan? would an immediate benefit be creating more content around this?
  • repository: would it help to create repositories of models that behave well?

I’ve been navigating FDA procedures and finding different docs on software validation. If anyone has dealt with this, either on the FDA side or industry, and would be open to a discussion, please reach out. (Responses here are great too… I feel like a 30 minute chat would be pretty effective at communicating quickly.)


it’s an interesting Q. you would need to infiltrate the industry groups eg phuse and psi. i don’t see validation as an issue because they would have to validate the analysis anyway (if you have a stan person the sas person would validate them, this is routine). the ‘problem’ with industry is that the emphasis on efficiency and error-free results means they are somewhat averse to novel software and methods. a cro just has no time to play with new tools; they are constantly up against it. maybe the software could be used in the early phase work where bayes is more likely to appear (eg phase I dose finding, adaptive designs) and then people become familiar with it?



There are likely two “high-level” issues to consider and I will defer to others with more “hands on” interaction with the FDA on this topic.

The first is, irrespective of software implementation, the acceptance by the FDA and other regulatory agencies, of Bayesian methods in submissions. As you are probably aware, there is 2010 guidance from the FDA that addresses this at various levels:


That being said, there is relatively new draft guidance, which includes discussion of Bayesian methods, that you may want to review if you are not already aware of it:


This draft document begins to provide insights into current thinking within the FDA, that is more recent than the 2010 document. That may help you to identify challenges on the FDA side relative to general acceptance.

The second issue is software, and as you likely know, there is the 2015 FDA statistical software clarifying statement:


which makes it clear that the FDA does not require (nor endorse or validate) any specific software, despite long-standing perceptions to the contrary.

Validation is essentially all on the end-user, not on the software publisher. That is, there is no “Good Housekeeping Seal of Approval” stamp on any software, physically or virtually, that obviates the burden on the end-user. So, a lot of the software side of things comes down to the end-user having internal SOPs (e.g. IQ/OQ/PQ) that are relevant in this domain. The scope of these SOPs will be largely driven by how risk-averse the end-user’s operating environment is.

That does not mean that the software publisher, author, maintainer and indeed the community at large, cannot assist in that process by creating and making available tools and procedures that can streamline the process for the end-user. They need to be motivated, given the extra burden involved, to support the use of these tools within this specific domain, which may only represent a small proportion of their user base.

Some years ago (circa 2006-ish), a small working group, including Frank and me, drafted a guidance document on the use of R in regulatory submissions, which has been updated over time:

The intent of the document was to hopefully clarify some key issues, and to help provide a framework that would enable an increasing use of R in this domain. Looking back, I hope that it has, and there has also been greater acceptance of R at the FDA, thanks to both increasing internal support and comfort with open source software more generally.

In the intervening years, some of the commercial vendors of R have built toolsets and implemented internal validation processes, resulting in more restrictive “validated releases” that have undergone these additional internal quality checks and are offered as value-added services to their clients. I can’t speak to the impact that these services have had, as I don’t use them and have not heard from folks that do.

There are community efforts, and @pmbrown referenced PSI, which I know has been involved in recent years, at least with R, in providing a forum for the discussion and implementation of tools that may be helpful. There is more information on their efforts here:


and you may wish to contact them to see if there are any relevant efforts, either within their organization, or perhaps with other parties, that would be relevant to Stan.

As @pmbrown notes, there is a good chance that larger and more established industry entities will be more resistant, and will require more documentation and processes to achieve a level of comfort. Change is not easy, and it can be hindered by the normal human tendency to resist it. There need to be clear, value-based catalysts for the change, and the nature of those will vary depending upon the environment.

We have seen that, for example, with R more generally over the years, and that has been aided by new statistical staff coming on board with experience in using R, both on the industry side and on the FDA side, and that has slowly changed the internal dynamics. On the industry side, there has also been a slow, but measurable, migration of the tools over the years from the pre-clinical side to the clinical side as comfort increases.


Stan is being used a good deal at FDA and there is nothing holding it back. The biggest obstacles in my experience are:

  • pharma statisticians not being trained in Bayes
  • FDA statisticians not being trained in Bayes
  • academic statisticians not being trained in Bayes
  • FDA reviewers are not always prepared to review Bayesian designs
  • the use of complex hybrid frequentist/Bayesian approaches that try to placate frequentists but bring the worst of both worlds
  • as with the continued use of SAS in pharma, pharma leaders feel safe and don’t get criticized enough for doing things the old way; they like to minimize risk even though they consistently lose $ with the old approaches

I’d like to push back a bit against the uncritical supposition that Stan \equiv Bayes. My April arXiv paper [1] is addressed in large part toward inadequacies in FDA review of INDs, and employs a JAGS model with a geometry that Stan apparently does not handle well. (The observations modeled arise—like so many clinical assessments in medicine—via interval censoring of a latent continuous variable.)

Even strictly from an implementation perspective, the fully declarative nature of JAGS proved immensely useful in this application:

But in regulatory applications where validation of an application is especially desirable, declarative programming has special, additional value. I expect that a Stan implementation of my model in [1] would be much longer and less transparent, and perhaps not even sample as well given the geometry. (If I’m wrong about this, I would love to know!)

  1. Norris DC. Retrospective analysis of a fatal dose-finding trial. arXiv:2004.12755 [stat.ME]. Published online April 27, 2020. https://arxiv.org/abs/2004.12755
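
For readers unfamiliar with the term, here is a minimal sketch of what an interval-censored likelihood looks like. This is not the model from [1]: it assumes, purely for illustration, a normal latent variable and observations recorded only as the interval each latent value fell into. The function names (`normal_cdf`, `interval_censored_loglik`) and the example cut points are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of Normal(mu, sigma) at x; x may be +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def interval_censored_loglik(intervals, mu, sigma):
    """Log-likelihood when each latent value is known only to lie in (lo, hi].

    Each observation contributes log[F(hi) - F(lo)], where F is the CDF of the
    assumed latent Normal(mu, sigma) distribution.
    """
    total = 0.0
    for lo, hi in intervals:
        prob = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        total += math.log(prob)
    return total


if __name__ == "__main__":
    # Hypothetical graded assessments: each grade corresponds to an interval
    # of the latent scale defined by illustrative cut points at 0 and 1.
    observed = [(-math.inf, 0.0), (0.0, 1.0), (0.0, 1.0), (1.0, math.inf)]
    print(interval_censored_loglik(observed, mu=0.5, sigma=1.0))
```

In a sampler, a function like this would play the role of the data likelihood; the geometry questions raised above concern how well a given sampler explores the resulting posterior, which this sketch does not address.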

Could you elaborate more on this? I’d be interested in what has been done to bridge the gap between frequentist and Bayesian reports in applied settings. I’d be particularly interested in what you see as the worst features of the attempts at hybrid approaches.

There is a long line of thought, probably going back to at least IJ Good (maybe even earlier), that some sort of Bayes-Frequentist compromise can be worked out. Jim Berger and colleagues have written a number of papers on that.


Briefly, the hybrid approach that involves computing type I assertion probabilities \alpha for a Bayesian procedure has these problems:

  • if external information is used, it is impossible to even compute \alpha, but some statisticians pretend to do this anyway to placate reviewers. Once one has been allowed to borrow information from other studies, the null in the frequentist sense has already been ruled out.
  • to even define \alpha one must codify all the investigators’ intentions to analyze the data, which are not usually known before data are collected. For example, DSMB meetings may be canceled or rescheduled.
  • the mere act of computing \alpha makes reviewers think that \alpha is relevant to Bayesian procedures when it’s not
  • quoting \alpha makes reviewers think that \alpha is relevant once data are collected. Like pre-study power, it is no longer relevant.
  • “preserving” \alpha will lead to suboptimum decisions because frequentists (unlike likelihoodists) never worked out a satisfactory compromise between type I probability \alpha and type II probability \beta
  • the mixed approach changes the subject away from the quantity that should be verified for reliability: the quantification of evidence at the decision point as shown with simple simulations here; this is in stark contrast to computing pre-study tendencies like \alpha and \beta that are not functions of your data.

The “worst of both worlds” is completed by observing that study designers who aim at controlling \alpha have to play with either the prior or the interpretation of the posterior probability, making them lose their meaning.
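
To make the dependence of \alpha on the investigators’ intentions concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not something taken from a real trial: a normal outcome with known unit variance, a weak Normal(0, 10^2) prior on the mean, a rule that asserts efficacy when the posterior probability of a positive mean exceeds 0.975, and hypothetical function names. It estimates how often that rule fires when the true mean is exactly zero, for whatever schedule of looks is supplied.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_prob_positive(sum_y, n, prior_sd):
    """P(mu > 0 | data) for y_i ~ Normal(mu, 1), prior mu ~ Normal(0, prior_sd^2)."""
    precision = n + 1.0 / prior_sd ** 2
    post_mean = sum_y / precision
    post_sd = math.sqrt(1.0 / precision)
    return normal_cdf(post_mean / post_sd)


def type_one_assertion_prob(looks, threshold=0.975, prior_sd=10.0,
                            n_sims=20000, seed=1):
    """Fraction of null trials (true mu = 0) that ever assert efficacy.

    `looks` is a strictly increasing list of cumulative sample sizes at which
    the posterior probability is checked; the trial stops at the first look
    where it exceeds `threshold`.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sum_y, n = 0.0, 0
        for look_n in looks:
            # Under the null, the sum of the next (look_n - n) observations
            # is Normal(0, sqrt(look_n - n)), so draw it in one step.
            sum_y += rng.gauss(0.0, math.sqrt(look_n - n))
            n = look_n
            if posterior_prob_positive(sum_y, n, prior_sd) > threshold:
                hits += 1
                break
    return hits / n_sims


if __name__ == "__main__":
    print("one look:  ", type_one_assertion_prob([100]))
    print("five looks:", type_one_assertion_prob([20, 40, 60, 80, 100]))
```

The same data model, prior, and threshold give a different \alpha under each look schedule, so the number cannot be stated without fixing the schedule in advance; the posterior probability at any given look is computed the same way regardless.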


Thanks, all, for the thoughtful comments. This is clearly a complicated, multi-faceted topic; I would have expected no less.

@pmbrown: thanks. I’ll look into the industry group angle.

@MSchwartz: thank you very much for the response. All of that was very insightful.

I completely agree and was thinking about that as I posted the question. It’s not just a matter of having software that is trustworthy. It’s also about having the collective knowledge to apply it.

@f2harrell: thanks for both responses.

Thanks for emphasizing training. Where’s the place to start? I’ll be in the pharma industry for the long haul, so happy to start with a small step in the right direction.

And thanks for the response on \alpha… I’ve just started to run into this issue in practice.

@davidcnorrismd: thanks for your response and for linking your arXiv paper. Let me get back to you on that. (Just FYI, I don’t think “Stan = Bayes.” Among tools for Bayesian modeling and inference, I’d say that Stan is at or near the top [at this moment in time].)


Some resources are at hbiostat.org/bayes, but also consider joining http://www.bayesianscientific.org/ and taking part in its educational efforts.


psi journal club tomorrow will discuss two papers illustrating use of bayes in industry trials and one of the papers is authored by an fda person: https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ALLSTAT;1e24edd1.2011 (not sure if the time difference works for you ie 4pm gmt)


Thanks! I should make it there tomorrow.

Interesting paper. Been trying to figure out how to handle this situation in pediatric oncology. If you’re open to a chat sometime let me know.



Thank you, Mark! I’m always open to discussions of this stuff. Everyone should feel free to reach out to me at the email from the paper. Best regards, David

I have some experience on both sides (industry and regulatory) with preclinical IND studies, especially toxicology and safety studies. My opinion only from this point on. Trying to get preclinical CROs to use any sort of Bayesian approach in these studies is a difficult task - there is simply too much historical inertia. Studies are designed with a control and 3 levels of test article, and Dunnett’s or Dunn’s test is used to compare results to controls. This has been a standard for over 50 years. Study designs, sample sizes, etc. are etched in stone someplace. Analysis systems have been standardized and validated so that there is little, if any, input from a statistician unless something very unusual is being done. It simply is not cost effective to change when the return on investment AT THE PRECLINICAL LEVEL isn’t apparent.

So, how and where to change things? Again, only my opinion, but it is at the training level for toxicologists and pharmacologists. Until end users see the advantage of a Bayesian approach, there is little impetus to change. Contrast this with the clinical side, where the clear advantage of Bayesian analyses is changing the design and analysis of RCTs. But those are big expensive studies, and preclinicals are not - they are actually just canary-in-a-coalmine efforts to ensure that first-in-human and Phase I trials don’t use obviously toxic doses.


Well put.

The opportunities for adaptive designs to revolutionize that process are enormous, and Bayesian methods would be required to do proper inference in such an adaptive setup.