Concrete examples of the pre-planned analysis going wrong

It’s a shibboleth of classical study design that all analyses should be pre-specified and strictly adhered to. But that can lead to poor decision-making, as in the case of carvedilol for heart failure under the FDA’s two-positive-trial requirement. The pre-specified primary endpoint was an exercise outcome that did not reach a statistically significant difference between treatment arms. However, most of the other outcomes, including mortality, showed a statistically significant benefit of carvedilol. The first time an FDA advisory committee evaluated the trial findings, it declined approval because there weren’t two trials with statistically significant findings on the pre-specified primary outcome. On subsequent review, a different advisory committee took into account the beneficial treatment effect on the secondary outcome of mortality, along with the prior information that drugs in the same class as carvedilol had previously been found safe and efficacious for heart failure, and approved the drug.

Setting aside the issues of NHST in the FDA approval process, what are your favorite case studies of pre-specified analyses going wrong? Either a pre-specified analysis that led to poor decisions or an exploratory analysis that found something that would have been overlooked if only the pre-specified analysis had been conducted.


I’m looking forward to some case studies related to your direct question. I can’t think of any example right now, but the following links discuss the FDA 2 trial rule, and its rather ambiguous status as a valid decision procedure.

There are some recent papers by Leonhard Held and colleagues on p-values for replication which discuss the 2 trial rule, which I’ll post later.


Thank you for sharing that post! I agree that the FDA 2 trial rule is difficult to justify as a valid decision procedure. Approaches like this paper by Isakov, Lo, and Montazerhodjat, which conducts a full decision analysis to weigh the burden of disease against the costs and benefits for patients, seem more justified than simply requiring a type I error rate of 0.05 for all.


The number of examples in which pre-planned analyses went wrong is astronomical, because so many plans are poorly thought out. The most frequent examples are so-called “negative” studies in which evidence for treatment effectiveness might have been found had covariates been adjusted for or a more information-rich outcome variable been used. We never hear about these because very few such studies get re-analyzed with appropriate methods.


the more i think about it the less i like analysis plans, but they’re so entrenched in people’s thinking it’s hard work pushing back. I recently raised the issue, ie time invested versus added value, and people had never heard the arguments before. The statistician is the one with the incentive to question this habit, because it is their time
I knew a pharma company that restricted the analysis plan to one page in the protocol, but the idea was too commonsensical to survive. Are there any papers out there making the case against excessive prespecification? I appreciate fully why a cro wants to protect itself against a capricious client, but outside that scenario a sap feels excessive to me. In academia the sap can be revised to match the analyses rather than vice versa, it’s a total ruse
i can see the value of a sap as a communication tool (eg consider validation, or a more junior statistician following a sap). But im not persuaded so much by the claim re prespecification, nor the claim re documentation (surely good code, heavily commented, provides better documentation of what was actually done, as do minutes from meetings and emails)
to answer the Q re pre-planned analyses going wrong, I have seen regulatory authorities give conflicting advice while the sponsor tries to design a study satisfying both, and it’s a mess


I’m in favor of more pre-specification, not less. But with a caveat: statisticians must have adequate time funded to do a first-rate job in developing the SAP in conjunction with clinical experts. This assumes that we increase the quality level; there are too many SAPs using suboptimal statistical methods such as forgetting to adjust for covariates or using models that are not likely to fit well.

To make pre-specification work, we need a revolution in how statistical models are specified. Models need to be more flexible, and statisticians need to learn that you need parameters for things you don’t know (interactions, non-normality, unequal variances, etc.).
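As a toy illustration of the “unequal variances” point (a stdlib Python sketch with invented data, not anything from this thread): the Welch statistic keeps a separate estimate for each group’s unknown variance instead of assuming them equal, which is exactly the “parameters for things you don’t know” idea in miniature.

```python
import math
import statistics

def welch_t(x, y):
    """Two-sample t statistic WITHOUT the equal-variance assumption:
    each group's unknown variance gets its own estimate."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    se2 = vx / nx + vy / ny
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# invented data with visibly different spreads
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), round(df, 3))
```

A model that also allows for non-normality or interactions generalizes the same way: add parameters for the things you would otherwise be assuming away.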


@f2harrell I could have sworn you published on one of your websites a guideline on how to develop a statistical analysis plan. I thought I had downloaded it separately from the BBR or RMS notes, but cannot find it in my collection. Did you create such a document, or am I imagining things? One of the things I recall – an extensive collection of covariates to collect for adjustment.

i think this is partly because saps and prespecification promote copy-pasting and reliance on conventional, stepwise-type thinking. The methods section of a paper is brief, and the sap could be likewise, with the time saved spent ensuring data quality and quality code. I’ve heard it said that specifying the analysis makes us think carefully about data collection and study design, and i can see the value in prespecification there (limited to the protocol, discussions with clinical colleagues), but people want saps drafted for retrospective analyses too, and i don’t see the value in that, especially when people are inclined to spend 3-4 weeks drafting a sap

i think discarding saps would be the quickest way to achieve this. Im convinced that the sap is mostly a ruse, for appearance’s sake, adherence to checklists etc. The primary analysis will be non-controversial and thus hardly needs prespecification

I’m not sure I understand. Isn’t a pre-specified statistical analysis plan absolutely essential to prevent gaming at the analytic phase of any study (RCT or observational) in an effort to obtain p<0.05? The fact that many statistical analysis plans are of poor quality doesn’t seem like a reason to discard them, but rather to improve them (i.e., with more statistical input).

I don’t know the whole history of the carvedilol approval (see original post), but it sounds like the inferential problems arose because of poor planning of the primary endpoint (exercise tolerance). When the primary endpoint failed but there was a signal of mortality benefit (one of the secondary endpoints), nobody knew how to interpret that (?)


from Mayo’s SIST: “…predesignating hypotheses by the way doesn’t preclude bias; that view is a holdover from a crude empiricism that assumes data are unproblematically ‘given’ rather than selected and interpreted”
but i would say:
1) a sap doesn’t prevent “gaming” anyway; if it did there would be zero “gaming” by industry
2) prespecifying encourages tried-and-trusted methods because you must specify before seeing the data; more sophisticated methods might require an unreasonable amount of prespecification and guessing (altho in practice maybe you see some blinded data or have past experience)
3) there are bigger problems which get less attention, eg i saw an academic trial published in nejm where the proportion of pts with missing data was something like 60-70%, but no problem: they’d prespecified multiple imputation
4) during peer review, analyses outside the sap are often requested
5) “The fact that many statistical analysis plans are of poor quality doesn’t seem like a reason to discard them” - i’d say poor quality is an inherent feature of the sap; it’s too formulaic, most research groups have a template for it … etc


That part of an SAP is in the RMS course notes and book.

Because of the awful track record of such research, prospective SAPs should be mandatory to prevent what Gelman calls the “garden of forking paths”. It is important, though, that the SAP itself spell out the reasons for which one may depart from it. If the analysis plan is changed for a reason not given in the original SAP, much suspicion should be raised. (Keep in mind that the biggest reason for changing plans right now is P > 0.05.)

Of course. But the point is that pre-specification will significantly reduce the amount of gaming.

If a baseline variable is missing 0.7 of the time and multiple imputation is used, that can be perfectly OK. That’s better than ignoring the variable entirely.
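For readers unfamiliar with the mechanics, here is the pooling step of multiple imputation (Rubin’s rules) as a stdlib sketch; the per-imputation estimates below are invented for illustration. The point is that the between-imputation variance term grows with missingness, so a 70%-missing covariate inflates the reported uncertainty honestly rather than hiding it.

```python
import statistics

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and variances via Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    w = statistics.mean(variances)            # average within-imputation variance
    b = statistics.variance(estimates)        # between-imputation variance
    t = w + (1 + 1 / m) * b                   # total variance of the pooled estimate
    return q_bar, t

# hypothetical treatment-effect estimates from m = 5 imputed datasets
est = [0.42, 0.47, 0.39, 0.45, 0.44]
var = [0.010, 0.011, 0.010, 0.012, 0.011]
effect, total_var = pool_rubin(est, var)
print(effect, total_var)
```

The more the imputations disagree (large `b`), the wider the pooled interval, which is the honest price of the missing data.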

i’ve read gelman’s paper re forking paths; he reiterated the point in reference to that playful cornell food researcher. Re clinical Qs: i don’t even know if the model will run. If we’re talking about large registry data you might confront “insufficient memory” problems, in which case i don’t even know what software i’m going to use, and there are many unanticipated data issues too, so you’re speculating in the sap about things that may or may not become relevant. i’ve seen analyses incessantly revised, with the sap trying to keep pace, much to my irritation, but never to yield a lower p-value: the database is large, you’ve got the narrow CIs, and the primary Q doesn’t change

i wish that were true. Drug companies pre specify two dozen secondary endpoints, who are we fooling?

it raises very serious Qs about trial monitoring, data quality, competence etc. If they want to pre specify anything they can pre specify quality tolerance limits, this is now in ich e6 i believe, in that particular case i’m very much in favour of prespecification, ie planning and monitoring

I agree with some of your points. I think this is mainly about the level of detail in the SAP. I want to see model structure but not very fine details.


I agree, in the context of the current FDA regulations, having something pre-specified so that there’s no fishing expedition makes sense. It’s an example of Goodhart’s law: an NHST with p-value < 0.05 has become the target the FDA is looking for, so it’s no longer a good measure. A p-value is also not a good measure of evidence, for reasons that have been explained by folks like Paul Meehl, Jacob Cohen, Sander Greenland, Frank Harrell, and countless others for going on 70 years now. I think the FDA approval process would perform better with an explicit cost-benefit analysis than with the two trial rule.

Perhaps the question is too broadly stated. I’m looking for case studies where a slavish adherence to the pre-specified analysis would have missed something important. This doesn’t have to be in the context of an FDA registration trial. These could be exploratory analyses. That is, examples of where the analysis didn’t stick to some pre-specified plan, not because the authors were chasing p < 0.05, but something in the data hit them between the eyes and they tried to understand it.

When you do a formal decision analysis you see that α = 0.05 is quite silly:
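A toy sketch of that point (all numbers invented: the prior probability the drug works, the effect size, the per-arm sample size, and the relative costs of a false approval versus a missed effective drug). Choosing the critical value of a one-sided z-test to minimize expected loss yields an optimal α that depends on those inputs and is generally not 0.05:

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_loss(z_crit, p_effective=0.2, effect=0.3, n_per_arm=200,
                  cost_false_approval=1.0, cost_missed_drug=5.0):
    """Expected loss of approving iff a one-sided two-sample z-test
    (sigma = 1) rejects at critical value z_crit. All inputs are toy values."""
    alpha = 1.0 - Phi(z_crit)                                  # type I error rate
    power = 1.0 - Phi(z_crit - effect * math.sqrt(n_per_arm / 2.0))
    return ((1.0 - p_effective) * alpha * cost_false_approval
            + p_effective * (1.0 - power) * cost_missed_drug)

# grid-search the critical value that minimizes expected loss
zs = [i / 1000.0 for i in range(500, 3001)]
best_z = min(zs, key=expected_loss)
best_alpha = 1.0 - Phi(best_z)
print(f"loss-minimizing alpha = {best_alpha:.3f}")   # about 0.077 here, not 0.05
```

Change the prior, the costs, or the sample size and the optimal α moves accordingly, which is exactly why a fixed 0.05 for every drug and every disease burden is hard to defend.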


i guess the classic example is patients who are randomised despite not meeting eligibility criteria, ie whether to include them in the analysis. stephen senn wrote about that a couple of decades ago; he said they should go in the analysis i believe, ie deviate from prespecification


yes, i’d be happy for smaller saps. The places i most want pre specification is where it’s most difficult eg safety
