The Clinical and Statistical Saga of Thrombolysis in Acute Ischemic Stroke

There is a longstanding dispute between stroke neurologists and some emergency physicians over the risk/benefit calculus for intravenous thrombolytics in treating AIS. The controversy dates back to the early thrombolysis trials of the mid-1990s and early 2000s (links to two of the trials are shown below):

Each side has different opinions about the drivers of the controversy (disclaimer: I’m neither a stroke neurologist nor an ER physician, but rather a family physician/concerned bystander):

Some emergency physicians argue that:

  • thrombolytic trials in AIS did not convincingly show efficacy;
  • the distribution of baseline features of enrolled subjects suggests that the randomization process might have been suboptimal/defective;
  • results of the trials were “fragile” and therefore unreliable;
  • we should question widespread adoption of a practice with dubious efficacy and which poses serious potential risks [i.e., intracranial hemorrhage (ICH)];
  • unless the AIS diagnosis is secure, the risk/benefit calculus for thrombolysis will be unfavourable: the patient will incur a small but real risk of ICH, without a reasonable expectation of benefit;

Stroke neurologists counter that:

  • the evidence base supporting the urgency of AIS diagnosis and treatment within the 0-3 and/or 0-4.5 hour window is robust; statistical re-analyses of the original trials corroborate this benefit;
  • knowledge of how brain tissue damage progresses over time underpins the time-sensitive nature of stroke treatment. The ideal time window for thrombolysis was not initially known; this explains why not all trials were positive. To this end, trials studying the efficacy of late thrombolysis shouldn’t be lumped with trials studying the efficacy of early thrombolysis;
  • the skepticism of (some) ER physicians, while ostensibly hinging on statistical arguments, might instead stem primarily from suboptimal confidence in diagnosing AIS rapidly and accurately (plausibly a bigger problem for ER MDs working in areas where neurology and imaging support could be suboptimal);
  • health systems should focus on addressing gaps in infrastructure/stroke team support for ER physicians;
  • the standard of care is to offer thrombolysis to eligible patients presenting with AIS. Additional trials examining the efficacy of early thrombolysis will not be forthcoming.

The statistical methods used in these trials are complex, likely beyond deep understanding by most non-statisticians. Some of the key concepts include:

  • how best to analyze ordinal outcomes;

  • how best to show treatment effects graphically;

  • the importance of identifying, prespecifying, and adjusting for important prognostic factors;

  • how to interpret baseline between-arm covariate prognostic factor imbalances. What can we infer from them (if anything) about the integrity of the randomization apparatus? What are the implications (if any) for the validity of treatment effect estimates?;

  • the pitfalls of:

    • change from baseline analyses;
    • dichotomizing ordinal outcomes;
    • “fragility” index;
    • NNT;
    • subgroup analysis/“responder” analysis;
    • evidence synthesis involving lumping of dissimilar trials.
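A toy example can make the dichotomization pitfall concrete. The mRS counts below are entirely hypothetical, chosen so that a “good outcome” (mRS 0–1) dichotomy sees two identical arms while the full ordinal distributions clearly differ:

```python
# Sketch of why dichotomizing an ordinal outcome (e.g., the modified Rankin
# Scale, mRS 0-6) can hide a real treatment effect. The counts below are
# hypothetical, chosen only to illustrate the point.

# mRS counts per arm (key = mRS score; lower = better outcome)
arm_a = {0: 20, 1: 10, 2: 10, 3: 10, 4: 10, 5: 5, 6: 5}   # "treatment"
arm_b = {0: 10, 1: 20, 2: 5, 3: 5, 4: 10, 5: 10, 6: 10}   # "control"

def good_outcome_rate(counts, cutoff=1):
    """Dichotomized analysis: proportion with mRS <= cutoff ('good outcome')."""
    total = sum(counts.values())
    return sum(n for score, n in counts.items() if score <= cutoff) / total

def win_probability(counts_a, counts_b):
    """Ordinal analysis: P(random patient in A fares better than one in B),
    counting ties as 1/2 (the Mann-Whitney probability)."""
    wins = ties = total = 0
    for sa, na in counts_a.items():
        for sb, nb in counts_b.items():
            pairs = na * nb
            total += pairs
            if sa < sb:       # lower mRS is better
                wins += pairs
            elif sa == sb:
                ties += pairs
    return (wins + 0.5 * ties) / total

# The dichotomized comparison sees no difference at all...
print(good_outcome_rate(arm_a), good_outcome_rate(arm_b))  # both 30/70

# ...while the full ordinal comparison shows arm A shifted toward
# better outcomes.
print(round(win_probability(arm_a, arm_b), 3))
```

The win probability here is the Mann-Whitney probability that underlies rank-based and proportional-odds analyses; it uses every level of the scale instead of discarding everything but a single cutpoint.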

Some criticism of AIS trials seems like it might be rooted in statistical misunderstanding. But it’s also important to ask whether the trial results might have been less contentious if slightly different statistical approaches had been used.

As-yet-unconvinced ER physicians are unlikely to change their minds about the efficacy of AIS thrombolysis, and many might question that stance. Sometimes going against the herd turns out to be the correct approach. But surely it’s implausible that stroke neurologists, many ER physicians, and statisticians across multiple continents would have reached consensus on the importance of timely thrombolysis for eligible patients without good reason? We can always speculate about potential biases among experts who have built their careers around certain treatment approaches, but arguments rooted in accusations of bias are unverifiable and therefore unscientific. Ultimately, unwarranted skepticism of efficacious treatments can be as harmful to patients as unwarranted enthusiasm for non-efficacious treatments.


For context, here are two examples of the Emergency Medicine point of view I think you’re alluding to (but not linking?):

Caveat: not in EM or neurology.


Thanks. I have seen these reviews. Many groups have turned these trials inside out over the past few decades, including the original researchers, regulatory agencies, and independent panels with experts in neurology, emergency medicine, diagnostic imaging, statistics, and epidemiology.

FDA’s original clinical review:

An independent reanalysis of the NINDS data:

Another reanalysis done in the UK:

The main cause for concern among some clinicians seems to be that a higher proportion of patients allocated to the tPA arm than to the placebo arm in the NINDS trials had presenting strokes classified as “mild” in severity. Because of this imbalance, clinicians wondered whether a higher proportion of the patients randomized to the tPA arm were “destined” to get better even without tPA. In other words, they asked whether it was the tPA itself OR a more favourable group-level prognosis at baseline in the tPA arm that explained the apparent “win” for tPA in the trial.

The links above show that MUCH ink was spilled addressing the concerns of some clinicians about the implications of the between-arm baseline imbalance in stroke severity. Unfortunately, I’d bet that very few physicians (unless they also had a PhD in statistics) would be able to make sense of the details in these statistical reviews. For most, our eyes would first glaze over and then skip to the “bottom line.” And the bottom line is that several groups of highly experienced professional statisticians and epidemiologists are confident that these trials, warts and all, support the efficacy of tPA for treatment of AIS.

I suspect that it’s the complexity of these statistical reanalyses that has perpetuated suspicion among some members of the clinical community. We don’t like, and are less likely to trust, things we don’t understand. When we see something in data that we feel, intuitively, MUST affect the results of a trial, and a group of experts then tries to reassure us, in a language we don’t really understand, that what we see is not as important as it seems, some of us will rebel.

I trust the opinions of highly experienced professional statisticians. But I do, nonetheless, have a question stuck in my craw about the design of these trials. On page 44 of the FDA review linked above, the reviewer states, under the heading “Analysis of Efficacy Endpoints with Covariates”:

“No covariates for the analysis of the efficacy endpoints were specified prospectively. Exploratory analyses were conducted by both Genentech and CBER to examine the effects that (sic) any imbalances in the randomized assignments on the dichotomized efficacy outcome results.”

If this means that important prognostic factors were not prespecified (in the design phase) for adjustment in the analysis phase of the trials, why not? And might prespecifying them have prevented subsequent concerns about the impact of the baseline stroke severity “imbalance”?

These trials have been dissected in minute detail. To an untrained bystander, the overarching lessons from this saga seem to be:

  • when designing clinical trials to test the efficacy of drug treatment protocols that will require co-operation among multiple disciplines (e.g., ER, radiology, neurology), deliberations over trial design and analysis should be transparent at every stage. Achieving widespread a priori consensus about potentially important prognostic factors that will require adjustment in the analysis phase seems key;

  • trial design should include methods that lose the least information possible, maximizing the ability to detect a treatment effect, if it exists;

  • communication with non-statistician stakeholders in a language they can understand is vital if widespread consensus on interpretation of trial results is the end goal.


In my opinion, the review by Justin Morgenstern (first10em) is careful and, importantly, grounded in the clinical experience of being the treating physician for a patient whose symptoms might (or might not) turn out to be acute ischemic stroke. I don’t think his use of the fragility index is the strongest part of the argument for being careful with an agent that has a probably small clinical effect and a considerable (and much more reproducible than the desired effect) risk of bleeding/death. Nor do I think he gets lost in the details of the reanalysis of the NINDS data (where, as you say, a LOT of ink has been spilled); rather, he takes a step back and looks at the totality of the many RCTs of thrombolytics in stroke, from the perspective of an emergency physician.

I like your concluding suggestions for future trialists!


Thanks again. I’m not keen to delve into the details of the blog posts you’ve cited. Suffice it to say that many (most?) of these arguments have been addressed by others who have come to different conclusions. I agree that the author sounds like a very thoughtful physician. I expect that he, like physicians on both sides of this debate, cares deeply about his patients, and that he capably manages many neurologic emergencies in his job as an ER physician. However, he is neither a statistician nor a stroke neurologist. This fact, on its own, does not invalidate his opinions. But given the complexity of the statistics underpinning these trials (especially the detailed statistical rebuttals to criticism from the ER physician community), I don’t believe it’s possible for those without extensive occupational experience in applied statistics to fully understand the arguments defending these trials.

You wouldn’t want me managing your trauma resuscitation; you would want an expert ER physician, guided by evidence that has been scrutinized closely by bodies of ER physicians working with expert statisticians. Similarly, I would want a stroke neurologist to help diagnose and manage my stroke, making treatment decisions guided by an evidence base that has received close scrutiny by expert statisticians.

The fact that many emergency medicine physicians support the use of tPA for AIS signals that opinions are not clearly delineated by clinical specialty. Rather, there seems to be a vocal subset of ER physicians who have done their own appraisals of the evidence and come to different conclusions than multiple international expert panels. If there is a large group of professional statisticians that has rebutted the opinions of the professional statisticians and epidemiologists on these expert panels, I’d be interested to hear what they have to say.


I agree with most of this, but I am not convinced that the statistical methods are necessarily too complex, or at least that this is the major issue with the critical reappraisal. There seems to be considerable variability in methods used in the reanalyses, and there are some practices that should be discussed such as change from baseline and dichotomizing outcomes.

The SGEM discussion of the paper with author (and stroke neurologist) Dr. Saver was illuminating.


I agree with most of this, but I am not convinced that the statistical methods are necessarily too complex, or at least that this is the major issue with the critical reappraisal.

I think this sentence is at the absolute core of this whole issue. Clinicians, whether they be stroke neurologists or emergency physicians, are not, in spite of their many years of training, professional statisticians. We would not see internecine wars fought between applied statisticians with decades of experience with the “fragility index” at the core of their arguments. I don’t believe I read the term “fragility index” anywhere in the reviews of this evidence by international panels that included professional statisticians and epidemiologists.
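For concreteness, here is what the fragility index actually computes. This is a minimal pure-Python sketch with made-up counts, not data from any stroke trial:

```python
# Minimal sketch of the "fragility index" (in the spirit of Walsh et al.):
# for a trial with a statistically significant 2x2 result, count how many
# patients in the arm with fewer events must be flipped from non-event to
# event before Fisher's exact p-value rises to >= 0.05.
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p for the 2x2 table [[a, b], [c, d]]:
    sum of probabilities of all tables with the same margins whose
    probability does not exceed that of the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    def pmf(k):
        return comb(row1, k) * comb(row2, col1 - k) / denom
    p_obs = pmf(a)
    return sum(pmf(k)
               for k in range(max(0, col1 - row2), min(row1, col1) + 1)
               if pmf(k) <= p_obs + 1e-12)

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Flip non-events to events in the arm with fewer events until p >= alpha."""
    fi = 0
    ea, eb = events_a, events_b
    while fisher_p(ea, n_a - ea, eb, n_b - eb) < alpha:
        if ea <= eb:
            ea += 1
        else:
            eb += 1
        fi += 1
    return fi

# Sanity check: Fisher's classic "lady tasting tea" table [[3,1],[1,3]]
print(fisher_p(3, 1, 1, 3))          # ≈ 0.486, far from significant

# Hypothetical trial: 18/100 events vs 33/100 events
print(fragility_index(18, 100, 33, 100))
```

The contested part is the interpretation: critics of a trial read a small index as unreliability, while critics of the index argue that it largely restates information already contained in the p-value and sample size.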

I imagine that it takes decades for applied statisticians to hone their craft enough to provide input on the design and analysis of clinical trials of this importance and complexity. Failure to recognize that our training as physicians is simply insufficient for this task has, IMO, been one of the main reasons why this topic is still being debated 27 years after results of the NINDS trials were published. Suggestion: read through the statistical sections of the panel reports linked above and let me know how much of them you can honestly say you understood thoroughly enough to be able to rebut the arguments. This was a very eye-opening exercise for me…


Just as a side note: I am not arguing that medical stats/biostatistics as a field isn’t complex and subtle, and beyond the grasp of most clinicians - of course it is. I do, however, think that the reason for there being a multiplicity of RCTs on the subject is not just due to the complexity of the analysis.

edit: and to be clear, “illuminating” above was not a judgement of the merits of the respective arguments, but illuminating what those arguments actually are.


The fact that analysis of covariance is not the primary analysis in virtually all randomized parallel-group trials is no longer defensible. Statisticians are complicit in this as many of us remain silent when protocols are being written.
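A toy simulation (hypothetical numbers, pure Python, least squares solved by hand) of why this matters: when a strongly prognostic baseline covariate is imbalanced between arms, the unadjusted between-arm difference mixes the treatment effect with the imbalance, while the covariate-adjusted estimate recovers it:

```python
# Toy ANCOVA sketch (hypothetical data): outcome depends strongly on baseline
# severity, and the treated arm happens to start out milder. The unadjusted
# between-arm difference then overstates the true treatment effect (-2.0),
# while adjusting for baseline recovers it.
import random

random.seed(1)
TRUE_EFFECT = -2.0  # treatment lowers the outcome score by 2 points

rows = []  # (treatment indicator, baseline severity, outcome)
for treat in (0, 1):
    for _ in range(200):
        # Deliberate imbalance: treated patients present with milder baselines
        baseline = random.gauss(8.0 if treat else 12.0, 2.0)
        outcome = baseline + TRUE_EFFECT * treat + random.gauss(0.0, 1.0)
        rows.append((treat, baseline, outcome))

def ols_3x3(rows):
    """Least squares for outcome ~ 1 + treat + baseline via the normal
    equations (X'X) beta = X'y, solved by Gaussian elimination."""
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for treat, baseline, y in rows:
        x = (1.0, float(treat), baseline)
        for i in range(3):
            xty[i] += x[i] * y
            for j in range(3):
                xtx[i][j] += x[i] * x[j]
    aug = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):                      # forward elimination w/ pivoting
        piv = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, 3):
            f = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= f * aug[col][c]
    beta = [0.0] * 3
    for i in (2, 1, 0):                       # back substitution
        beta[i] = (aug[i][3] - sum(aug[i][j] * beta[j]
                                   for j in range(i + 1, 3))) / aug[i][i]
    return beta  # (intercept, treatment effect, baseline slope)

treated = [y for t, _, y in rows if t]
control = [y for t, _, y in rows if not t]
unadjusted = sum(treated) / len(treated) - sum(control) / len(control)
adjusted = ols_3x3(rows)[1]

print(round(unadjusted, 2))  # far below -2: contaminated by the imbalance
print(round(adjusted, 2))    # close to -2
```

With a binary outcome, the analogous adjusted analysis would be a logistic model with treatment and baseline severity as covariates.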


Reading, for example, the abstract of the ECASS trial, it is hard to say that skepticism is unwarranted:

“More patients had a favorable outcome with alteplase than with placebo (52.4% vs. 45.2%; odds ratio, 1.34; 95% confidence interval [CI], 1.02 to 1.76; P=0.04). In the global analysis, the outcome was also improved with alteplase as compared with placebo (odds ratio, 1.28; 95% CI, 1.00 to 1.65; P<0.05). The incidence of intracranial hemorrhage was higher with alteplase than with placebo (for any intracranial hemorrhage, 27.0% vs. 17.6%; P=0.001; for symptomatic intracranial hemorrhage, 2.4% vs. 0.2%; P=0.008).”

The efficacy data from this trial are consistent with odds ratios ranging anywhere from 1.02 to 1.76; that is, with an effect barely distinguishable from null as well as a substantial one. Given the possibility of a very small benefit, skepticism may be warranted when there is a considerable risk of symptomatic intracranial hemorrhage.
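For what it’s worth, the quoted interval can be reproduced approximately from the abstract’s percentages. The per-arm sample sizes below are my assumption (418 alteplase, 403 placebo), and the event counts are back-calculated from 52.4% and 45.2%, so treat this as an illustration rather than a reanalysis:

```python
# Reconstructing the quoted odds ratio and Wald 95% CI from the abstract's
# percentages. Per-arm sample sizes (418 alteplase, 403 placebo) are an
# assumption; event counts are back-calculated from 52.4% and 45.2%.
from math import exp, log, sqrt

n_tpa, n_pbo = 418, 403
a = round(0.524 * n_tpa)   # favorable outcomes, alteplase -> 219
c = round(0.452 * n_pbo)   # favorable outcomes, placebo   -> 182
b, d = n_tpa - a, n_pbo - c

or_ = (a / b) / (c / d)
se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = exp(log(or_) - 1.96 * se_log_or)
hi = exp(log(or_) + 1.96 * se_log_or)

print(round(or_, 2), round(lo, 2), round(hi, 2))  # ≈ 1.34 (1.02 to 1.76)
```

The lower limit sitting just above 1.0 is precisely what makes both readings tenable: the data are compatible with a nearly null effect and with a substantial one.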


The question of how to make sense of “portfolios” of trials is a recurring one in many fields. A small number of “positive” trials (defined as p<0.05) floating in a sea of trials that didn’t meet this threshold can be viewed in one of two ways:

1. Skeptical view: If we conduct enough trials, some will eventually “happen” to be positive by chance. We should look at the ratio of “positive” trials to “non-positive” trials when making inferences about treatment efficacy. By this standard, critics of thrombolysis in AIS might argue that this portfolio seems unimpressive; OR

2. Nuanced view: It is overly simplistic to tally the number of “positive” versus “non-positive” trials when making inferences about treatment efficacy. There is more than one reason why a trial can fail to reach a p<0.05 threshold. One is that the treatment lacks meaningful intrinsic efficacy. Another is that the design of the trial was not yet optimized, leaving it incapable of discerning the signal of intrinsic efficacy from the “noise” generated by suboptimal design choices. Sometimes a group of trials is conducted in an effort to gradually home in on the patient group most likely to respond and to optimize drug dose/timing/treatment protocol. There is nothing nefarious about this process; it is just science.

Detecting a treatment efficacy signal is MUCH harder when the condition being studied has substantial potential to improve spontaneously. I would imagine that predicting which patients’ symptoms are destined to resolve spontaneously (i.e., TIA) versus persist in the absence of treatment, possibly resulting in lifelong disability (i.e., stroke) is one of the key challenges in optimizing the risk/benefit calculus for tPA. Similar problems with low “signal to noise” ratios plague research in other fields, including depression and chronic pain, where spontaneous fluctuation in disease/symptom severity is common.

This is known as the “vote count” method of meta-analysis. Thinking about reports in this way contradicts the premise that meta-analysis can maximize power and detect an aggregate effect even when individual studies do not (as noted in your second point).

It was mathematically proven to be biased against finding true effects by Hedges and Olkin in their classic Statistical Methods for Meta-Analysis.

A better method is p-value combination, as I’ve mentioned in other threads. These generalize to modern “fusion learning” and “confidence distribution” methods.
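As a toy illustration of p-value combination: Fisher’s method sums -2 ln p across k independent studies and refers the total to a chi-square distribution with 2k degrees of freedom. For even degrees of freedom the chi-square tail has a closed form, so this runs in pure Python; the four p-values are made up:

```python
# Fisher's method for combining p-values from k independent studies:
# X = -2 * sum(ln p_i) ~ chi-square with 2k df under the global null.
# For even df = 2k the chi-square survival function has a closed form:
#   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
from math import exp, factorial, log

def fisher_combined_p(pvalues):
    k = len(pvalues)
    x_half = -sum(log(p) for p in pvalues)     # this is x/2
    return exp(-x_half) * sum(x_half ** i / factorial(i) for i in range(k))

# Hypothetical portfolio: one "positive" trial among several that were not.
pvals = [0.04, 0.20, 0.08, 0.60]
print(round(fisher_combined_p(pvals), 3))
```

A vote count scores this hypothetical portfolio 1-for-4, yet the combined evidence crosses the conventional threshold, which is exactly the bias against true effects that Hedges and Olkin described.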

See also: