Quality of remdesivir trials

The results of two large trials assessing remdesivir are now published: ACTT-1 (Beigel et al, NEJM) and WHO Solidarity (Pan et al, Preprint). They’re already generating a lot of discussion…

IDSA published a summary on the strengths and weaknesses of these trials, but I’d be really interested to hear any further opinions.

Limitations of ACTT-1 in the IDSA review:

  • The original primary endpoint was an ordinal outcome assessed after 14 days, but this was changed to a time-to-recovery endpoint after the researchers realized that the clinical course of COVID-19 could be longer than initially thought.
  • The trial was not powered for mortality.
  • The clinical relevance of the ordinal scale is not clear.
  • Patients generally received remdesivir around 9 days after symptom onset; this may have been too late to see a mortality benefit.

Limitations of WHO Solidarity in the IDSA review:

  • Open-label design.
  • Time from symptom onset to randomization was not reported. If there was a delay between symptom onset and presenting for care, the benefit of remdesivir (an antiviral) may have been lost.
  • Time to recovery in the remdesivir group may have been artificially extended due to a planned 10-day course.
  • Subgroup analyses were not done other than in patients who were already ventilated upon study entry.
  • The patient’s provider could choose to deviate from protocol and stop/change therapy; further details have not been shared.
  • There were variable numbers of patients in the remdesivir and interferon arms, despite the trial being randomized. This could reflect different drugs being available at different sites, but the difference is not explained.
  • An intention-to-treat analysis was done for the primary endpoint, despite patients not necessarily receiving the drug they were initially randomized to. An analysis based on the drugs patients actually received was not performed.

The last bullet point is more of an advantage than a limitation. Other than that, good work and I hope we get many contributions to this all-important topic.

Thanks! Yes, that’s quite strange; I thought ITT was the standard and a per-protocol (PP) analysis would be the suspicious one.

i can only comment on the stats. i agree with those who regret the ad hoc ordinal outcome. i commented briefly here re earlier trials: promise and pitfalls of composite endpoints in sepsis and COVID‐19 clinical trials. While writing that opinion I was aware of @f2harrell 's post: https://hbiostat.org/proj/covid19/statdesign.html#outcomes. But on the list of “attributes of good outcome measures” is “clinically interpretable”, and this was disputed for these trials. The proportional odds assumption was violated, and although i saw @Stephen say on twitter recently that violation of the assumption does not matter (tweet), in JAMA, due to the implausibility of the assumption, they did not report an estimate of the effect. clearly this is a very big problem, especially at a time when many researchers decry the overemphasis of p-values. At least for now, i am turned off by these endpoints

In a rapidly evolving crisis we learn as we go along. It’s easier to say what I think should be done than to critique.

I think the CDC puts mortality for ages 70+ at about 5%, so I think mortality is not the best endpoint.

I think reducing hospitalizations and ventilations are the best endpoints. My son the nurse tells me patients on vents don’t do well in general.

I think it’s smart to enrich the population to maybe 60 y/o and up.

If the mechanism of action is at all antiviral, then I think the best chance for clinical success is to treat early, so I think the trials that start rx within a very few days of diagnosis and onset of symptoms have the best chance to show results.

I suppose the design will be new rx on top of standard care, but treating MDs will be desperately trying to save lives, so the standard may rapidly change - which implies enrollment must be very fast.

The RECOVERY trial, using mortality as its endpoint, has been roundly praised.

Recovery is in hospitalized patients. My remarks were aimed at earlier intervention with anti-virals, where I think that class of rx has the best chance of showing a strong signal. Once the target population is hospitalized patients I have no criticism of mortality as an endpoint.

Does anyone have any thoughts on the Phase 3 SIMPLE-Severe study (e.g., the subgroup analyses in the Gilead press release dated July 10, 2020: https://www.gilead.com/news-and-press/press-room/press-releases/2020/7/gilead-presents-additional-data-on-investigational-antiviral-remdesivir-for-the-treatment-of-covid-19)? See also the October 2020 letter from the CEO (https://stories.gilead.com/articles/an-open-letter-from-our-chairman-and-ceo-oct-8) and PMC7459246 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7459246/).

Experts on Twitter have discussed how the results of these two trials can be reconciled (acknowledging that results for Solidarity are, as yet, only available in preprint form). The main questions seem to be:

  • IF remdesivir helps, HOW does it help?

Some have speculated that remdesivir might only offer benefit if given early in the disease course, during a time of rapid viral replication. Many antivirals we give for other illnesses are considered only to offer benefit (i.e., shorter disease course, milder symptoms) if given in the first few days of symptoms (e.g., antivirals for HSV and VZV, oseltamivir for influenza). If this is also the mechanism of benefit for remdesivir, then the ACTT-1 findings would suggest that there is still a sufficiently high level of viral replication ongoing many days after COVID symptom onset that remdesivir can offer benefit even if given relatively late in the disease course. Median time to start of treatment with remdesivir in ACTT-1 was many days after symptom onset. To this end, has anyone shown (quantitatively rather than clinically) more rapid viral clearance among patients treated with remdesivir vs comparator, even with a treatment start many days after symptom onset?

  • If the results of both trials are accepted at face value, can we reconcile the ACTT-1 finding that remdesivir improves time to recovery with Solidarity’s failure to identify a mortality benefit?

Plausibly, these findings could be reconciled if there were some type of physiologic “tripwire” for certain patients with COVID, after which they are more likely than other patients to fare worse, regardless of treatments they might receive. I’m not an expert in ID, but I think this is the idea underlying the discussions of “cytokine storm.” Since remdesivir was generally given many days after symptom onset for patients in ACTT-1 (and although we don’t know for sure, it’s reasonable to assume a similar lag for patients in Solidarity since it generally takes several days for patients to present to hospital after illness onset), might patients’ likelihood of responding to remdesivir have been “predestined” by the time they presented to hospital? If this were the case, then the next logical question to ask would be whether, if administered very early in the disease course, remdesivir might be able to prevent triggering of the tripwire and thereby demonstrate a mortality benefit.

Unfortunately, trying to corroborate this theory would prove difficult for several reasons:

  1. Remdesivir has already been approved in at least some countries, so the sponsor’s incentive to fund additional trials might be lower;
  2. Remdesivir is administered intravenously and is very expensive. We can’t afford to hospitalize huge numbers of people early in their disease course for the sole purpose of administering intravenous remdesivir. In the context of a clinical trial assessing the efficacy of early remdesivir administration, it might be feasible to arrange daily IV dosing through outpatient clinics. But positive results from such a trial could create a logistical nightmare. Daily outpatient IV administration isn’t likely scalable to huge numbers of people for many reasons. If we could reliably predict which patients were destined for a more fulminant disease course, then perhaps we could direct a limited drug supply toward this group - but this seems like a predictive modelling minefield (?). In an ideal world, cheap and effective oral versions of remdesivir would become available if early administration could be shown to reduce mortality.

So at the end of the day, unless someone invests in additional studies of remdesivir administration at the earliest stage of coronavirus infections AND unless it’s feasible to develop both cheap and effective oral versions of remdesivir, the drug seems destined to be used exclusively in patients sick enough to require hospitalization. And by the time most patients are sick enough to be hospitalized, it seems plausible that an unfortunate subset might already be on an inevitable trajectory toward a poor outcome, regardless of whether remdesivir is administered or not. For the larger number of hospitalized patients who have not passed some undefined “point of no return” (?maybe identified by the need for mechanical ventilation), ACTT-1 seems to suggest that remdesivir might help them to get better a bit faster (perhaps freeing up desperately-needed hospital beds).


The paper used an invalid test of proportional odds. This was demonstrated by my former PhD student Bercedis Peterson when she showed that the score test of PO is terribly anti-conservative. SAS implemented this test even though it was known to be inappropriate at the time. This was in 1990.

I’m glad you referred to @Stephen about PO possibly not mattering anyway. I showed here that it really doesn’t matter until one wants to estimate probabilities of individual event components.

Paul I’m having trouble getting my head around the feeling that the ordinal scale was not clinically interpretable. Can you please be specific in how it’s not interpretable?

I wonder if the point about clinical interpretation relates back to the following quote from the JAMA paper about unknown clinical importance for the statistically significant data:

Meaning Hospitalized patients with moderate COVID-19 randomized to a 5-day course of remdesivir had a statistically significantly better clinical status compared with those randomized to standard care at 11 days after initiation of treatment, but the difference was of uncertain clinical importance.

Source: https://jamanetwork.com/journals/jama/fullarticle/2769871#:~:text=Importance%20Remdesivir%20demonstrated%20clinical%20benefit,with%20moderate%20disease%20is%20unknown.

I always found that quote to be really interesting. I think it raises the following questions: are we trying to get statistically significant data, and/or are we trying to improve clinical outcomes?

I see no justification in the paper for that statement at all, so I am still awaiting particulars. But here is an example where the simplest outcome is difficult to interpret clinically: suppose that a therapeutic target is to get patients to “respond” to treatment using some arbitrary dichotomization that clinicians pretend to understand. Suppose that for treatment A 0.34 of patients respond and for treatment B 0.42 of patients respond. Is the difference between 0.34 and 0.42 clinically significant?
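To put concrete numbers on this hypothetical (only the 0.34 and 0.42 response proportions come from the example above; the derived summaries are just arithmetic, and which of them, if any, counts as “clinically significant” is exactly the open question):

```python
# Summaries derivable from the hypothetical response proportions above.
p_a, p_b = 0.34, 0.42  # proportion responding on treatments A and B

risk_difference = p_b - p_a                           # absolute difference
nnt = 1 / risk_difference                             # number needed to treat
odds_ratio = (p_b / (1 - p_b)) / (p_a / (1 - p_a))    # odds ratio for response

print(f"risk difference = {risk_difference:.2f}")   # 0.08
print(f"NNT             = {nnt:.1f}")               # 12.5
print(f"odds ratio      = {odds_ratio:.2f}")        # ~1.41
```

None of these numbers answers the clinical-significance question by itself, which is the point of the example.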

i can’t help but be pragmatic, ie if the clinicians are saying in a review of remdesivir trials (“Efficacy of Remdesivir in COVID-19”) that it’s difficult to translate the scale “into a clinically meaningful statement for patients, clinicians, and policy makers” then i regret the wasted cost on a trial that had little chance of producing a result that could be readily accepted/communicated and instead incites debate (eg statnews). When you have these multi-component scales it’s hard to sell, ie what is driving the result. I saw vinay prasad on twitter rebuke pfs (tweet) and i just feel, in an analogous way, these measures are susceptible to scepticism. In an old Statistics in Medicine paper (i can’t remember the author) they refer to an RCT as a gladiatorial contest with a single victor. Composite measures will tend to push us in this unfortunate direction with fixation on a single p-value: “did the drug win?” That’s my worry

edit: this is a more relevant paper where clinical people regret these ordinal scales in covid studies: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7265861/: “Our analysis shows that most trials testing new treatment options for SARS-CoV-2 include a surrogate measure which may or may not predict clinical benefit.” etc

i enjoyed reading it and should reread it, but isn’t it ironic that you quote senn, because he has been critical of that measure: “Random sampling does not take place in clinical trials and overlap measures make most sense in a sampling context” (senn 2011)?

I note that the McCreary & Angus editorial you cite contains a major misunderstanding of ordinal outcome scales. Such scales simply do not assume equal severity spacing across levels of the scale.

I’m not seeing how that context relates to ours.

These are very physician-centric views of the world IMHO, as if what matters to patients is not relevant. It also assumes that a worsening condition is not sufficiently predictive of a tendency to have clinical events. On the first point we know from the international survey of thousands of patients that they place great weight on shortness of breath.

The progression-free survival issue is a good one to bring up, but there are two differences: (1) we have a lot of data showing how reducing progression does not lead to improved survival, for many cancers/treatments; and (2) in a chronic disease requiring long-duration expensive treatment and long-term follow-up patients have different utility functions. In many cases patients have elected to sacrifice quality of life voluntarily and have put their emphasis on life extension.

To keep the good discussion going, here is a way of stating it that is a slight exaggeration: you would rather spend the time and resources to do a 7,000-patient COVID-19 clinical trial on a mortality endpoint than do a 700-patient trial on a full-spectrum ordinal endpoint. You would wait for 7,000 patients to jettison ineffective treatments instead of stopping early for futility with fewer than 700 patients on a multi-level ordinal endpoint, just because the latter takes a bit more time to interpret.

And what is so hard to interpret about an ordinal outcome? Simplifying to 5 outcome levels (at home with no shortness of breath, at home with significant shortness of breath, hospitalized, invasive ventilation, death), the interpretation of the trial can be stated in these ways simultaneously:

  • The estimated probability that a randomly chosen patient given treatment B has a better clinical outcome than a randomly chosen patient on treatment A is 0.7
  • The estimated probabilities for treatments A and B of the patient having significant shortness of breath or worse are x.xx and x.xx
  • The probabilities of the patient needing hospitalization or worse are x.xx and x.xx
  • The probabilities of needing invasive ventilation or dying are x.xx and x.xx
  • The probabilities of death are x.xx and x.xx
  • The Bayesian posterior probability that treatment B affects mortality differently than it affects shortness of breath, hospitalization, or need for a ventilator is x.xx
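Apart from the Bayesian posterior probability in the last bullet, all of these quantities can be computed directly from the two arms’ outcome distributions. A minimal sketch on the 5-level scale above, using entirely made-up counts (not from either trial):

```python
# Hypothetical 5-level ordinal outcome, 0 = best ... 4 = death. Counts are invented.
levels = ["home, no SOB", "home, significant SOB", "hospitalized",
          "invasive ventilation", "death"]
counts_a = [30, 25, 25, 12, 8]   # treatment A (n = 100)
counts_b = [45, 25, 18, 7, 5]    # treatment B (n = 100)

p_a = [c / sum(counts_a) for c in counts_a]
p_b = [c / sum(counts_b) for c in counts_b]

# Concordance probability (first bullet): P(a random B patient lands at a
# better, i.e. lower, level than a random A patient), counting ties as 1/2.
conc = sum(p_b[i] * p_a[j] for i in range(5) for j in range(5) if i < j)
conc += 0.5 * sum(p_b[i] * p_a[i] for i in range(5))

# Cumulative "level k or worse" probabilities (the remaining bullets).
for k in range(1, 5):
    print(f"P({levels[k]} or worse): A={sum(p_a[k:]):.2f}, B={sum(p_b[k:]):.2f}")

print(f"concordance P(B better than A) = {conc:.2f}")  # ~0.60 for these counts
```

Each cumulative probability answers a distinct clinical question (shortness of breath or worse, hospitalization or worse, ventilation or death, death), which is the sense in which the single ordinal analysis carries all of these interpretations at once.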

maybe i’m not reading it properly [throughout i am reading concordance prob = IEP (the term senn uses)]. When you show “there is an almost one-to-one relationship between the (anti-log of the log) odds ratio and the concordance probability” in the presence of non-PO, we should be reassured by this because we are familiar and at ease with the concordance prob? although senn’s SBR paper is more ambivalent than his original response to acion’s stat med paper, it still elicits pause. For the jama example above you’d recommend reporting the OR?

Yes, and I’d report the concordance probability P(B > A), which is a simple function of the OR even when proportional odds doesn’t hold - see here.
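The linked page gives the actual relationship; as a purely numerical illustration of why such a mapping exists, note that once a control-arm distribution is fixed, a proportional-odds OR determines the treated-arm distribution and hence the concordance probability. The distribution and ORs below are made up:

```python
# Sketch: map a proportional-odds OR to a concordance probability, given a
# hypothetical control-arm distribution over 5 ordinal levels (0 = best).
p_control = [0.30, 0.25, 0.25, 0.12, 0.08]  # invented probabilities

def po_shift(p, odds_ratio):
    """Proportional-odds shift: multiply the odds of being at level <= k
    (a better outcome) by `odds_ratio` at every cutoff k."""
    cum, total = [], 0.0
    for pk in p[:-1]:
        total += pk
        odds = total / (1 - total) * odds_ratio
        cum.append(odds / (1 + odds))
    cum.append(1.0)
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

def concordance(p_b, p_a):
    """P(B at a better, i.e. lower, level than A), ties counted 1/2."""
    c = sum(p_b[i] * p_a[j]
            for i in range(len(p_b)) for j in range(len(p_a)) if i < j)
    return c + 0.5 * sum(bi * ai for bi, ai in zip(p_b, p_a))

for or_ in (1.0, 1.5, 2.0):
    p_treated = po_shift(p_control, or_)
    print(f"OR={or_:.1f} -> concordance={concordance(p_treated, p_control):.3f}")
```

An OR of 1 gives concordance 0.5, and larger ORs push it monotonically above 0.5, which is the one-to-one behavior being discussed (the exact closed-form approximation is in the linked material, not reproduced here).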


I do not think this has been very rigorously evaluated in most cases. Very often what happens in oncology trials is that a “statistically significant” progression-free survival benefit is found whereas “p > 0.05” for overall survival benefit (keep in mind that among other things the overall survival endpoint naturally has lower number of events). This is mistakenly interpreted as “improving progression-free survival but not overall survival”.

That is the predictable outcome because the trials typically don’t have Bayesian or frequentist power for a mortality comparison. But to me the bigger issue is that only a minority of p < 0.05 for PFS translates into evidence for a mortality benefit in a later, larger trial (or one with longer follow-up).

I wish it were that simple. This paper by Kristine Broglio and Don Berry gives some context from a statistical perspective and shows that in malignancies with long post-progression survival, the overall survival endpoint may be misleading compared with progression-free survival. For clinical context on this finding see here.

My expertise is renal medullary carcinoma (RMC), a highly lethal (and unrecognized) kidney cancer that predominantly afflicts young individuals of African descent. If left untreated, RMC will kill patients within 3 months. When the first-line therapy for this disease was established a few years ago, the lengthening of progression-free survival naturally increased the overall survival to a median of 13 months. As we are developing second-line (and later line) regimens for this disease, the improvement in progression-free survival almost perfectly corresponds to an increase in overall survival. Aggressive cancers with few options like RMC tend to behave like this.

But even for the most common kidney cancer (clear cell renal cell carcinoma) we have summarized in Table 1 of this article the progression-free survival and overall survival results that have led to all the FDA approvals to date for kidney cancer (all FDA approvals for kidney cancer systemic therapies to date have been based on clear cell renal cell carcinoma). There is not a single instance where the progression-free survival showed benefit and the overall survival (with either longer follow-up or in a subsequent trial) did not.

Disease progression in oncology is an ordinal outcome (in solid tumors it comprises complete response, partial response, stable disease, progressive disease, and death) and indeed has parallels with the COVID-19 ordinal modeling. At least some of the arguments for and against COVID-19 ordinal outcomes are very similar to the arguments for and against using disease progression as an oncology outcome. In most oncology trials, progression is most commonly used as a time-to-event outcome (or erroneously dichotomized into “response” vs “no response”), but see here as an example our paper jointly modeling progression as an ordinal and time-to-event outcome in a phase I/II design setting.

Note that I am a harsh critic of the RECIST criteria most commonly used to define progression in solid malignancies, and the trial designs I am currently involved with use utility functions to take into account broader considerations like patient preferences. It does feel odd that I have to defend disease progression (and the RECIST criteria behind it) but my point is that the topic is substantially more complex than what is claimed in some corners of the Twitterverse.