Henry Ford Observational Study on HCQ in COVID-19: a failure of basic peer review

I was going to post this directly to Twitter, but Frank Harrell suggested I post it here. And since Frank is one of the smartest people I know, I thought I’d give it a shot.

In the last 48 hours, there has been a fair amount of buzz in the lay press about a new observational study of hydroxychloroquine (HCQ) in COVID-19 (https://www.ijidonline.com/article/S1201-9712(20)30534-8/fulltext), which demonstrates a very strong association between early treatment with HCQ for hospitalized patients and a substantial (50% or greater) reduction in the risk of 28-day mortality. There have also been a few brief questions raised about some of its methods on Twitter (where I usually hang out as @djc795). So, last night, I thought I would take the time to read the whole paper and jot down my thoughts.

The TLDR version of this is “the paper has a lot of major methodologic flaws”—more than should have gotten through even a cursory peer review. Of note, there is nothing really profound here—just a bunch of fairly obvious issues. And I’m not even going to mention the completely obvious one: confounding by indication.

I know that the Henry Ford group who published this are well-meaning. In fact, they were one of the first groups to start an RCT testing prophylactic HCQ, which is still ongoing. I am hopeful that they’ll be able to complete enrollment in that trial soon so that we can add to the scientific literature on HCQ in COVID-19. In the meantime, here are my basic concerns about the paper…

  1. Immortal time bias—median time to starting HCQ was 1 day (IQR 1-2), and the KM curves show that by day 2 there was already an 8% absolute mortality benefit with HCQ, which increased to about 11% on day 3.

  2. Competing risks-- follow-up was in-hospital only but the authors failed to incorporate discharge to home as a competing risk (same issue as in the single-arm compassionate use Remdesivir trial published earlier this year in NEJM).

  3. Throwing away lots of perfectly good data. As best I can tell, the authors adjusted for 18 covariates, all of which were coded in a binary fashion including age, BMI, serum creatinine, and admission oxygen saturation. Why would anyone do this?

  4. Inclusion of multiple post-admission covariates in risk-adjustment and the propensity score—these include need for ventilator, ICU admission, and use of steroids and tocilizumab at any time during the admission. They could easily have included these as time-varying covariates. Of note, need for ventilator and ICU admission may be directly affected by the treatment.

  5. Exact matching on the propensity score in the propensity-matched analysis, which seems to have resulted in completely identical patient populations with respect to the covariates of interest. In addition to producing a truly remarkable Table 3/Love Plot, I can’t figure out a good reason why anyone would have done this at all. At a minimum, it probably results in a serious loss of power.

  6. Failure to account for date of admission or to report use of HCQ over time. Presumably, increased use of HCQ occurred later in the series—a time at which the health system had learned how to care for these patients much better.

  7. The authors argue that these results show that the benefit of HCQ is highly dependent on the timing of administration, because previous studies of severely ill patients (RECOVERY) and of very early administration (post-exposure prophylaxis in the Boulware et al NEJM paper) have both shown no benefit. If the benefit seen in the Henry Ford study is real, this would represent the narrowest therapeutic range I’ve ever seen.

There probably are others that are more subtle as well, but these were all fairly obvious. Quite frankly, any competent peer reviewer should have picked up on several of these, so a lot of my issue here is with the journal (International Journal of Infectious Diseases).
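To make point 1 concrete, here is a toy simulation with entirely made-up numbers (a constant daily death hazard and no drug effect at all), sketching how immortal time bias alone can manufacture an apparent mortality benefit when patients must survive to day 1 or 2 before they can be counted as "treated":

```python
import random

random.seed(0)

N = 20_000
DAYS = 28
DAILY_HAZARD = 0.02  # identical daily death hazard for everyone: no true drug effect

treated_deaths = treated_total = 0
untreated_deaths = untreated_total = 0

for _ in range(N):
    # whether a patient *would* get the drug is decided up front, at random,
    # but the drug is physically started on hospital day 1 or 2
    would_treat = random.random() < 0.5
    start_day = random.choice([1, 2])

    day_of_death = None
    for day in range(1, DAYS + 1):
        if random.random() < DAILY_HAZARD:
            day_of_death = day
            break

    # naive (biased) classification: patients who die before the drug is
    # started can only ever be counted in the "untreated" group
    treated = would_treat and (day_of_death is None or day_of_death > start_day)
    if treated:
        treated_total += 1
        treated_deaths += day_of_death is not None
    else:
        untreated_total += 1
        untreated_deaths += day_of_death is not None

mortality_treated = treated_deaths / treated_total
mortality_untreated = untreated_deaths / untreated_total
print(f"28-day mortality, 'treated':   {mortality_treated:.1%}")
print(f"28-day mortality, 'untreated': {mortality_untreated:.1%}")
```

With these assumed numbers, the "treated" group shows mortality a few percentage points lower than the "untreated" group, purely because early deaths are funneled into the untreated column. Note how the gap opens up in the first day or two—exactly the pattern visible in the paper's KM curves.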

  1. Yes, sure: in almost all studies of HCQ and AZT, the effect on mortality/ICU admission is noted between the first and second day (doi.org/10.1016/j.tmaid.2020.101791, doi.org/10.20944/preprints202007.0025.v1 —I can only post 2 links—).

  2. Not needed, as mortality was the outcome. 86% of all patients remained in the study, which is far more than in any clinical trial. The overall crude mortality rate was 18.1% in the entire cohort. Median inpatient LOS was 6 days (IQR: 4-10 days) and the median time to follow-up was 28.5 days (IQR 3-53).

  3. Because it is more robust in a retrospective study and doesn’t impact the analysis of the primary outcome.

  4. Why? What would be the impact of that on the result? Five days of treatment was the variable.

  5. Of course the patients are identical; since it is a retrospective study, you can do that. They matched on severity. Yes, it lacks power, but it is an interesting measure. The next step is a randomized clinical trial.

  6. It is a retrospective study in six hospitals with 2,948 COVID admissions. Presumably many things happened; that’s why a retrospective study is less robust.

  7. They initiated the treatment at the time of hospitalization (median time (IQR) from admission to receipt of hydroxychloroquine was 1 day). They don’t claim that the benefit of HCQ depends on when administration was initiated, since they don’t say when symptoms began.

Thanks- here are my thoughts…

  1. Considering all the studies you cite are observational studies with exactly the same issue, I do not find your argument about the timing of mortality benefit convincing.

  2. I think you may be right on this, but 28-day mortality (rather than in-hospital mortality) would have been a bit stronger endpoint.

  3. I’m not sure what you are talking about. The analysis that was done in this paper treats a patient who is 66 years old the same as a patient who is 86 in terms of risk-adjustment. And a patient who is 34 the same as a patient who is 64. Given what we know about the age-dependence of COVID-19 mortality, this doesn’t make any sense. Same thing for a number of other factors that undoubtedly influence prognosis (creatinine clearance, BMI, initial O2 sat).

  4. Because you don’t adjust for complications of the disease you are studying that arise after the treatment decision is made (intubation, ICU admission). And if you adjust for downstream treatments (steroids, antiviral therapy), they should be done in a time-dependent fashion.

  5. Again, I am not sure what you are talking about. Why make a propensity score if you are going to exact match on all of the covariates of interest? All it does is cut your sample size dramatically, which widens your confidence intervals.

  6. Failure to take into account the learning curve in care of COVID-19 when use of HCQ is increasing at the same time is a huge source of confounding in this study and many others of HCQ and other potential therapies (https://twitter.com/leorahorwitzmd/status/1279104167336980481?s=20). This is exactly what makes relying on observational data that emerge during a pandemic when many aspects of care are changing at the same time almost impossible to pull off.

  7. Much of the discussion of the paper centers on how treatment in this study was given “early” whereas in the RECOVERY RCT, it was given at all stages of COVID-19. Yet on the other hand, in the one RCT in which HCQ was given even earlier for post-exposure prophylaxis (https://www.nejm.org/doi/full/10.1056/NEJMoa2016638), there was no benefit either. If we accept these 2 RCTs as valid, then we are talking about “shoe-horning” a treatment into the window between when the patient becomes infected and when the disease becomes clinically significant. While I agree that there is still equipoise to test HCQ in the early stages of disease, these 2 pieces of evidence on either end of the disease spectrum make it less likely that HCQ would work in that setting.
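The issue in point 4 (adjusting for a downstream consequence of treatment) can be sketched with a toy simulation, using entirely made-up numbers. Suppose, hypothetically, that a drug works only by preventing progression to mechanical ventilation, and that death risk is driven by ventilation status. The unadjusted comparison shows the real benefit; "adjusting" for ventilation by comparing within ventilation strata erases it:

```python
import random

random.seed(1)

N = 20_000
rows = []
for _ in range(N):
    treated = random.random() < 0.5
    severity = random.random()
    # hypothetical mechanism: the drug works ONLY by preventing
    # progression to mechanical ventilation
    p_vent = max(0.0, severity - (0.25 if treated else 0.0))
    vent = random.random() < p_vent
    # death risk is driven entirely by ventilation status
    death = random.random() < (0.40 if vent else 0.05)
    rows.append((treated, vent, death))

def mortality(subset):
    return sum(d for _, _, d in subset) / len(subset)

treated_rows = [r for r in rows if r[0]]
untreated_rows = [r for r in rows if not r[0]]

# the unadjusted comparison shows the real (mediated) benefit...
print(f"marginal mortality, treated:   {mortality(treated_rows):.3f}")
print(f"marginal mortality, untreated: {mortality(untreated_rows):.3f}")

# ...but "adjusting" for ventilation (comparing within strata) erases it
for vent in (False, True):
    t = mortality([r for r in treated_rows if r[1] == vent])
    u = mortality([r for r in untreated_rows if r[1] == vent])
    print(f"vent={vent}: treated {t:.3f} vs untreated {u:.3f}")
```

In this sketch the within-stratum death rates are essentially identical, so a model that conditions on ventilation status would report no treatment effect even though one exists. The same logic is why adjusting for ICU admission, or for downstream treatments in a non-time-dependent way, can bias the estimate in either direction.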


I cannot follow the argument against the study, quoted below, in relation to the outcome specified. I see it was later retracted when challenged, but even before I saw the retraction I simply could not grasp the issue.

Please explain the original basis for this argument in relation to the specified endpoint. (Genuine interest).

“2. Competing risks-- follow-up was in-hospital only but the authors failed to incorporate discharge to home as a competing risk (same issue as in the single-arm compassionate use Remdesivir trial published earlier this year in NEJM).”


You are correct: as I wrote in my response above, I made an error in comment #2 in my original post. The point I was trying to make (which is a minor one) is that use of in-hospital mortality as an endpoint, while common in the COVID era, can be subject to some bias because of differential lengths of stay. A more conventional endpoint would be mortality assessed at a fixed timepoint (e.g., 28 days). I apologize for this confusion on my part. Since the authors used survival methods for comparing mortality, I do not think this is a major issue for this paper.
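Although the point was retracted for this particular paper, the underlying mechanics may still be worth illustrating for readers who asked. Here is a toy example with made-up follow-up times: treating discharge as ordinary censoring (taking 1 minus the Kaplan-Meier estimate for death) overstates the probability of dying in hospital compared with the cumulative incidence (Aalen-Johansen) estimate, which treats discharge as a competing risk:

```python
# toy follow-up data: (time, event) with event in {"death", "discharge", "censor"}
data = [(1, "death"), (1, "discharge"), (2, "discharge"), (2, "discharge"),
        (3, "death"), (3, "discharge"), (4, "discharge"), (5, "death"),
        (6, "discharge"), (7, "censor")]

times = sorted({t for t, _ in data})

# naive approach: treat discharge as ordinary censoring, take 1 - KM(death)
surv = 1.0
for t in times:
    at_risk = sum(1 for u, _ in data if u >= t)
    deaths = sum(1 for u, e in data if u == t and e == "death")
    surv *= 1 - deaths / at_risk
naive_death_prob = 1 - surv

# competing-risks approach: cumulative incidence of death (Aalen-Johansen),
# where the event-free survival drops on *either* death or discharge
cif = 0.0
event_free = 1.0
for t in times:
    at_risk = sum(1 for u, _ in data if u >= t)
    deaths = sum(1 for u, e in data if u == t and e == "death")
    any_event = sum(1 for u, e in data if u == t and e != "censor")
    cif += event_free * deaths / at_risk
    event_free *= 1 - any_event / at_risk

print(f"naive 1-KM death probability:  {naive_death_prob:.2f}")  # 0.50
print(f"cumulative incidence of death: {cif:.2f}")               # 0.30
```

The naive estimate implicitly assumes discharged patients remain at the same in-hospital death hazard as those still admitted, which is why it overshoots.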


FWIW, if you have not yet seen it, there is a commentary by Lee et al. on the Henry Ford paper, which was published on July 2.

Yes- I saw it after I posted this. They make several of the same points. Thanks for noting it here.



It’s a complex issue because sometimes tracheostomized patients are “discharged” from the hospital but die in the secondary facility. This “discharge” reduces ICU and hospital mortality but not true mortality.

I agree 28-day mortality is the right outcome signal, but it’s harder to acquire.

Yes, the arguments are right, but again, these are the same concerns as in any retrospective study. Nevertheless, the results are there; they are pretty consistent and in line with several other studies of HCQ. The only way to put an “end” to the discussion is a randomized clinical trial, which seems difficult to do because of the political interest. I also doubt Gilead would allow its drug to be used for the trial. There is no doubt that ventilatory management has improved in these months. Something else I find interesting about this cohort is that, contrary to what I supposed, in the matched analysis of severity they didn’t find differences in the use of any of the drugs. It is just a small proportion of the total cohort and maybe doesn’t have any statistical value, but it is a first step toward implementing a randomized trial.

Why did the WHO stop the clinical trials with HCQ? Why do they allow, based on some worse-designed studies, the use of remdesivir? Why is the USA buying all the product, and the EU allowing that kind of treatment while forbidding the use of HCQ? This is something that has never happened before in the history of medicine. We, as doctors, have been using off-label indications of many drugs to treat our patients, and never has any “authority” banned physicians for off-label use. Now the media try to demonize HCQ because of its cardiac side effects, when those are very rare and almost all events are case reports ( https://dx.doi.org/10.1177%2F2048872612471215 ).

Sad, truly sad.

I have long studied how the social behavior of scientists affects the way they speak and act as a group, particularly as it relates to statistical analysis.

Here one can contrast the studies relating to anticoagulation as a treatment for severe COVID with HCQ.

Anticoagulation has not been politicized so it is the index case where scientists present their typical dialog seeking truth. HCQ is highly politicized so it provides the example of behaviour under the influence of strong bias. I feel sorry for the Henry Ford scientists as the article cited above cites a laundry list of potential residual confounders, many of which no one would have included in the study and even questions why they even had a HCQ protocol in the first place.

My young son said it best to me. Here is his quote as I remember it.

“Dad the real problem with politics affecting science is that we cannot determine the truth.”

I want to acknowledge that it was not scientists who politicized HCQ. No one disagrees with exposing the fool. Yet the public is not expected to be disciplined in control of bias, but we are. When it comes to scientific study and discourse relating to such study, it’s our duty. The assumption that we will always do that, no matter the stakes, is why we are given such deference.

Here is the commentary which was cited previously. Note the authors in the end appear to lament that the Henry Ford study was even performed. These are the clues you are looking for when investigating systemic bias.



There are several cofactors that impact the results of a cohort; it is not realistic to dismiss a cohort just because of those confounding variables:

Immortal time bias: both groups entered the cohort under the same circumstances. The author, Dr. Lee, claims that “several” time-related covariates weren’t modeled. That’s true.
Let’s start with the basics of a pneumonia:

  1. RR.
  2. SpO2
  3. PaO2/FiO2
    So the severity of the respiratory distress was not measured, nor was it classified as mild, moderate, severe, or critical. Wrong!

Ventilatory measures:

  1. Common O2 therapy: nasal cannula, simple oxygen mask, or reservoir mask. Here the Henry Ford team fails to report the O2 flow rate (5, 10, or 15 L/min), which is an important measure.
  2. HFNC
  3. NPPV
  4. IPPV
  5. ECMO
    They don’t give information related with the ventilatory management and don’t give any value of SpO2 of the patients, neither the RR and how was the procedure depending the grade of respiratory distress. This is a major confounding because the ventilatory management is the most important measure of survival.
    Systemic variables:
    They fail to measure important predicting variables reported since SARS in 2002:
    Age, Gender, Neutrophils, PTT, Sodium, Urea, creatinine, creatinine kinase, diabetes and comorbid diseases and D-Dimer. This variables are determinant in the prognosis of the ill.
    Pharmacologic variables:
    They fail to analyse the drug interaction of the prescribed drugs and it’s impact in the outcome: Heparine, Antibiotics, Antivirals, Analgesics, diabetes drugs, hypertensive drugs, etc. Many complications should be, presumably, related to interaction between drugs and side effects. Also, the good control of glucose and blood pressure must be determinant in the success in survival.

Also they don’t explain the protocol to use HCQ, AZT, HCQ+AZT, or other drugs. It seems, by the implementations of the treatments, that the physicians did the prescription in relation of their despair and frustration than based in any scientific criteria.


Shame on you Henry Ford Team! You should learn to do a cohort first!


Getting back on track to a real question.

Dr. Lee suggests that ferritin levels should have been considered. Here is a study examining mortality in relation to biomarkers including ferritin. Anyone who has done this type of work knows that very many mortality signals known and unknown are present in severe illness.

What is the proper course? Is Lee correct? Real question.

I’m no expert in coronavirus, but I think the main point that Lee was making was that in an observational study like this one, it is important to adjust for measures of disease severity. There are several factors that could have been considered including ferritin, room air O2 sat on presentation, D-Dimer levels, troponin, etc. I’m not sure any one of them is better than the others.

Yes, this is an important point. The problem is that the number of signals of disease severity prior to study entry is often massive. These include labs, vitals, demographics, comorbidities, etc. Furthermore, it’s not just the value of a signal but its trajectory before treatment that predicts mortality. Even further, it’s not just the trajectory but the relational time series pattern (in relation to other signals) that predicts mortality.

Looking at the 50K cases of the MIMIC-III database, we can generate mortality over a 15-level range for over 100 signals, including their trajectories and relational patterns—and that’s not even considering comorbidities, the medications patients are receiving on admission, their home zip code; it goes on and on.

For example, I have never seen a pneumonia study which used ferritin. How do we know if we should? Where do you stop? Many studies use a few of these plus the SOFA score (a sum of scores from the 1990s responsive to thresholds of a few signals).

Does ordering 10 lab tests prior to entry improve mortality assessment in the study? How about 30?

For example a rapidly rising IL6 might matter but what about the time patterns of all the other cytokines in relation to the time patterns of the other lab, treatment, vitals etc.

So when these critical care trials are done, we all know that these mortality signals are there. One of the things I show students is that if you examine massive time series matrices, you find that mortality signals and their derivatives are ubiquitous. It gives a false sense of adequacy to select a few of them or to calculate SOFA or APACHE scores.

Given that the potential list is massive, a study can always be criticized for leaving some out. So when that criticism is a laundry list, it raises the question of selectively targeting a study, which was the point made earlier here. But the real question for this group is…

What is the best method to objectively determine the sufficiency of severity adjustment?


This weekend’s announcents linked below fit perfectly into this important discussion of adjustments for mortality, etc.

Does anyone have a link to the data discussed in these announcements?
Any thoughts on these announcements?



Gilead’s observational study seems to be a classic example of faulty design and faulty analysis, at least by what’s described in the press release.


“The findings show that 7.6% of patients treated with remdesivir died compared with 12.5% of patients in the analysis who did not receive remdesivir treatment.”

“it reduced the risk of death for severely sick coronavirus patients by 62% compared with standard care alone.”

Ratio’s don’t match so I’m assuming the “severly sick” is a subset.

“The new data found the most benefit for severely ill patients under age 65, people who didn’t require as much oxygen”

If they did not require as much oxygen, in what way were they severely ill?

“While outcomes varied among ethnic groups — patients were white, Black, Asian and Hispanic or Latino — and regions, the data confirmed that remdesivir isn’t as helpful for the sickest patients, Marks said.”

So this is apparently the definition of the subgroup… the severely ill, under 65, who don’t require as much oxygen, but not the sickest.

This is an interesting quote

“There are no plans to submit the data to a scientific journal.”

Look I’m just the antidomatic anti-fake science guy here. None of this, including the lack of push back, surprises me. In fact its exactly what I predict.

Here is the actual press release which has “confidence” intervals and p values (such as they are).

Can someone provide a critique here on Datamethods of this study, responsive to this press release and in comparison with the NEJM RCT? This critique could be similar to the HCQ critique which was the initial post in this thread (which was both well reasoned and informative).

@germhunter provides a tweet and the comment(s) which follow are interesting but we need more depth.

I sincerely hope the drug reduces mortality, even by a small amount. As a pulmonary and crit care doc, I have just one question…Is there sufficient evidence to support its use? Maybe the best question might be…

Would a fully informed COVID-19 infected medical statistician aggressively demand to receive remdesivir?

Sorry, that’s @germhunterMD.
Also see thread:

A basic criterion for CRITICAL illness is mechanical ventilatory support.