Validity of comparing two different previously published cohort studies using IPTW for matching

Hi all!
Clinician here, well aware of my statistical limitations - thought I’d get some perspective from the experts on a recent paper published in Stroke: Hematoma Expansion and Clinical Outcomes in Patients With Factor-Xa Inhibitor–Related Atraumatic Intracerebral Hemorrhage Treated Within the ANNEXA-4 Trial Versus Real-World Usual Care | Stroke

This paper sought to compare the effectiveness of two methods of reversing DOAC anticoagulation, andexanet alfa and four-factor prothrombin complex concentrate, in patients with intracranial bleeding. They did this by taking patients from the ANNEXA-4 cohort (a prospective single arm trial evaluating the efficacy of andexanet alfa published in 2019) and patients from the RETRACE-II cohort (a retrospective cohort trial evaluating the outcomes of “standard of care” (mostly 4F-PCC) in patients with ICH on DOACs published in 2017) and matching them using IPTW with the primary endpoint of change in hematoma volume.

I have two questions. One is whether this is statistically sound - taking two different cohorts of patients who received care in two different time periods and comparing them using IPTW. My understanding of IPTW is that patients in each cohort have to have had the possibility of being treated with either treatment. Andexanet was not available during RETRACE-II, so this would not be so; 4F-PCC was the only reasonable treatment option available at the time. The two patient populations were also treated in different places in different times with different standards of care (the care of ICH has evolved quite a bit over the last 5-10 years).
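To make the positivity concern concrete, here is a minimal sketch of how IPTW weights are formed. It is not the authors' code; the function name `iptw_weights` and the example propensities are hypothetical. In a pooled analysis like this one, "treatment" is the same thing as cohort membership, so the propensity score is the modeled probability of being in the andexanet cohort given baseline covariates. If any covariate (calendar period, for example) perfectly separated the cohorts, the estimated propensities would be 0 or 1 and the weights would be undefined.

```python
def iptw_weights(treated, propensity, stabilized=True):
    """Inverse probability of treatment weights.

    treated:    list of 0/1 flags (1 = received the treatment of interest)
    propensity: list of estimated P(treated = 1 | covariates)
    """
    # Positivity: every patient needs a non-zero chance of either treatment.
    if any(p <= 0.0 or p >= 1.0 for p in propensity):
        raise ValueError("positivity violated: propensity must lie strictly between 0 and 1")

    p_treated = sum(treated) / len(treated)  # marginal share treated
    weights = []
    for t, p in zip(treated, propensity):
        if t == 1:
            numerator = p_treated if stabilized else 1.0
            weights.append(numerator / p)
        else:
            numerator = (1.0 - p_treated) if stabilized else 1.0
            weights.append(numerator / (1.0 - p))
    return weights


# Hypothetical propensities for four patients (not data from the paper).
example = iptw_weights([1, 1, 0, 0], [0.8, 0.5, 0.5, 0.2])
print(example)
```

Patients whose propensity sits close to 0 or 1 get very large weights, which is the practical symptom of near-violations of positivity.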

The other question is one of research ethics - the authors took two previously published single arm cohorts with known outcomes and put them through a post-hoc statistical analysis. I’m curious whether this sets a precedent of being able to pick published cohorts which fit with a narrative then comparing them knowing what the outcome will be from the start of the project. I suppose there is the chance that the matching could have nullified the differences in outcomes known from the start, but I feel like that is a stretch.

This very well could be a completely valid research technique I’m just not familiar with, but curious to hear everyone’s perspective!



Just from reading the abstract (could not find copy of entire paper yet) it looks like this research method is a combination of historical control and meta-analysis (unless I misunderstood something from abstract).

Can you say if the authors had access to individual patient data?

The critical question: Are the patients extracted from the studies comparable? It is going to be tough to make a persuasive case (absent other background information) that there are no other differences that escaped statistical adjustment. All things considered, this method might be valuable as an exploratory tool for design of more rigorous studies, but I wouldn’t want to make any major decisions based upon it unless I had no other choice.

Here is some FDA guidance on historical controls.


Yes, the authors had access to patient level data, so a bit closer to historical controls than meta-analysis. From a clinical perspective the overall management was likely different, but you’re right that otherwise baseline variables likely could be adjusted for. Agree that major clinical decisions shouldn’t be made from these findings - ultimately we already knew an RCT is needed to address this question and in my opinion this paper just muddies the already murky waters of what the ideal reversal strategy is, but that’s a clinical opinion and not a research design opinion.

Here is a link to the paper on my personal drive: Hematoma expansion and clinical outcomes in patients with factor xa inhibitor bleeds_Stroke_2021.pdf - Google Drive


Could you also post the supplementary materials on your Google Drive? The details of the statistical methods appear to be buried there.

To me, the methods are hard to follow in the main text.

I’d be interested in your opinion about the state of the literature and clinical thinking about any difference in clinical prognosis and/or volume of hemorrhage in patients who have an ICH while taking rivaroxaban compared with having an ICH while taking apixaban (independent of being given 4F-PCC or andexanet alfa). I’m guessing there are no trial data to make a direct comparison but clinical intuition seems important. After the exclusions, 80% of patients in the RETRACE-II cohort used in the analysis were taking rivaroxaban compared with 42% in ANNEXA-4.
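To put a number on that imbalance, here is a rough sketch of the standardized mean difference (SMD) for a binary covariate, using the 80% and 42% figures quoted above. The function name `smd_binary` is my own, and this is the unweighted, pre-adjustment SMD, not a value reported in the paper.

```python
import math


def smd_binary(p_a, p_b):
    """Standardized mean difference for a binary covariate.

    p_a, p_b: proportion with the characteristic in each cohort.
    Uses the average of the two binomial variances as the pooled variance.
    """
    pooled_var = (p_a * (1.0 - p_a) + p_b * (1.0 - p_b)) / 2.0
    return (p_a - p_b) / math.sqrt(pooled_var)


# Rivaroxaban use: 80% in RETRACE-II vs 42% in ANNEXA-4 (after exclusions).
smd = smd_binary(0.80, 0.42)
print(round(smd, 2))  # 0.85
```

A commonly quoted rule of thumb treats an absolute SMD above about 0.1 as meaningful imbalance, so a value near 0.85 is large. What matters for the analysis is the SMD after weighting.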

Sure thing, here is the supplement: Hematoma expansion and clinical outcomes in patients with factor xa inhibitor bleeds_supplement_Stroke_2021.pdf - Google Drive

Clinically, I don’t think the hematoma itself would be a different size depending on the DOAC, but a greater proportion of patients taking rivaroxaban would have ICH given it achieves higher peak activity than apixaban. This is consistent with clinical trial findings and underpins a general move away from using rivaroxaban towards apixaban in recent years, which is likely why there’s the difference in what patients were taking in the two cohorts.

Took me a while to get to the supplementary materials, which are detailed and comprehensive. This kind of analysis (using “real world” observational data and/or post-hoc subgroup analysis of clinical trial data) to draw causal inferences about medications seems to be “in vogue.” Given the timelines for clinical trials, the expense of trials, and the availability of large datasets that contain clinical detail (e.g. from EHRs and Medicare), the trend to do this kind of analysis seems likely to accelerate (opinion). The main text mentioned matching on the propensity score, which could have been a big problem given the small number of patients eligible for the analysis. But there is no mention of matching in the supplemental materials and (relief) the propensity analysis appears not to have used matching.

In this case, the number of patients from the two study cohorts (clinical trial cohort ANNEXA-4 and observational cohort RETRACE-II) remaining after exclusions seems small (max of 85 from ANNEXA-4 and 97 from RETRACE-II). With exclusions for missing data, this number becomes even smaller for some of the outcomes. For the analysis of hematoma expansion, missing data on imaging reduces the number in the analytic dataset to 80 from ANNEXA-4 and 78 from RETRACE-II. The validity of causal conclusions based on the analysis depends entirely on belief in the modeling. The propensity score models were built using logistic regression and limited to a max of 7 variables. The choice of these variables seems very important.
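Two back-of-envelope checks illustrate why the small numbers matter. This is a sketch under my own assumptions: the weights below are invented for illustration and are not from the paper. The first figure is the number of patients in the smaller group per propensity-model variable; the second is Kish's effective sample size, which shows how much information is lost when a few patients carry large weights.

```python
def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Hematoma expansion analysis: 80 from ANNEXA-4, 78 from RETRACE-II,
# with at most 7 variables in the propensity model.
per_variable = min(80, 78) / 7
print(round(per_variable, 1))  # 11.1

# Hypothetical weights for an 80-patient group (illustration only).
equal_weights = [1.0] * 80
skewed_weights = [1.0] * 76 + [10.0] * 4  # four patients dominate

print(effective_sample_size(equal_weights))             # 80.0
print(round(effective_sample_size(skewed_weights), 1))  # 28.3
```

With equal weights the effective sample size equals the actual count. In the skewed example, four heavily weighted patients cut 80 patients down to roughly 28 in effective terms, so reporting the weight distribution would help readers judge how far the results lean on a handful of patients.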

I was a bit surprised that a baseline blood pressure variable (could be mean BP, diastolic BP or systolic BP but not more than one blood pressure variable) was not forced into the propensity model a priori given the large differences between the ANNEXA-4 and RETRACE-II patients at baseline. But perhaps BP does not predict the outcomes and was omitted for this reason.

This is another question for the clinician/subject matter expert. Would be interested in your assessment of whether blood pressure at “baseline” (when the ICH is first recognized) is likely to affect the outcomes studied independent of treatment.