Discussion on reconstructing data

Hi all,
I'm curious to hear thoughts on, not necessarily the ethics, but the appropriateness and utility of reconstructing unpublished data. I have been digging into survival analyses in clinical trials and have come across cases where KM plots and associated data (e.g. number at risk, number of events) were used to generate unpublished group data for further secondary analyses (e.g. using the R package KMSubtraction).
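
For anyone unfamiliar with the idea, here is a minimal Python sketch of why a KM curve carries enough information to back out event counts. It relies only on the standard product-limit formula S_i = S_{i-1} * (1 - d_i / n_i). The function names are hypothetical, and the inversion assumes no censoring between steps, which is a deliberate simplification; actual reconstruction tools also have to estimate censoring from the numbers-at-risk table.

```python
def km_curve(n_at_risk, events):
    """Forward Kaplan-Meier: S_i = S_{i-1} * (1 - d_i / n_i)."""
    s, out = 1.0, []
    for n, d in zip(n_at_risk, events):
        s *= 1.0 - d / n
        out.append(s)
    return out


def events_from_km(surv, n_start):
    """Invert the product-limit formula to recover events per step.

    Assumption (simplification, not how real tools work): no censoring,
    so the risk set shrinks only by the recovered events.
    """
    events, n, prev = [], n_start, 1.0
    for s in surv:
        d = round(n * (1.0 - s / prev))  # d_i = n_i * (1 - S_i / S_{i-1})
        events.append(d)
        n -= d
        prev = s
    return events


# Round trip on made-up numbers: 100 at risk, then 5, 5 and 10 events.
curve = km_curve([100, 95, 90], [5, 5, 10])
recovered = events_from_km(curve, n_start=100)
```

With exact survival values the round trip returns the original event counts; values read off a published figure are only approximate, so the recovered counts would be approximate too.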

My first thought was that these KM-based methods are an extension of meta-analysis. However, they use published data to generate unpublished group data, which places obvious limits on how any results can be interpreted. I also believe there is an argument, and a push, for publishing all group estimates and comparisons, which would remove the need to generate unpublished data.

I do think the methods for extracting unpublished group data from KM plots are interesting, but I'm curious what everyone else thinks on the matter. Thank you in advance for your thoughts and discussion!

Interesting discussion. You could question whether the group data are truly unpublished if you can recreate them from published figures/tables. In a sense, this was always possible but laborious to do by hand, and it is now made easier by technical developments.

The main question for me would be how accurately these data can be recreated. KM plots can be quite coarse, so I can imagine there is uncertainty in the recreated group estimates. Another issue is what to do with these data. I think in most cases you will lack information on the other study variables at the given time points for the reconstructed groups that are still at risk?
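
To put a rough number on the coarseness concern, here is a small Python sketch of how a reading error on the curve translates into a range of plausible event counts for a single step. The function name and the reading tolerance `eps` are hypothetical, and the numbers in the example are invented for illustration.

```python
def event_bounds(n_at_risk, s_prev, s_read, eps):
    """Range of event counts consistent with a survival value read as
    s_read +/- eps, given the previous survival value s_prev.

    Uses d = n * (1 - S / S_prev); the curve cannot rise above s_prev
    or fall below 0, so the readings are clamped to that interval.
    """
    s_hi = min(s_prev, s_read + eps)   # highest plausible survival
    s_lo = max(0.0, s_read - eps)      # lowest plausible survival
    fewest = n_at_risk * (1.0 - s_hi / s_prev)
    most = n_at_risk * (1.0 - s_lo / s_prev)
    return fewest, most


# 200 at risk, curve read as dropping from 1.00 to 0.90, give or take 0.01.
fewest, most = event_bounds(200, s_prev=1.0, s_read=0.90, eps=0.01)
```

In this made-up case a reading error of one percentage point leaves anywhere from about 18 to about 22 events compatible with the plot, and that is before any uncertainty about censoring is taken into account.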


I agree with what @scboone said, particularly in the quote above. I don’t think you would be able to do anything with this “unpublished group data” (@bbulen do you mean summary statistics when you say this?) aside from confirming a paper’s results. So I would ask: what is your goal in reconstructing the data?

In general I am in favor of publishing/posting all raw data (perhaps with some limitations if I were to consider it more), which would of course require infrastructure to support it, tools for authors to easily submit the raw data, and so on. This would make it easier for others to confirm published results, do exploratory/secondary analyses, etc. And you wouldn’t have the issue of trying to reconstruct group data from plots/tables/summaries, which will necessarily (I think) be “noisy” in some ways (on top of whatever “noise” is going to be present in any dataset).

My take is that if someone has included an empirical distribution in their paper, they have already given you the raw data. There is a caveat, though: Kaplan-Meier estimates should not be included in papers for a different reason, namely that they don’t allow for covariate adjustment. So K-M curves are not full-information representations of the data, since they ignore covariates. The fact that primary analyses are still not covariate adjusted is irksome.

I recognize that you are not interested in the ethics per se of generating group data from the KM plots, but I nonetheless found it to be an interesting ethical question. The ASA Ethical Guidelines suggest that divulging where the data come from is important, as is stating their limitations, with precision being the concern in this case. Even though the data would be “unpublished”, you should also consider the welfare of the subjects and the stakeholders that may be affected by reconstructing the group data. For example, how would/should the results be interpreted if the KM curves contradict the “unpublished” data?