Ethics: Publishing Simulation Research without the Data Owner's Consent

I teach statistical collaboration. As part of the students’ ethics training, I posed the following on their discussion board:

While working on a project with Janet, a PI, John, a biostatistician, discovers that her data have unique characteristics that will lead to a groundbreaking methodological paper that will impact the way certain RCTs are analyzed and significantly decrease the associated cost. Unfortunately, Janet does not want John to use her data for this purpose. After multiple conversations with Janet to no avail, John becomes frustrated. Determined not to allow this opportunity to pass, he simulates Janet’s data and writes a paper that is published in a high-impact journal.

Comment on the ethical concerns in this scenario. Some possible questions to ponder:

  • Does it matter if Janet’s research has been published already?
  • Are there concerns about whether John, “having the data in hand,” will truly simulate data different enough from the real clinical trial data?
  • How would you have approached Janet?
  • Are there professionalism issues involved?
  • Which ASA ethical guidelines are violated or are questionable?
  • If John based the simulations on summaries from the data, is this OK?
  • What is John’s responsibility to the research and statistical community?

What is the question you are asking of us?

What are your thoughts about the scenario? Was John right? How should he have handled the matter? Etc.

I’m not sure this is really within the intended scope of DataMethods, but I’ll bite anyway. FYI, I have no training in statistical ethics that I recall, other than “you can’t claim to have a scientific mindset, and then lie with statistics”, and only a modest exposure to research ethics.

The scenario description is unclear on how John “simulates Janet’s data.” I’m going to assume that he ran (a copy of) the actual data through some process that “fuzzed it.” (As opposed to creating totally manufactured data, based on his memory of the key aspects of the data, which is still problematic.) And then he used the fuzzed-up version.
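
To make the distinction between those two readings concrete, here is a minimal sketch in Python. Everything in it is my own toy illustration, not anything from the scenario: the `real` values are made up, and `fuzz_records` and `simulate_from_summaries` are hypothetical helper names. It assumes a single numeric outcome and Gaussian noise, which is far simpler than real trial data.

```python
import random
import statistics


def fuzz_records(values, noise_sd, rng):
    """Perturb each real record with Gaussian noise (a row-level derivative)."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]


def simulate_from_summaries(mean, sd, n, rng):
    """Draw fresh values using only aggregate summaries; no real rows are read."""
    return [rng.gauss(mean, sd) for _ in range(n)]


rng = random.Random(0)

# Hypothetical toy outcome values standing in for the "real" data.
real = [4.1, 5.3, 6.0, 4.8, 5.5, 7.2, 3.9, 6.4]

# Reading 1: fuzz the actual records.
fuzzed = fuzz_records(real, noise_sd=0.1, rng=rng)

# Reading 2: manufacture data from remembered summaries (mean, SD, sample size).
simulated = simulate_from_summaries(
    statistics.mean(real), statistics.stdev(real), len(real), rng
)

# Each fuzzed value stays close to the specific real record it came from.
max_shift = max(abs(f - r) for f, r in zip(fuzzed, real))
```

In the first reading, every output row is traceable to one input row, which is why I treat it as a use of the actual data. In the second, only the mean, SD, and sample size are carried over, though those numbers still came from Janet's data.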

The PI has control of the data, and was unwilling to authorize this use. That makes this, IMO, a clear case of misappropriation.

That’s a violation of at least the expectations of a collaborative research team/effort. If there’s a formal non-disclosure agreement, then John has violated that contractual agreement, and that agreement likely specifies remedies and/or penalties.

But wait, it gets worse.

If this is patient data originally generated in the normal course of patient care and business, then John’s actions would also likely be a HIPAA violation, and should be reported as a probable criminal violation to internal institutional compliance officers. (I’m not sure offhand if a person with knowledge of such a situation is also personally required to directly refer the matter externally to federal officials … but if you or I had such knowledge, it would be wise to check the regulations.)

And if this is patient data, and the usage is unauthorized research, then John’s actions should also be referred to the Institutional Review Board, which would follow its standard procedures for deciding how to censure John. His actions will have (arguably) risked the institution’s reputation, created questions of lack of institutional control, and possibly impacted the ability of its entire affiliated research community to get grants.

The IRB has strong motivations to swing a Big Stick on violators, as a lesson & deterrent to others.

But you asked, in part, what John could have done differently.

He could have:

  • made a more persuasive case to the PI about the implications and value of what he wanted to do;

  • offered to have her co-author his paper;

  • tried to get the PI’s peers or dept’l chair to support his cause;

  • initiated his own project dedicated to his research question … possibly funded, possibly not … and acquired new data to support the research;

  • been more patient;

  • recognized that his lack of patience/control/ethics was a serious problem, and changed careers before treading down the slippery slope to cause such embarrassment and heart-ache to his parents. Perhaps to a gig-economy driving job, or as a cook in an all-night waffle emporium.

Wanting to be first to publish some novel research … even important research … is pretty slim justification for such a violation of trust.

(Edit: I’m in the USA. Federal HIPAA regulations here protect patient data privacy, and define allowed usages. Other locations may or may not have equivalent legal protections.)

If the statistician doesn’t need the actual data to write his methodological paper, how does it have anything to do with the clinical trial in which he collaborates with the PI?


I think this is going overboard. The simulated data are very unlikely to allow re-identification.



There’s an obscure role called (+/-) a certified HIPAA Data Broker. Those people take HIPAA data and hand de-identified data to authorized researchers, and serve as a firewall. Because they have access to, and knowledge of, the pre-safed data, they are forbidden from participating in, or advising, the researcher/s. (Edit Add:) Which is to say, there’s a sanctioned process for getting de-identified data (or there should be.)

Here, John did the data sterilization himself, without prior authorization from the IRB. Even if he uses de-identified data for the research itself, I think he’s at HIPAA jeopardy for the DIY sterilization process.


Something similar happened to me with a critical reanalysis of clinical trials from Kaplan-Meier curve reconstructions. A person from the pharmaceutical industry found out and kindly explained to me that if I published those reconstructed datasets, I would be committing a violation of law.
In the end, to avoid problems, I decided to submit the article anyway, but to make the database available on request, instead of posting it directly as attachments.
Do you think it is unethical to publish the reconstruction of a database from the Kaplan-Meier curve? For me it is perfectly ethical, because the already-published KM figure contains all the information, so you are not really revealing anything that is not already known.

If digitized points from the published K-M curve represent real data, then the data have already been published and you are OK anyway. So I completely agree with you. You’re in an even better position than you think because KM points are not raw data but are summary statistics.
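
As a rough illustration of why the points on a K-M curve are summaries rather than raw records, here is a minimal sketch of the product-limit estimate in plain Python. The `times` and `events` values are invented toy data, and `kaplan_meier` is a hypothetical helper written for this post, not a library function.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one event.

    times  : follow-up time for each patient
    events : 1 if the event was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


# Invented toy data: six patients, two of them censored.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Here six patient records collapse into four steps, and the censored patients never appear as steps of their own. A digitized figure gives you the steps, not the rows that produced them.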

Digitizing curves is fun, because you can play around wondering what would have happened if… if a Bayesian model had been applied instead of a frequentist one, if a cure model had been applied, if a parametric, dynamic, Royston-Parmar, etc. model had been applied. The nice thing would be to publish them so that others can quickly access the data, and perform their own analysis.
