This editorial appeared in a veterinary journal. It advocates running experiments even when the number of subjects (dogs, cats, horses …) is low. The reasoning is that these trials could later be synthesized to produce a useful result. Isn't it unethical to recruit subjects for a trial when there is a high probability that the result will be uninterpretable? I'm thinking of things like the winner's curse and the absence-of-evidence fallacy. Often these trials will be described as hypothesis-generating or pilot studies but taken by clinicians as confirmatory. Underpowered Trials
Interesting article, thanks for sharing it. These same issues arise in human medicine all the time.
Arguably, researchers who defend the conduct of small RCTs by saying that their goal is to contribute to a future meta-analysis should provide reasons to believe that such a meta-analysis will ever be conducted. A nebulous statement that researchers in other parts of the world might share an interest in the topic wouldn’t suffice. Rather, what would be needed, to make the conduct of a small RCT more defensible and credible, is for researchers to already have made an effort at collaboration, prior to proposing their RCT. Ideally, anyone planning to conduct a small RCT on that topic would design their study in such a way that it would be “combinable” with the results of previous and future RCTs on the same topic.
Since pre-planning of a meta-analysis of RCTs can significantly increase its credibility (as pre-planning helps to ensure that apples will be meta-analyzed with apples and not oranges), and, since pre-planning requires collaboration with other research groups, why not simply collaborate from day 1 and design a larger, perhaps multi-centre RCT?
To be clear, I have never designed an RCT and this proposition might seem terribly naive to those who do design trials. But, arguably, inadequate foresight (i.e., no clear advance plan as to how a study’s results might realistically be used) has, historically, been the source of a lot of research waste. RCTs in both human and animal medicine should serve a purpose (other than burnishing a CV or dissertation), regardless of their results. If their results simply generate a shrug, something has gone very wrong.
@ESMD’s observations are spot on.
On a purely technical note, in my past I’ve worked with many animal researchers, and their statistical practices make this situation much worse:
- They want to conclude that there is evidence for the absence of an effect when p > 0.05
- They want to use p-values at all with very small sample sizes
- They avoid uncertainty intervals for key quantities of interest; the width of uncertainty intervals is a good way to show the yield of the experiment no matter what the sample size
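To make the last point concrete, here is a minimal sketch of how wide a 95% interval for a difference in means is at different sample sizes. Everything in it is an illustrative assumption rather than anything from a real study: two equal arms, a pooled SD of 1, an observed difference of 0.5 SD, and a small hard-coded table of t critical values (`T_CRIT_95`) for the three arm sizes shown.

```python
import math

# Two-sided 95% t critical values for df = 2n - 2 (two equal arms of size n).
# Hard-coded for the three illustrative arm sizes only.
T_CRIT_95 = {5: 2.306, 10: 2.101, 50: 1.984}


def ci_for_difference(diff, sd, n_per_arm):
    """95% CI for a difference in means with equal arm sizes and pooled SD."""
    se = sd * math.sqrt(2.0 / n_per_arm)        # standard error of the difference
    half_width = T_CRIT_95[n_per_arm] * se      # t-based margin of error
    return diff - half_width, diff + half_width


if __name__ == "__main__":
    for n in (5, 10, 50):
        lo, hi = ci_for_difference(diff=0.5, sd=1.0, n_per_arm=n)
        print(f"n per arm = {n:2d}: 95% CI = ({lo:.2f}, {hi:.2f})")
```

With 5 animals per arm the interval runs from roughly -1 to +2 SD units, so p > 0.05 here is compatible with both harm and a very large benefit. That is absence of evidence, not evidence of absence, and the interval width shows it directly.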
Yes, I review for veterinary journals, and all of these points are accurate. They also frequently run multiple null hypothesis tests and report the “significant” results as the headline finding. I actually think these studies might benefit from a Bayesian approach, although I am not an expert in that domain.
A Bayesian approach would not use approximations that fail in small samples, would allow experimenters to be skeptical about large effects, and would improve ethics by allowing for efficient sequential learning that avoids fixed and arbitrary sample sizes.
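As a rough illustration of the skepticism point only, here is a sketch of how a skeptical prior pulls a large estimate from a tiny study back toward zero. It deliberately uses the simplest conjugate normal-normal update, which treats the standard error as known and is therefore itself an approximation; a real analysis would model the raw data. The function name `posterior_normal`, the prior SDs, and the example estimate are all assumptions made up for the illustration.

```python
import math
from statistics import NormalDist


def posterior_normal(prior_mean, prior_sd, estimate, se):
    """Conjugate normal-normal update; returns the posterior as a NormalDist."""
    prior_prec = 1.0 / prior_sd ** 2             # precision = 1 / variance
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return NormalDist(post_mean, math.sqrt(post_var))


if __name__ == "__main__":
    # Hypothetical tiny study: observed effect of 1.0 SD with 5 animals per arm.
    estimate, se = 1.0, math.sqrt(2.0 / 5)
    for label, prior_sd in (("skeptical", 0.5), ("vague", 10.0)):
        post = posterior_normal(0.0, prior_sd, estimate, se)
        p_benefit = 1.0 - post.cdf(0.0)          # posterior P(effect > 0)
        print(f"{label:9s} prior: mean = {post.mean:.2f}, P(effect > 0) = {p_benefit:.2f}")
```

Under the skeptical prior the posterior mean is well under half of the observed estimate, while the vague prior leaves it essentially untouched. How skeptical the prior should be is a judgment call to be argued before the data are seen.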
A reminder: the whole post-hoc power fiasco arose out of attempts to justify small surgical trials.
I was told when I was at Oxford that whenever there was a null result for a surgical procedure the American surgeons would tell the Europeans it was a difference in skill not power.
Similarly in my field with stem cell research. When it doesn't work, it's because the investigator is not preparing the stem cell product correctly.
There was a lengthy thread last year on the underpowered RCT topic, although it also segued into some discussions about interim monitoring and related topics:
In the course of the discussion, in one of my replies, I referenced the following 2013 paper by Button et al:
Power failure: why small sample size undermines the reliability of neuroscience
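The effect-size inflation that comes with low power (the winner's curse mentioned at the top of the thread) is easy to see in a small simulation. This is only a sketch under assumptions I have picked for illustration: a true effect of 0.3 SD, 10 subjects per arm, normally distributed outcomes, and a pooled two-sample t-test with the critical value for 18 degrees of freedom hard-coded. None of these numbers come from the Button et al. paper.

```python
import math
import random
import statistics


def simulate_winners_curse(true_effect=0.3, n_per_arm=10, t_crit=2.101,
                           n_sims=5000, seed=1):
    """Return (empirical power, mean estimate among 'significant' trials)."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_arm)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        # Pooled variance for equal arm sizes is the average of the two variances.
        pooled_var = (statistics.variance(control) + statistics.variance(treated)) / 2
        se = math.sqrt(pooled_var * 2.0 / n_per_arm)
        if abs(diff / se) > t_crit:              # two-sided test at the 0.05 level
            significant.append(diff)
    power = len(significant) / n_sims
    mean_sig = statistics.fmean(significant) if significant else float("nan")
    return power, mean_sig


if __name__ == "__main__":
    power, mean_sig = simulate_winners_curse()
    print(f"power = {power:.2f}, mean 'significant' estimate = {mean_sig:.2f} (truth 0.30)")
```

In this setup only a small fraction of trials reach significance, and those that do overstate the true effect several-fold on average. A clinician who reads only the "positive" small trials gets a systematically exaggerated picture.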
On the skill related issues that @trumanfrancis raises above, that is a key reason why in many studies, especially for implantable devices in IDE studies, you will see that for each of the one or more study sites/PIs, there will be an initial series of “run-in” subjects of some modest number, to get over the learning curve issue that may be present and bias outcomes for early subjects.
I was wondering if it would be useful to include in a supplement a dashboard sequencing the evolution of evidence on a topic with estimates and uncertainty from studies across the timeline as well as perhaps indicators of potential bias. Then allow the reader to use the current study to see the impact of different priors on the posterior distribution.
When I started working in clinical trials there was quite a lot of the “underpowered = unethical” view. And yes, people shouldn’t do small trials that are bound to be inconclusive when they could and should do something better. But it’s not black and white. Very often we know it’s going to be hard to recruit the number that we would ideally like - as others have said, Bayesian methods are really useful here.
What should we do in rare diseases where we know we can only recruit a small number? Often not randomising and using some sort of single-arm trial design is advocated. I’m not sure about this, as it seems to me that it’s often proposing a solution for lack of precision that sacrifices accuracy: non-randomised trials are likely to be much more prone to biases, which are potentially unknown and large. Especially if they don’t do the complicated statistical work of trying to understand and correct the biases (and lots of studies don’t).
That’s a great question. My current opinion is that failing to use 1:1 randomization increases the total sample size needed in the long run, once we’re honest about (1) the inefficiencies of unbalanced randomization and (2) the fact that what matters (in the frequentist world) is not the standard error but variance + bias^2 = mean squared error of effectiveness estimates. The bias term will dominate in many situations, coming from the use of non-concurrent or non-randomized data.
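Both parts of that argument can be checked with a few lines of arithmetic. The sketch below assumes a continuous outcome with equal SD in both arms, so the variance of the difference in means is sd^2 (1/n_t + 1/n_c). The 0.5 SD bias assigned to the historical-control comparison is a purely hypothetical number chosen to show how quickly the bias^2 term takes over.

```python
def diff_variance(n_total, frac_treated, sd=1.0):
    """Variance of a difference in means when n_total is split between two arms."""
    n_t = n_total * frac_treated
    n_c = n_total * (1.0 - frac_treated)
    return sd ** 2 * (1.0 / n_t + 1.0 / n_c)


def inflation_vs_balanced(frac_treated):
    """Factor by which total N must grow to match the variance of 1:1 allocation."""
    return 1.0 / (4.0 * frac_treated * (1.0 - frac_treated))


def mse(variance, bias):
    """Mean squared error = variance + bias^2."""
    return variance + bias ** 2


if __name__ == "__main__":
    # 2:1 allocation needs 12.5% more subjects for the same precision as 1:1.
    print(f"2:1 inflation factor: {inflation_vs_balanced(2 / 3):.3f}")

    # 40 subjects randomized 1:1, no bias from the comparison itself.
    randomized = mse(diff_variance(40, 0.5), bias=0.0)
    # 40 subjects in a single arm versus a historical mean treated as fixed,
    # with a hypothetical 0.5 SD bias from non-concurrent controls.
    single_arm = mse(1.0 ** 2 / 40, bias=0.5)
    print(f"MSE randomized 1:1: {randomized:.3f}")
    print(f"MSE single arm:     {single_arm:.3f}")
```

The single-arm design has the smaller variance but, under this assumed bias, nearly three times the mean squared error. Unlike variance, that bias does not shrink as more subjects are added.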
Backstory: I’m seeing sponsors want to use historical controls or to do 2:1 randomization in rare diseases, the latter as an enticement to desperate participants who believe (usually falsely) that the new treatment will improve their condition in a big way; they want higher odds of getting the treatment.
There are better ways to accomplish the goals:
- Use 1:1 randomization (which usually lowers the total sample size) and have a more discriminating, more continuous, more frequently measured outcome variable
- Follow patients for an adequate length of time
- After that time convert the study into an open-label extension where everyone gets the new treatment, if the treatment is demonstrated in the main study phase to have some benefit
- Use a Bayesian sequential design so that we can tell patients that if they were assigned to the less effective treatment they will remain on it a minimum of time until there is evidence for it being ineffective.
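For the last bullet, here is a bare-bones sketch of what sequential Bayesian monitoring can look like. It assumes a binary outcome, independent uniform Beta(1, 1) priors on each arm's success probability, a look after every 5 subjects per arm, and stopping thresholds of 0.95 and 0.05 on the posterior probability that the new treatment is better. All of those choices, and the function names, are placeholders for illustration; a real design would use a more informative outcome and would have its operating characteristics checked by simulation.

```python
import random


def prob_treatment_better(succ_t, n_t, succ_c, n_c, draws=20000, seed=0):
    """Monte Carlo estimate of P(p_treated > p_control | data), Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_t = rng.betavariate(1 + succ_t, 1 + n_t - succ_t)
        p_c = rng.betavariate(1 + succ_c, 1 + n_c - succ_c)
        wins += p_t > p_c
    return wins / draws


def monitor(outcomes_t, outcomes_c, look_every=5, upper=0.95, lower=0.05):
    """Look at accumulating 0/1 outcomes; return (n per arm at stop, decision)."""
    n = 0
    max_n = min(len(outcomes_t), len(outcomes_c))
    for n in range(look_every, max_n + 1, look_every):
        p = prob_treatment_better(sum(outcomes_t[:n]), n, sum(outcomes_c[:n]), n)
        if p >= upper:
            return n, "new treatment looks better"
        if p <= lower:
            return n, "new treatment looks worse"
    return n, "inconclusive"


if __name__ == "__main__":
    # Hypothetical outcome streams (1 = success), 20 subjects per arm.
    treated = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
    control = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
    print(monitor(treated, control))
```

Because the posterior probability has the same interpretation at every look, the trial can stop as soon as the evidence is adequate in either direction, which is what limits the time anyone spends on the inferior arm.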