Stability of results from small RCTs

This is a really terrific question!

Disclosure: I am a trained frequentist clinical trial statistician whose knowledge of Bayesian statistical approaches is presently “Know just enough about Bayesian approaches to read a paper and explain the results to a layperson” - not at all an expert.

Just to briefly recap: the issue is that traditional frequentist null-hypothesis-testing boils the result down to “significant” or “not significant” based on how likely it was to observe the trial’s data if there was no treatment effect, while Bayesian approaches create a posterior probability distribution of the estimated treatment effect (though the pragmatist may point out that in decision making, this is still often going to be reduced to a yes/no decision).

Caveat before I answer: “it depends” and the comments below probably vary somewhat depending on the specific clinical situation, sponsors of the trials, financial incentives in play, etc. With that said, I’ll venture a few comments:

Re: design-related flaws, it is certainly possible that some of the smaller trials which first led to the excitement have more of these problems than larger trials that follow. This of course need not be an intrinsic feature of small trials, but may reflect that as the stakes are raised, a greater degree of rigor and regulatory zeal is applied, more experienced investigators get involved, and the trial design is more carefully scrutinized. For one example, see this thread:

Briefly, the authors performed a crossover trial in which OSA patients received one night on the drug and one night on the placebo, then did a subgroup analysis restricted to patients with the poorest results on the placebo night (using the justification that only these patients met the criteria for OSA on their ‘placebo’ night). The authors proceeded to conclude that the drug was more effective (“greatly reduces” appears in the title!) for patients with more severe OSA, though my thread illustrates that regression to the mean could explain part or all of these results (they should have used a pre-study measurement of OSA severity if they wanted to do subgroup analyses by severity, not the placebo night itself as the assessment of severity). I would not be surprised at all if a more rigorous subsequent study (which will also probably be larger) shows less benefit than this result, because the analysis that they used in their paper inherently overestimates the treatment benefit by design. (Yes, I wrote a cranky letter about this, and no, it didn’t seem to make any impression on the authors.)
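To see how big this artifact can be, here is a rough R sketch (all numbers and variable names invented; not the study's data) in which the drug has the same true benefit for every patient, yet defining "severe" by the placebo night alone makes the apparent benefit in that subgroup look larger, purely through regression to the mean:

```r
# Illustration (hypothetical numbers): regression to the mean in a crossover
# trial when the subgroup is defined by the placebo-night measurement itself.
set.seed(1)
n        <- 200
true_ahi <- rnorm(n, mean = 30, sd = 10)          # each patient's underlying severity
placebo  <- true_ahi + rnorm(n, sd = 8)           # placebo night = truth + night-to-night noise
drug     <- true_ahi - 5 + rnorm(n, sd = 8)       # constant true benefit of 5 events/hr for everyone

overall  <- mean(placebo - drug)                  # unbiased estimate of the true effect (~5)
severe   <- placebo > median(placebo)             # "severe" defined by the placebo night itself
subgroup <- mean(placebo[severe] - drug[severe])  # inflated: the noise that pushed placebo up regresses back

c(overall = overall, subgroup_defined_by_placebo = subgroup)
```

The subgroup estimate is inflated because the night-to-night noise that pushed a patient's placebo measurement up is, on average, absent on the drug night.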

So the answer is Yes, with caveats: sometimes design flaws in early-stage RCTs (which also tend to be on the smaller side) are better addressed in subsequent larger trials with more oversight (and/or more experienced investigators), and this probably explains some of the problem you describe here.

Re: publication bias, I generally agree that this is a greater issue with studies other than randomized trials, though again it may vary a bit depending on some other specifics. Meta-researchers have found a disturbingly high number of clinical trials with no reported results, but (at least from my personal experience) I think most randomized trials ultimately will find a home in the published literature. It is my personal belief that all trials should be published - some folks have an issue with the idea that even bad trials should be published because it rewards the authors by giving them a publication, but IMO the results of the science should still be made public, though the journal editors/reviewers ought to point out all issues and make sure they are adequately discussed.

Re: the perils of an “all or nothing” frequentist approach, you are certainly correct that there are perils to attaching too much confidence to the findings from any single trial. Whether further research should be performed, or whether the practice being tested should be adopted, depends on the specifics of the clinical situation and on regulatory considerations. For example, a drug company may decide that they need a “go” or “no go” decision on whether to pursue something further based on the results of their trial. In an ideal world, the trial would be designed with a more flexible sample size to reach a more conclusive result, but sponsors only have so much money, and there are only so many patients and so much time that will be invested in studying certain agents before they are either considered market-ready or otherwise abandoned.

I admit that despite my enthusiasm for the potential of Bayesian approaches, I feel some degree of caution at what we have started to see a bit in medicine: the idea that any “negative” trial can just be reanalyzed using a Bayesian approach and subsequently declared a positive-ish trial.

2 Likes

In drug development there’s a large disconnect between phase II and phase III results (or small vs large trials). But phase II will use e.g. a biomarker and phase III will have a more definitive outcome, e.g. mortality. Some have suggested using a composite endpoint in phase II that incorporates mortality to make phase II results better predict phase III results. Thus it’s mostly to do with size and outcome and lack of a control group (at least in drug development).

I worked on a phase II of alfimeprase for arterial occlusion. When discussing the phase III I suggested a placebo control; my colleagues were aghast because the phase II results looked so good. One of them even said to me, “Would you randomise your grandmother to that trial?” I tried to argue that it was unethical not to obtain a result that was unequivocal. The phase III went ahead (I wasn’t involved; I’d left the company). The drug performed so poorly they had to terminate the study early, and the drug died along with the company (Nuvelo). https://www.ncbi.nlm.nih.gov/pubmed/18457471

A small, uncontrolled study on an intermediate endpoint… I’d promote scepticism.

1 Like

This is an excellent point, and one that I thought about including in my list of the things on which “it depends” for this post. I assume that the OP is referring not only to drug development, where I hope the role of Phase 2 trials versus Phase 3 trials is well understood (although it seems that increasingly people advocate for approving drugs based on surrogates and smaller trials). Rather, I think the OP was referring more broadly to trials of medical practices, where sometimes the practice can be changed based on a single small trial without a formal regulatory approval (this can involve drugs, of course, if they are already widely available rather than experimental). Since the OP specifically mentioned early positive trials that led to changes in practice which later had to be reversed, I assumed that she was referring not to drug approvals - since the practice effectively cannot change until the drug is approved - but to things that may be changed without a large, definitive Phase III trial.

1 Like

Thanks very much for responding. Please excuse these questions if they are too basic for this platform; I’m an MD without a degree in stats or epi.

Here’s an example that might better illustrate my confusion. Several small RCTs from the 1980s (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1671862/pdf/bmj00157-0017.pdf) suggested a mortality benefit from magnesium in treatment of acute MI. Some clinicians apparently were ready to start implementing this therapy widely, while others advised waiting until results of a larger RCT (ISIS-4) became available. Ultimately, the mortality benefit suggested by the small RCTs (which only accrued tens of deaths) was not supported by the ISIS-4 trial. This 1997 letter refers to lessons learned from the magnesium/MI trials:

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(97)26004-6/fulltext

The authors (prominent statisticians) note:

“…In addition, those who comment on the magnesium trials should take more seriously the extent to which chance can produce misleading results in small trials, and the extent to which the phenomenon of regression to the mean can produce apparently striking discrepancies between the hypothesis-generating results from the meta-analysis of early small trials that helped engender LIMIT-2 and ISIS-4, and the results of the larger more careful subsequent tests, of the magnesium hypothesis.”

So here, the findings from the early, small RCTs were attributed to “chance.” The word “chance” here is not defined. Are the authors saying that the small early RCTs might have, “by chance,” enrolled a sample of patients in their trials who were not truly representative of the acute MI population in general?
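For what it's worth, "chance" need not mean a non-representative sample. A quick simulation (invented event rates and trial size, chosen only so that each trial accrues a few dozen deaths) shows how wildly the estimated risk ratio can swing across small trials of a treatment with no effect at all:

```r
# Hypothetical illustration: many small trials, ~15 deaths expected per arm, no true effect.
set.seed(2)
n_per_arm <- 150
p_death   <- 0.10                       # same true mortality risk in both arms
sim_rr <- replicate(10000, {
  d_ctrl <- rbinom(1, n_per_arm, p_death)
  d_trt  <- rbinom(1, n_per_arm, p_death)
  (d_trt / n_per_arm) / (d_ctrl / n_per_arm)   # estimated risk ratio in this trial
})
# Spread of estimated risk ratios arising purely from sampling variability
quantile(sim_rr, c(0.025, 0.25, 0.5, 0.75, 0.975))
# Proportion of truly null trials that "show" at least a 33% apparent risk reduction
mean(sim_rr < 0.67)
```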

But then I read this 2017 publication, in which other authors seem to argue that the unreliability of small RCTs is attributable to “bias”:

https://www.sciencedirect.com/science/article/abs/pii/S0895435616303444

They seem to be saying that randomizing only small numbers of trial subjects can result in baseline covariate imbalances which can bias the study’s result, and that this is why small RCTs are unreliable. But my understanding from this forum (see Dr. Harrell’s discussion on covariate imbalance in RCTs) is that, for the purpose of valid statistical inference, it is sufficient to know that the tendency was for baseline covariates to be balanced (i.e., that the randomization process itself was mechanically sound), not whether some arbitrary group of covariates that we chose to examine actually appeared to be balanced following randomization. So I guess I don’t understand how any particular distribution of covariates could bias a study’s result (i.e., move its point estimate), provided that the covariate distribution arose from a mechanically sound randomization process.
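One way to look at the distinction (a sketch only, with arbitrary numbers and a deliberately strong hypothetical prognostic covariate): in any single small randomized trial, chance imbalance in a prognostic factor can push the unadjusted estimate around, but across repeated randomizations the estimate is centred on the true effect, so the imbalance shows up as random error rather than systematic bias.

```r
# Sketch: chance covariate imbalance in small RCTs adds noise, not systematic bias.
set.seed(3)
n_per_arm   <- 15                        # a deliberately small trial
true_effect <- 5
one_trial <- function() {
  x_t <- rnorm(n_per_arm)                # strong prognostic covariate, treated arm
  x_c <- rnorm(n_per_arm)                # same covariate, control arm
  y_t <- 10 * x_t + true_effect + rnorm(n_per_arm)
  y_c <- 10 * x_c               + rnorm(n_per_arm)
  c(imbalance = mean(x_t) - mean(x_c),   # chance covariate imbalance in this trial
    estimate  = mean(y_t) - mean(y_c))   # unadjusted treatment effect estimate
}
sims <- t(replicate(20000, one_trial()))
colMeans(sims)       # imbalance averages ~0; estimate averages ~true_effect (no bias)
apply(sims, 2, sd)   # but single-trial estimates vary a lot (imbalance -> imprecision)
```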

So, to slightly rephrase my initial question: Is the unreliability of small RCTs attributable primarily to chance “non-representativeness” of small groups of patients (i.e., poor external validity) or instead to methodologic issues with small trials which lead to poor internal validity?

1 Like

I remember the magnesium studies, and how the meta-analysis of all the small studies was contradicted by the 2 large RCTs. I got the data from Matthias Egger and analysed it for my dissertation…

Regarding bias, I think it is more to do with the publication bias noted by @ADAlthousePhD. This paper does a decent job of explaining it, i.e. the winner’s curse and the inflated effect.

Edit: regarding covariate imbalance, I should have referred to Senn and this blog post by @zad: “Critics of RCTs will argue that because there’s also always the possibility of there being an imbalance of known or unknown covariates between groups, RCTs cannot make proper causal inferences, especially small RCTs that are “unable to distribute confounders effectively.” Unfortunately, there are several problems with these beliefs and approaches.” https://lesslikely.com/statistics/equal-covariates/

1 Like

Re “winner’s curse:” I’ve read about this. I sort of assumed that it primarily applied to observational studies (multiple researchers around the world all studying the same question, generating a “universe” of early studies on a topic with a wide range of point estimates; the only subset of these studies that will get accepted for publication are those that achieve p<0.05, which will, for a small study, only occur in situations where the obtained effect size is big; subsequent larger studies usually fail to replicate the finding…). I wasn’t sure how often this phenomenon would apply to RCTs, since they are harder to run and we wouldn’t expect huge numbers of researchers to all be running RCTs on the same question(?)
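The same selection argument can be simulated directly for small two-arm RCTs (a rough sketch with an arbitrary modest true effect and per-arm size; the helper name one_trial is my own): among underpowered trials, the subset that happens to reach p < 0.05 necessarily shows inflated effect estimates, whether or not many groups are studying the same question.

```r
# Sketch: winner's curse in small two-arm RCTs with a modest true effect.
set.seed(4)
n_per_arm <- 25
true_diff <- 3                                 # modest true effect, outcome SD = 15
one_trial <- function() {
  x <- rnorm(n_per_arm, 100, 15)               # control arm
  y <- rnorm(n_per_arm, 100 + true_diff, 15)   # treated arm
  c(estimate = mean(y) - mean(x),
    p        = t.test(y, x)$p.value)
}
sims <- t(replicate(20000, one_trial()))
mean(sims[, "p"] < 0.05)                       # power of each small trial (low)
mean(sims[, "estimate"])                       # all trials: close to true_diff
mean(sims[, "estimate"][sims[, "p"] < 0.05])   # "significant" trials only: clearly inflated
```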

2 Likes

I think it applies to RCTs, maybe not to the same degree. Someone just tweeted this: “Treatment effects in randomised controlled phase II trials were greater than those in matched phase III trials. Caution must be taken when interpreting promising results from randomised controlled phase II trials.” EJC current issue

1 Like

Thanks for the links and your input. The frequent “disconnect” between results of Phase II and Phase III trials in drug development certainly supports viewing small RCTs with caution. But after reading many articles on this topic (most of which go over my head), I still can’t tell whether there’s a consensus among statisticians on why results of small RCTs are unreliable (though I’ll admit that I find it difficult in general to tell when statisticians are agreeing with each other :))…I guess as a consumer of medical literature, I’ll remain content with the “small RCT=unreliable” rule of thumb going forward.

There are a couple of things you may want to consider in this regard. First, by saying that “small RCTs are unreliable” we are generalizing from “unreliable small RCTs” to “all small RCTs”. This generalization ignores that some small RCTs could be “reliable”. Whether or not a small trial is “unreliable” depends on the specific design characteristics of the RCT. Of course, small trials are more likely to be “unreliable” than large trials, but this does not mean all small trials are “unreliable”. This is akin to stating that patients with high LDL-cholesterol are at high risk of myocardial infarction (“unreliable”), while keeping in mind that not all patients with high LDL-C will develop an MI. To deem a small trial “unreliable” you have to check how it was designed and conducted, and what its findings were. Suppose X is the only prognostic factor for outcome Y, and that you conduct a small trial with random assignment of the treatment (T) stratified on X. If T has a large effect on Y, then the small trial may be “reliable”.

The second thing we should take into account is: what do we mean by “unreliable”? Do we mean biased, imprecise, or both? Findings from small trials are less precise than those from large trials because their sample size is smaller. This lack of precision is just a consequence of random error. However, if the findings suggest a null effect of the treatment is unlikely, then the fact that the trial is small should not be a reason for great concern. Small trials are also more likely to be biased than large trials, because assigning a large number of individuals at random will make exchangeability more likely than assigning a small number of individuals. But there are ways to improve exchangeability in small trials, such as stratified randomization (example above).

Briefly, findings from small RCTs are less precise (statistically) than those from large RCTs (assuming they have similar objectives and designs), and are also more likely to be explained by non-exchangeability of the treatment groups. However, the merits of each specific RCT, large or small, should be judged on how the RCT was designed and conducted.
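For readers who want the stratified-randomization idea above made concrete, here is one minimal way it could be implemented (a sketch only; the helper assign_stratified, the block size of 4, and the binary factor are my own inventions): randomize in small permuted blocks within each level of the prognostic factor X, which forces near-perfect balance on X even in a small trial.

```r
# Sketch: permuted-block randomization within strata of a prognostic factor X.
set.seed(5)
assign_stratified <- function(x, block_size = 4) {
  arm <- character(length(x))
  for (lev in unique(x)) {
    idx      <- which(x == lev)
    n_blocks <- ceiling(length(idx) / block_size)
    # each block is a random permutation of equal numbers of T and C
    alloc    <- unlist(lapply(seq_len(n_blocks),
                  function(b) sample(rep(c("T", "C"), block_size / 2))))
    arm[idx] <- alloc[seq_along(idx)]
  }
  arm
}
x   <- sample(c("X_high", "X_low"), 30, replace = TRUE)  # prognostic factor for 30 patients
arm <- assign_stratified(x)
table(x, arm)   # treatment groups are (near) balanced on X within every stratum
```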

I agree with being cautious about the use of Bayesian methods as a way to turn “negative” trials into “positive” trials. On the other hand, this should be a concern only if we restrict ourselves to interpreting the findings as an answer to a yes-or-no question. This “dichotomization” of reality has been an undesirable consequence of our over-reliance on statistical tests. Yes, we need to make pragmatic decisions regarding our patients: should we recommend treatment T to our patients? But this is different from assuming there is no uncertainty in the knowledge that supports our recommendation. Bayesian analysis helps us better incorporate existing knowledge and assess the uncertainty in our knowledge. Findings from small RCTs could be improved by the incorporation of existing knowledge from other RCTs and observational studies (regardless of whether they are small or large).

Suppose a novel antihypertensive drug (D) has been developed, tested in an RCT (large or small), and found to decrease the risk of stroke 30-fold. Our first reaction would be: does current knowledge on hypertension and stroke support such a large effect of drug D? We will, consciously or unconsciously, incorporate our knowledge into the interpretation of the 30-fold reduction in risk. A Bayesian analysis makes this weighting of current findings and existing knowledge explicit. Small studies, including RCTs, could benefit from Bayesian analysis. However, Bayesian analysis should be used both when findings are “negative” and when they are “positive”. Thus, “positive” RCTs may become “negative” and vice versa. One advantage of Bayesian analysis is that predictions based on this approach are more accurate than those based on frequentist analysis. In other words, predictions about the true effect of the treatment will be more accurate if they are based on a Bayesian than on a frequentist analysis. Thus, a Bayesian analysis could give us a better idea of whether further studies are needed to reach a conclusion.
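As a toy version of that explicit weighting (all numbers invented; a normal approximation on the log risk-ratio scale rather than a full Bayesian model), a conjugate normal-normal update with a skeptical prior pulls the implausible 30-fold reduction strongly back toward effect sizes the prior finds credible:

```r
# Toy normal-normal update on the log risk-ratio scale (all numbers invented).
trial_logrr <- log(1 / 30)   # observed RR of 1/30 (a 30-fold risk reduction)
trial_se    <- 0.80          # wide standard error, as expected from a small trial
# Skeptical prior: centred on "no effect", large true effects considered unlikely
prior_mean  <- 0
prior_sd    <- 0.35          # ~95% prior mass for RR between exp(-0.7) and exp(0.7)

w_prior   <- 1 / prior_sd^2
w_trial   <- 1 / trial_se^2
post_mean <- (w_prior * prior_mean + w_trial * trial_logrr) / (w_prior + w_trial)
post_sd   <- sqrt(1 / (w_prior + w_trial))

exp(post_mean)                               # posterior RR, pulled back toward 1
exp(post_mean + c(-1.96, 1.96) * post_sd)    # approximate 95% credible interval for the RR
```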

2 Likes

> The ANDROMEDA-SHOCK trial discussion on this platform discussed how a Bayesian approach might allow a more nuanced interpretation of smaller RCTs whose results don’t reach “statistical significance,” but where early data seems to be “leaning” in one direction.

While I agree with our host that Bayesian methods are generally preferable, there are simple transforms of p values that can minimize this error. I took a recommendation of Sander Greenland seriously and started studying p value transformations, also known as “omnibus significance tests” or “p value combination methods.”

By following the logic of testing methods to this point, I ultimately have to agree with Professor Harrell’s recommendation that the precision of the estimate is more informative than emphasis on power of a test.

> Basing sample size calculations on the margin of error can lead to a study that gives scientifically relevant results, even if the results are not statistically significant. [emphasis in original – BBR Section 6.7]
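A minimal sketch of what that precision-based planning can look like (arbitrary SD and precision target, normal approximation): choose n so that the expected 95% CI half-width for the mean difference meets the target, with no power calculation involved.

```r
# Sketch: precision-based sample size for a two-arm comparison of means.
sigma  <- 15      # assumed SD of the outcome (invented)
target <- 5       # desired 95% CI half-width (margin of error) for the difference
# Half-width is approximately 1.96 * sigma * sqrt(2/n) with n per arm, so solve for n:
n_per_arm <- ceiling((1.96 * sigma * sqrt(2) / target)^2)
n_per_arm
# Check: approximate half-width achieved with that n
1.96 * sigma * sqrt(2 / n_per_arm)
```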

The more I compare and contrast the theory of p values, with their actual use in practice, the more difficult I find it to accept research summaries of “positive” and “negative” trials.

I had a colleague ask me “Why is it that I can find hundreds of studies demonstrating positive effects (often with sample sizes in the hundreds), while a meta-analysis will reject the effect?” This was in the exercise science and physical rehabilitation literature.

This is a complicated issue. It appears 2 errors are being made:

  1. the informal use of improper vote count procedures based on significance tests, and
  2. improper understanding of hypothesis tests and confidence intervals (at the meta-analysis level).

I wrote the following 4 R functions. To keep things simple, the simulations assume a fixed effect for all studies. It would be interesting to simulate a random effects distribution of P values.

The first 2 simulate the behavior of p values under the null and under an alternative of “moderate” effect: i.e., a standardized effect size of 0.53 in the code below. The median p value will be between 0.25 and 0.75 when the null is true.

1. P values under a true (point) null hypothesis.
In this scenario, p values take on a uniform distribution. Calling this
function with any integer N is equivalent to simulating N studies with a true difference of 0.

# ---- Code for null p value distribution ----

null_true_p <- function(n) {
  # Under a point null hypothesis, p values are uniform on (0, 1)
  null <- runif(n)
  result <- quantile(null, probs = c(0.00, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(null)
  result
}

# ---- End code for null p value distribution ----

2. P values under a true difference (standardized effect size of 0.53)
The median P value varies based on how many studies are done, but is frequently between 0.75 and 0.95 with small numbers of studies (a common scenario in meta-analysis); the one-sided test below is oriented so that p values near 1 point in the direction of the true effect. Note the skew of the distribution. But skeptics will almost always be able to find a large number of “contradictory” studies going by “significance” alone.

# ---- Code for moderate true effect ----

null_false_p <- function(sample, sim_size) {
  # sim_size studies, each with `sample` observations per arm; true difference = 8 (SD 15)
  p_vals <- numeric(sim_size)
  for (i in 1:sim_size) {
    x <- rnorm(sample, 100, 15)
    y <- rnorm(sample, 108, 15)
    # One-sided test oriented so that p values near 1 indicate the true (y > x) direction,
    # matching the qnorm(p) convention used in the meta-analysis functions below
    p_vals[i] <- t.test(x, y, alternative = "greater")$p.value
  }
  result <- quantile(p_vals, probs = c(0.00, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(p_vals)
  result
}

# ---- End code for moderate true effect ----

3. Meta-analysis under null
The following code simulates a meta-analysis based on p value combination using Stouffer’s method (i.e., the inverse-normal transform). Under the null, the combined statistic is standard normal, so 95% of the one-tailed values fall below 1.645, 99% below 2.326, and 99.9% below 3.09.

# ---- Code for meta-analysis under the null ----

true_null_meta <- function(sample, sim_size) {
  # sim_size meta-analyses, each combining `sample` null studies by Stouffer's method
  stouffer <- numeric(sim_size)
  for (i in 1:sim_size) {
    null        <- runif(sample)                # uniform p values under the null
    z_score     <- qnorm(null)                  # inverse-normal transform
    stouffer[i] <- sum(z_score) / sqrt(sample)  # combined statistic is ~ N(0, 1) under the null
  }
  result <- quantile(stouffer, probs = c(0.00, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(stouffer)
  result
}

# ---- End code for meta-analysis under the null ----

4. Meta-analysis with true effect
Finally, here is Stouffer’s method when the alternative is true, and there is a difference. The magnitude of Stouffer’s statistic will depend on the number of studies (as well as the magnitude of effect).

# ---- Code for meta-analysis of a true moderate effect ----

false_null_meta <- function(sample, sim_size) {
  # sim_size studies, each with `sample` observations per arm and a true difference of 8 (SD 15);
  # stouffer[i] is the running Stouffer statistic after combining the first i studies
  z_scores <- numeric(sim_size)
  p_vals   <- numeric(sim_size)
  stouffer <- numeric(sim_size)
  for (i in 1:sim_size) {
    x <- rnorm(sample, 100, 15)
    y <- rnorm(sample, 108, 15)
    p_vals[i]   <- t.test(x, y, alternative = "greater")$p.value
    z_scores[i] <- qnorm(p_vals[i])
    stouffer[i] <- sum(z_scores[1:i]) / sqrt(i)   # normalize by the number of studies combined
  }
  result_z <- quantile(z_scores, probs = c(0.00, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  result_s <- quantile(stouffer, probs = c(0.00, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(stouffer)
  print(result_z)
  result_s
}

# ---- End code for meta-analysis of a true moderate effect ----

Those who want to run the sims can do so at https://rdrr.io/snippets/. The plain text code can be found on pastebin here

You can call them, supplying the appropriate integers, e.g.:

null_true_p(30)
null_false_p(15,30)

This will simulate 30 random p values, and 30 skewed p values, with each
individual p in the latter being based on a sample size of 15. A histogram will be
drawn for both results.

For a meta-analysis simulation, each function requires 2 values – a sample
size for the individual (meta) study, and sim_size that determines how many
(meta) studies to simulate.

The code is very rough at this point. I didn’t account for power in these simulations; that is done informally via increasing the sample size and/or number of studies. That would be a very useful addition to improve intuition in examining a collection of studies. But I think it helps illustrate the common problems with p value and “significance” test (mis)-interpretations.
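For anyone who wants the power piece made explicit (my addition, not part of the original functions), base R's power.t.test gives the power that a correctly directed one-sided test would have for the effect size hard-coded above (delta = 8, sd = 15) at a given per-arm n, which helps calibrate how many "non-significant" studies to expect:

```r
# Power that a one-sided test in the direction of the true effect would have,
# for the per-study settings used above: delta = 8, sd = 15, n = 15 per arm, alpha = 0.05
power.t.test(n = 15, delta = 8, sd = 15, sig.level = 0.05,
             type = "two.sample", alternative = "one.sided")
```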

Reference
Median of the p Value Under the Alternative Hypothesis

2 Likes

Thanks for your response. I should have been more specific when I said “unreliable” (especially in a stats forum)- I meant “unreliable” in a strictly clinical context i.e., “should I use the results of this small RCT to change my practice now (because the study has an impressive point estimate with a confidence interval that doesn’t cross 1), or is there a pretty good chance that this finding will be overturned by a larger, subsequent RCT at some point down the road and I’ll have to reverse my practice?”

I realize that the results of some small RCTs could hold up over time and that small trial size, at least in theory, maybe should not be automatically disqualifying. But in practice, the question is how to identify which small trials to trust.

Your statement “if the findings suggest a null effect of the treatment is unlikely, then the fact that the trial is small should not be of great concern” leaves me a bit confused. There are many examples in medicine (particularly in certain specialties) where small trials with “statistically significant” results have proven misleading. This paper looked at FDA’s practice of requiring two “positive” trials before approval of a new drug: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173184 The findings are somewhat disconcerting and suggest that we need to carefully consider a number of contextual factors (e.g., sample size, number of “positive” versus “negative” trials, effect size) when deciding whether evidence of treatment efficacy is sufficiently strong.

Re “exchangeability”- I’m afraid this isn’t a term I’m familiar with- I’ll do some reading.

Exchangeability corresponds to comparability, though it is more formally defined. If you assign a group of patients at random to treatment A or treatment B, your expectation is that, were neither group treated or were both groups to receive the same treatment (say A), the risk (level) of the outcome of interest (i.e., the potential outcome) would be the same in both groups. Exchangeability is an assumption needed for causal inference and pertains to both randomized experiments and observational studies. When you compare the distribution of prognostic factors in Table 1 of an RCT, you are evaluating the likelihood that the assumption of exchangeability is “true”.

Regarding the interpretation and use of findings from RCTs, large or small, IMO, you won’t find a simple rule to make a call. You’d be better off if you evaluate each study the same way you evaluate each individual patient. Patients may share general characteristics, say malaise and fever, but you still need to do a clinical evaluation of each one to make an accurate diagnosis. You should do the same when judging the reliability of any study. There is no single formula to identify whether a trial, small or large, is trustworthy or not. You need to evaluate the likelihood of different types of bias in each study, as well as the precision of the results (assuming the magnitude of the biases is small).

I understand your concern about my statement: “if the findings suggest a null effect of the treatment is unlikely, then the fact that the trial is small should not be of great concern”. I did not mean findings from a small trial cannot later be disproven. But so can the findings from large trials. When I look at the results of any RCT, large or small, I don’t pay much attention to the sample size or p-values. Instead, I check the 95% confidence interval (CI) for the effect of the treatment. Suppose the average effect is a risk ratio (RR) of 0.75, with a 95% CI of 0.70 to 0.80. This 95% CI tells me that values between 0.70 and 0.80 are compatible with the data generated in the study, and that values outside that interval are less and less compatible with the data the farther they are from 0.70 and 0.80. I would not believe this treatment has a null effect, since an RR of 1 seems very incompatible with the data from the study. If I have no reason to attribute the effect of the treatment to some type of bias, I would not be very concerned that the trial is small (say 20 patients). Instead, I’d think the treatment has a very consistent effect in the subset of patients who were eligible for the RCT. Of course, I would also consider external evidence in the interpretation of the findings of any study (by external I mean evidence from outside the study itself). For instance, if the treatment effect above corresponded to the effect of aspirin on mortality in patients with Guillain-Barré syndrome, I’d be hesitant to incorporate the findings of the study into my practice, even if I can’t explain why aspirin seems protective. Finally, I don’t think we should be too concerned about changing our practice in the face of new evidence. After all, our commitment to patients is to provide the best advice according to current knowledge, even if that knowledge is imperfect.

1 Like

That’s what I thought you might mean by “exchangeability.” If exchangeability is synonymous with “comparability” of covariate distribution between treatment arms, then I’m confused because I can’t reconcile advice to consider exchangeability with the advice provided in another thread in this forum (which discusses problems with searching for covariate imbalance in Table 1 of RCTs…).

With regard to the need to examine each “positive” small trial individually and make a decision about whether to trust it or not, I can’t say I entirely agree. Historical precedent seems to suggest that the results of small RCTs are subsequently overturned with sufficient frequency that I will be unlikely (in most circumstances) to allow small positive trials to change my practice- I will wait for evidence from larger trials. As another poster noted above, it’s common for drugs in development to generate non-positive Phase III results after promising Phase II results. I expect that many drug researchers were certain that their drug would pass Phase III with flying colours, only to face acute disappointment. If researchers with a lot more experience than me find it hard to predict “winners” based on small RCTs, I’m not sure I would be able to do any better. Since many Phase II trials are small RCTs, changing my practice based on a couple of published small RCTs seems, to me, to be analogous to acting on the results of Phase II trial(s).

I’m not sure that we’ll agree on how to interpret the hypothetical 20-person trial you describe above. I find it hard to imagine any scenario where I’d implement the results of a trial this size in practice, even if the confidence interval excluded 1 by a large margin. While I might briefly consider that this result could reflect a large true effect, I’d be more likely to conclude (knowing that most true effects are small to moderate in size) that this result might be attributable to “chance” (perhaps related to a non-representative patient sample) and to wait for results of a larger trial. The paper referenced in my last post (the one looking at FDA studies) suggests that the evidentiary value of a “positive” RCT is at least partly a function of how many other trials have been run examining the same question. The authors seem to be saying that we should probably be a lot more skeptical about the evidentiary value of two trials with p<0.05 if those trials are occurring in the context of multiple “non-positive” trials (e.g, 2 “positive” trials out of 10 trials run) than if the two positive trials are occurring in the context of only two total trials that have been run. By extension, if we imagine that the only small positive RCTs that get published are those that happen, by “chance” to have achieved statistical significance, and that there may be other unpublished “non-positive” small trials on the same subject, it seems clear (to me at least) that we should be cautious about changing our practice on the basis of small “positive” RCTs.
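A back-of-the-envelope calculation makes the same point (assuming independent trials of a truly null treatment, and counting as "positive" only a two-sided p < 0.05 in the favourable direction, roughly 0.025 per trial):

```r
# Probability of getting at least 2 "positive" trials by chance alone,
# assuming a truly null treatment and ~0.025 per trial for a favourable p < 0.05.
p_pos <- 0.025
1 - pbinom(1, size = 10, prob = p_pos)   # >= 2 positives when 10 trials are run (~2.5%)
p_pos^2                                  # both positive when only 2 trials are run (~0.06%)
```

Roughly a 2.5% chance of two or more chance "positives" among ten null trials, versus about 0.06% when only two trials are ever run, which is why the total number of trials matters.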

Thanks for an interesting dialogue.

> I can’t reconcile advice to consider exchangeability with the advice provided in another thread in this forum (which discusses problems with searching for covariate imbalance in Table 1 of RCTs…).

I am not sure what your concern in this regard is. When prognostic factors appear balanced in treated and untreated participants (Table 1 of an RCT), the assumption of exchangeability becomes more likely, though it can never be proven.

> Historical precedent seems to suggest that the results of small RCTs are subsequently overturned with sufficient frequency that I will be unlikely (in most circumstances) to allow small positive trials to change my practice- I will wait for evidence from larger trials.

I understand your position. However, I assume that if you are somehow interested in the results of an RCT, small or large, it is because knowledge on new or existing treatments may result in better care for your patients. IMO, discarding all small trials a priori, “positive” or not, on the assumption that their findings are erroneous, may not be the best way to achieve this goal. Of course, you can wait for larger RCTs to change your practice. However, this surely depends on the context. Consider, for example, the extreme case of ketamine in the treatment of clinical rabies. Should we wait for large trials?

> I’m not sure that we’ll agree on how to interpret the hypothetical 20-person trial you describe… While I might briefly consider that this result could reflect a large true effect, I’d be more likely to conclude … this result might be attributable to “chance” (perhaps related to a non-representative patient sample) and to wait for results of a larger trial.

Statistically speaking, any finding could be attributed to chance. However, in this example, the probability that the true effect is null (RR=1) is extremely low. Concluding the observed effect is a chance finding would be erroneous in almost all similar cases. Conducting the study in a non-representative sample of patients does not increase the likelihood of a non-null finding if the true treatment effect is in fact null. Actually, most if not all RCTs are conducted in non-representative samples of patients. Randomizing non-representative samples of patients does not result in bias, though it may compromise the generalizability of the findings to other patients.

> The authors seem to be saying that we should probably be a lot more skeptical about the evidentiary value of two trials with p<0.05 if those trials are occurring in the context of multiple “non-positive” trials (e.g, 2 “positive” trials out of 10 trials run) than if the two positive trials are occurring in the context of only two total trials that have been run.

I feel counting “positive” and “negative” trials may not do the trick. The assessment of the validity of a study should be based on how the study was designed and conducted, not on its findings or on the findings of other studies. If we have 10 trials and only two of them show a beneficial effect of a treatment, should we conclude the treatment is not beneficial? Maybe. Maybe not. It all depends on the strength (quality) of the evidence from each trial. If all trials in the set of null trials were poorly conducted, while the two discordant trials were well conducted, then we should accept the findings of the latter. If the two discordant ones were poorly designed, we should disregard their findings (independently of the quality or findings of previous trials). If all trials were well designed and conducted, then we should look for reasons to explain the heterogeneity of the treatment effects (random variability, differences in eligibility criteria, compliance with treatment, administration of the intervention, ascertainment of the outcome, masking of treatments, co-interventions, etc.). Briefly, any simple rule, like counting trials, would be insufficient to assess the findings of each individual trial and/or the cumulative evidence from the whole set of trials.

I also appreciate the exchange.

I agree with some, but not all, of the points above. I still don’t see how exchangeability/comparability of treatment groups is relevant to a discussion of the stability of the results of small RCTs, unless a covariate imbalance stems from some type of mechanical failure of the randomization process itself (which I think might be fairly rare, but perhaps more common in smaller trials with less quality control?). Assuming that the randomization process is sound, then my impression (from the Table 1 thread) is that internal validity is maintained, even if we identify between-arm imbalances in the distribution of any particular set of covariates we might choose to measure. In other words, I don’t see that an imbalance in the distribution of any particular set of covariates would bias the trial’s result (i.e., move the point estimate)…

I’m afraid that after my reading, I still don’t agree that the chance is “extremely low” that the “true” effect is null if a tiny trial shows a marked, statistically significant effect.

Re non-representative patients “biasing” a trial’s results, I don’t think I said this anywhere. I was referring to potential suboptimal generalizability of the results of a small trial to a larger, more heterogeneous patient population (where the same effect might not be demonstrable). I’ll continue to wait for larger trials in most cases (particularly since most small trials aren’t conducted in near universally fatal conditions like rabies… :) )

I do agree, though, that there are potential problems with the idea of counting “positive” and “negative” trials to decide whether there is or isn’t likely to be a true effect. I think FDA struggles with this when deciding whether to approve new drugs. A good example from a few years ago was the antidepressant gepirone - you can look for the FDA documents if interested. A big problem seems to be distinguishing “negative” trials (where the drug lacked intrinsic efficacy) from “failed” trials (where the trials lacked assay sensitivity, resulting in failure to reveal the drug’s intrinsic efficacy). From what I’ve read, poor assay sensitivity seems to be fairly common in trials in certain disease areas (e.g., depression, chronic pain). Some trials in these areas seem to colloquially get labelled “negative” in the lay media but were actually “failed” for one or more reasons - e.g., they enrolled subjects who didn’t actually have the disease in question (or who had such mild symptoms that they got better anyway, even without the drug), or because we don’t have optimally reliable ways to measure treatment effect for certain conditions. Sometimes these problems with assay sensitivity might get ironed out over the course of drug development, such that later, bigger trials are better able to demonstrate a drug’s intrinsic efficacy. So I agree that caution is needed when evaluating the “ratio” of ostensibly positive to ostensibly “negative” trials.

Randomization does not guarantee exchangeability. The very definition of random implies imbalances in prognostic factors could happen, even if they are more and more rare as the sample size increases. If there are imbalances in Table 1, the treatment groups are not exchangeable and internal validity is compromised. In this case, adjusting for prognostic factors may improve validity.
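For concreteness, this is the kind of adjustment being described (a sketch only, with simulated data and a hypothetical prognostic factor x): include the prognostic factor in the outcome model so that the treatment comparison conditions on whatever imbalance happened to occur, which also tightens the standard error.

```r
# Sketch: adjusting the treatment comparison for a prognostic factor (simulated data).
set.seed(6)
n   <- 40                                   # a small trial
arm <- rep(c(0, 1), each = n / 2)           # randomized assignment
x   <- rnorm(n)                             # strong prognostic factor
y   <- 2 * arm + 3 * x + rnorm(n)           # true treatment effect = 2

coef(summary(lm(y ~ arm)))["arm", ]         # unadjusted estimate (drifts with chance imbalance in x)
coef(summary(lm(y ~ arm + x)))["arm", ]     # adjusted estimate (conditions on x; smaller SE)
```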

> Randomization does not guarantee exchangeability. The very definition of random implies imbalances in prognostic factors could happen, even if they are more and more rare as the sample size increases.

This very important point is carefully elaborated in the following:

So that leads me to the following question:

Are small randomized trials worth conducting?

In a previous post @f2harrell wrote:

> Most statisticians I’ve spoken with who frequently collaborate with investigators in doing frequentist power calculations believe that the process is nothing more than a game or is voodoo. Minimal clinically important effect sizes are typically re-thought until budget constraints are met,…

For the purpose of this discussion, I’ll define a “small” experiment as one where N < 200 per arm. I’ve seen this magic limit in a few simulation studies comparing randomization to minimization, so I will use it as a starting point to elicit how much value (if any) randomization is bringing to the possible experiment.

One of the big problems of “small” RCTs is the issue of random confounding. It appears (from a practical POV) that 200 per treatment arm is the minimum needed to silence everyone but the most vocal of skeptics who point out:

> Critics of RCTs will argue that because there’s also always the possibility of there being an imbalance of known or unknown covariates between groups, RCTs cannot make proper causal inferences, especially small RCTs that are “unable to distribute confounders effectively.”

Simulations suggest that small RCTs are much more prone to this problem.

There is a credibility issue with small RCTs: the AHRQ suggests that there are times when “…it may be appropriate to focus on the one or the few “best” trials rather than combining them with the rest of the evidence.”

AHRQ (2018) Quantitative Synthesis: An Update – Sections 1.3 - 1.5.

From a pure decision theory point of view, randomization is not needed for correct inference. To be more precise: For any randomized experimental design, there exists a nonrandom design that will also provide the correct answer (more efficiently).

The proof sketch involves maximizing a particular criterion. In this case, the “optimal” experiment is one that gives the most information, with the smallest sample size.

If you think about how statistics are applied in industry (and health care quality), the key point of any “quality” initiative (e.g., Six Sigma) is to minimize variance. In finance, the key to maximizing the compounded growth rate is also to minimize variance. Indeed, in finance risk is defined as the variance of returns. Maximizing information is synonymous with minimizing risk (variance) in this context.

So why do we maximize variance in the case of experiments by insisting on randomization?

How relevant is this to practice? I’m not sure. The claim is one of existence and does not provide any guidance on how to derive the design of such an experiment in a particular case.

I accept that, from a normative POV, causal inference can be made without randomization. I also accept that in certain contexts (i.e., research with human subjects) randomized designs are often easier to derive and conduct. In the case of N-of-1 designs, randomization seems to be the only way to proceed. But I don’t think non-randomized designs are impossible in human subjects research; given certain budgetary limitations, they might be the best way to proceed. Even the CONSORT guidelines recognize minimization as a valid alternative to randomization.

CONSORT (2010) Explanation and Elaboration

> Nevertheless, in general, trials that use minimization are considered methodologically equivalent to randomized trials, even when a random element is not incorporated. [See Box 2].

The late Douglas Altman has also written on minimized designs:

> But stratified randomization using several variables is not effective in small trials. The only widely acceptable alternative approach is minimisation, a method of ensuring excellent balance between groups for several prognostic factors, even in small samples.

Altman, D. (2005) Treatment Allocation by Minimization

There is a trade-off that needs to be made between the value of the information and the cost of the experiment. This appears to be difficult to do in the frequentist philosophy; its influence has all but silenced discussion of nonrandom, but valid, research designs.

For research areas where small samples are the only ones economically obtainable, a Bayesian approach to experimentation is critical to maximizing the information obtained, for a defensible decision.

There exist algorithms that explicitly minimize the difference between treatment and control groups. I raise these points because my field (allied health) is never going to have sample sizes of even a small drug trial, and if any research we produce is going to be dismissed simply because it is “small sample” (as the AHRQ link suggests), that presents a big problem for a number of professions.
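To make the minimization idea concrete, here is a bare-bones sketch of a Taves-style deterministic scheme (my own simplified version: two made-up binary factors, ties broken at random, and no biased-coin element; the helper minimize_assign is hypothetical):

```r
# Sketch: Taves-style minimization over two binary prognostic factors (toy example).
set.seed(7)
minimize_assign <- function(factors) {
  n   <- nrow(factors)
  arm <- character(n)
  arm[1] <- sample(c("T", "C"), 1)
  for (i in 2:n) {
    # total imbalance, summed over factors, if patient i were put in arm `a`
    imbalance <- function(a) {
      tot <- 0
      for (f in names(factors)) {
        same        <- which(factors[[f]][seq_len(i - 1)] == factors[[f]][i])
        arms_so_far <- c(arm[same], a)
        tot <- tot + abs(sum(arms_so_far == "T") - sum(arms_so_far == "C"))
      }
      tot
    }
    s_t <- imbalance("T")
    s_c <- imbalance("C")
    arm[i] <- if (s_t < s_c) "T" else if (s_c < s_t) "C" else sample(c("T", "C"), 1)
  }
  arm
}

patients <- data.frame(sex    = sample(c("F", "M"), 30, replace = TRUE),
                       severe = sample(c("yes", "no"), 30, replace = TRUE))
arm <- minimize_assign(patients)
table(patients$sex, arm)     # close balance on sex across arms
table(patients$severe, arm)  # close balance on severity across arms
```

Common variants (e.g., Pocock-Simon) assign the minimizing arm with high probability rather than deterministically, which restores a random element.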

Too much informal, frequentist reasoning has influenced what is perceived to be “rigorous.” The insistence on randomization for rigor has gone from a (very) useful heuristic, to a mistaken normative criterion.

I suspect that last statement is controversial, but I’d very much appreciate scholarly discussion and debate about it.

References (far from exhaustive):
Taves, D. (2010) The use of minimization in clinical trials
Treasure, T; MacRae, K (1998) Minimization: The Platinum Standard for Trials?
Lachin, J; Matts, J; Wei, LJ (1988) Randomization in clinical trials: Conclusions and recommendations

For a contrary view:
Senn (2008) Why I Hate Minimization

2 Likes

A couple of good papers on this topic:

http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf

nature.com/articles/nrn3475

1 Like

I’ve enjoyed this discussion so far. I reworked R_cubed’s code a bit to be a single interactive executable. https://pastebin.com/4CtSHixg

1 Like