Inequality of Variances in an RCT



Prof. Harrell has wisely suggested moving another Twitter discussion on statistics here. The original discussion is found here (

There are several issues:

(1) The initial tweet seems to suggest that it is reasonable to analyze change in a continuous outcome (weight) in each arm of an RCT separately (i.e., final minus initial, presumably with a one-sample t-test). To me this seems very wrong. I would think that an ANCOVA to estimate the between-group difference in final weight, adjusting for baseline weight and possibly other covariates, is what we really want, but I would appreciate input from others.

(2) A deeper debate arose from the construction of this example with markedly different variances in each arm. I would think this would be very unlikely in an RCT unless there was significant heterogeneity of treatment effect or ceiling/floor effects or a smallish sample size with random variation. Am I missing something?
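To make the ANCOVA in point (1) concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the weights are invented and noise-free (so the fit recovers the coefficients exactly), and the helper names `solve_linear` and `ols` are hypothetical. A real analysis would use a statistics package that also reports standard errors.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Invented data: 4 control (treat = 0) and 4 treated (treat = 1) participants.
baseline = [70.0, 82.0, 95.0, 110.0, 74.0, 88.0, 101.0, 120.0]
treat = [0, 0, 0, 0, 1, 1, 1, 1]
# Assumed data-generating rule, with no noise: final = 2 + 0.95 * baseline - 4 * treat
final = [2.0 + 0.95 * b - 4.0 * t for b, t in zip(baseline, treat)]

# ANCOVA: regress final weight on treatment indicator and baseline weight.
design = [[1.0, float(t), b] for t, b in zip(treat, baseline)]
intercept, treatment_effect, baseline_slope = ols(design, final)
print(treatment_effect, baseline_slope)
```

The coefficient on `treat` is the between-group difference in final weight for two people with the same baseline weight, which is the quantity of interest. Separate within-arm tests of change never estimate it.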


Thanks for posting here Venk. This example raises many issues, including:

  • You are right that analyzing change from baseline by treatment group is a terrible idea, for reasons that are detailed in Chapter 14 of BBR.
  • ANCOVA on the final weight is the preferred method, and answers the crucial question of whether two persons who began at the same weight end with the same weight.
  • The issues are deeper than ANCOVA because the choice of metric for weight should not be taken for granted. The impression one gets from looking at two variances depends strongly on the transformation of weight that is used before computing the variance. The discussion seems to implicitly assume the identity transformation.
  • A more general way to ask the question is to estimate the odds or probability that a person will have a weight exceeding y given starting weight x, for one treatment compared with the other treatment. For example, the proportional odds model estimates the odds ratio for final weight > y for treatment B vs treatment A if the person started with weight x under either treatment. This odds ratio will be the same no matter which transformation of final weight is used.
  • One may find that the initial weight needs to be nonlinearly transformed.
  • Heterogeneity of treatment effect may come into play, or it may be just that one treatment has a huge effect.

So don’t automatically think means and variances on the raw scale. Think more generally.
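A small sketch of why the raw scale should not be taken for granted. The data are invented, and the treatment is assumed to act multiplicatively (every weight in arm B is 1.5 times a weight in arm A). The helper `rank_sum` is a hypothetical name introduced here. The variance ratio depends on the transformation chosen, while a rank-based summary does not.

```python
import math
import statistics
from collections import Counter


def rank_sum(y, x):
    """Sum of the midranks of y (ranked in the pooled sample) where x == 1."""
    counts = Counter(y)
    rank_of, below = {}, 0
    for v in sorted(counts):
        rank_of[v] = below + (counts[v] + 1) / 2  # average rank within a tie
        below += counts[v]
    return sum(rank_of[v] for v, g in zip(y, x) if g == 1)


arm_a = [60.0, 70.0, 80.0, 90.0, 100.0]
arm_b = [1.5 * w for w in arm_a]  # assumed purely multiplicative effect

# Variance ratio on the raw scale versus the log scale.
raw_ratio = statistics.variance(arm_b) / statistics.variance(arm_a)
log_ratio = (statistics.variance([math.log(w) for w in arm_b])
             / statistics.variance([math.log(w) for w in arm_a]))

# Rank sum for arm B on the raw scale and after a log transformation.
y = arm_a + arm_b
x = [0] * len(arm_a) + [1] * len(arm_b)
w_raw = rank_sum(y, x)
w_log = rank_sum([math.log(v) for v in y], x)

print(raw_ratio, log_ratio, w_raw, w_log)
```

On the raw scale arm B looks far more variable than arm A. On the log scale the two variances agree, and the rank sum is identical under both scales because a strictly increasing transformation preserves the ordering.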


Great point! I think I was thinking about non-linearity when I mentioned floor/ceiling effects but you are right that non-linearity can take many forms other than those two.

I say this slightly tongue in cheek, but with regard to switching to the odds of weight greater than y, aren’t you then in the #dichotomania realm?


I think the idea here is to make “y” the minimal clinically important difference (MCID).


No, the idea is to do this for all y, so no dichotomania. If the proportional odds assumption is satisfied, the choice of y is irrelevant; you just need to fix it temporarily.


I’m a bit confused. Won’t this generate a plethora of models?


No. The proportional odds model, a generalization of the Wilcoxon test, uses a single model for P(Y \geq y | X). To get that probability you take the y^{\text{th}} intercept and add it to X\beta. But if you compute the odds that go with the y-specific probability for one covariate setting X, compute the same for a different covariate setting, and take the ratio of the two odds that Y \geq y, the intercept cancels out. So the odds ratio is the same for every y.
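The cancellation can be written out explicitly. This is a sketch assuming the logit link; the symbols \alpha_y (the y^{\text{th}} intercept) and X_1, X_2 (two covariate settings) are notation introduced here for illustration.

$$
\operatorname{logit} P(Y \geq y \mid X) = \alpha_y + X\beta
\quad\Longrightarrow\quad
\operatorname{odds}(Y \geq y \mid X) = \exp(\alpha_y + X\beta)
$$

$$
\frac{\operatorname{odds}(Y \geq y \mid X_1)}{\operatorname{odds}(Y \geq y \mid X_2)}
= \frac{\exp(\alpha_y + X_1\beta)}{\exp(\alpha_y + X_2\beta)}
= \exp\{(X_1 - X_2)\beta\}
$$

The right-hand side contains no \alpha_y, so one \beta describes the treatment comparison at every cutoff y.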


Is there a reference that shows or proves that the proportional odds model is an extension of the Wilcoxon test?


I can’t find the reference at the moment. I think it was a paper by Whitehead. Briefly, if you have a single covariate and it is binary, the numerator of the Rao efficient score test for testing for a difference between the x=0 and x=1 groups is exactly the Wilcoxon-Mann-Whitney two-sample rank sum test statistic, using midranks for ties. The likelihood ratio \chi^2 test probably works even better, and handles extreme ties better than the Wilcoxon test does. The limiting case of extreme ties is the binary logistic model.
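The link to the rank sum can be checked numerically. The sketch below rests on assumptions stated here rather than taken from the cited paper: a logit link, a model written as logit P(Y \geq y | x) = \alpha_y + x\beta, and intercepts fixed at their null values (the pooled empirical proportions). Under those assumptions the score contribution of an observation with value y is x[1 - P(Y \geq y) - P(Y > y)], and the total equals (2/N)(W - n_1(N+1)/2), where W is the midrank sum in the x=1 group. The data and function names are invented for illustration.

```python
from collections import Counter


def midranks(values):
    """1-based ranks in the pooled sample, averaged within ties."""
    counts = Counter(values)
    rank_of, below = {}, 0
    for v in sorted(counts):
        rank_of[v] = below + (counts[v] + 1) / 2
        below += counts[v]
    return [rank_of[v] for v in values]


def po_score_at_null(y, x):
    """Score for beta at beta = 0, using pooled proportions for the intercepts."""
    n = len(y)
    total = 0.0
    for yi, xi in zip(y, x):
        p_ge = sum(v >= yi for v in y) / n  # pooled P(Y >= yi)
        p_gt = sum(v > yi for v in y) / n   # pooled P(Y > yi)
        total += xi * (1 - p_ge - p_gt)
    return total


# Invented outcome with ties, and a binary group indicator.
y = [3, 5, 5, 7, 8, 8, 8, 10]
x = [0, 1, 0, 0, 1, 1, 0, 1]

n, n1 = len(y), sum(x)
w = sum(r for r, g in zip(midranks(y), x) if g == 1)  # Wilcoxon rank sum
score = po_score_at_null(y, x)
centered_w = (2 / n) * (w - n1 * (n + 1) / 2)

print(w, score, centered_w)
```

For these data the rank sum is 22.5 and both `score` and `centered_w` equal 1.125, so the score numerator is the Wilcoxon rank sum after centering at its null expectation and rescaling by 2/N.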


Interesting exchange… More generally, what does heteroscedasticity mean in the setting of an RCT? Perhaps that the treatment effect is heterogeneous (the weight example is a great one)?
I do think that if one were to seriously entertain heterogeneous treatment effects, then one is deep in the realm of subgroup analyses.


Indirect heterogeneity of treatment effect can be analyzed without envisioning subgroups. See here near the bottom where I discuss meta-analyses of within-treatment variance (under Evidence for Homogeneity of Treatment Effects).

But don’t talk about variation in variance (heteroscedasticity) until you have perfected the transformation of Y. Barring that, consider the more generalized approach I suggested that is transformation invariant.


Thank you, Frank!
I just found the paper (I guess):
John Whitehead, “Sample size calculations for ordered categorical data”, Statistics in Medicine (1993): 2257-2271


Right - here it is: