Should people stop using the WMW rank-sum test?

The author of this paper “t-tests, non-parametric tests, and large studies—a paradox of statistical practice?” puts it better than I can paraphrase:

“Non-parametric tests are most useful for small studies. Using non-parametric tests in large studies may provide answers to the wrong question, thus confusing readers. For studies with a large sample size, t-tests and their corresponding confidence intervals can and should be used even for heavily skewed data.”

I’d say most clinicians 1) use WMW for skewed, not just ordinal, data and 2) read Table 1 as a comparison of means and medians. Should people ditch non-parametric tests for Table 1?


I attended a presentation by @f2harrell. If I remember correctly, he is averse to e.g. the Wilcoxon rank-sum test because you don’t adjust for covariates (at least this is what he said about the log-rank test), and you can get more from a proportional odds model.

edit: re t-test, he said: use the Bayesian equivalent

I disagree with the quoted advice on many levels.

  • It is simply a myth that t-tests work well with large samples for heavily skewed data. Try it with a sample of size n=50,000 from a log-normal distribution. You’ll see that the non-coverage probabilities in both the left and right tails of the confidence interval are very bad. And the power of the t-test will suffer. People seem to get excited that the type I error is preserved when the distribution is not normal. That’s only a small part of the story. [Part of the reason for this problem is that the t-test assumes that its numerator (difference in means) is independent of its denominator (a function of SD). This independence falls apart under skewed distributions.]
  • The Wilcoxon test is virtually as powerful as the t-test at all sample sizes even if normality and equal variances hold.
  • The Wilcoxon test tests something more general than the mean: do values in group B tend to be larger than values in group A?
  • The article, in looking at discrepancies in p-values between the t-test and the Wilcoxon test, failed to note that it is the t-test that is problematic in the situations studied. The author misinterpreted his own simulations.
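The coverage claim in the first bullet is easy to check by simulation. Here is a minimal one-sample sketch (sample size, replication count, and seed are hypothetical; the asymmetry only shrinks slowly as n grows toward 50,000): the t-interval for the mean of a log-normal sample misses the true mean far more often on one side than the other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
n, reps = 1000, 4000                 # hypothetical sizes; the effect persists at much larger n
true_mean = np.exp(0.5)              # mean of LogNormal(0, 1)

x = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n))
xbar = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)
tcrit = stats.t.ppf(0.975, df=n - 1)

miss_low = np.mean(xbar - tcrit * se > true_mean)   # interval entirely above the true mean
miss_high = np.mean(xbar + tcrit * se < true_mean)  # interval entirely below the true mean
print(miss_low, miss_high)  # nominal is 0.025 each; here the two tails are badly unequal
```

Because the sample mean and SD are positively correlated under right skew, the t statistic is left-skewed, so the interval sits too low far more often than it sits too high.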

But don’t forget that it is inappropriate to use any statistical tests in Table 1. More about that coming today in another topic.


Ah, yes, I look forward to this. It’s become one of my favorite crusades since I had the ah-ha moment where it fell into place how silly hypothesis tests in Table 1 are (and I’ll say this as an author of dozens upon dozens of papers where I did this…)


@PaulBrownPhD: edit: re t-test, he said: use the Bayesian equivalent

Seems like a reasonable alternative, especially now that JASP offers a free point-and-click method for doing so. R also provides a nice package for a fairly straightforward Bayesian t-test equivalent.

@f2harrell: But don’t forget that it is inappropriate to use any statistical tests in Table 1. More about that coming today in another topic.

Interested to hear more about this. I never lend much credence to statistical tests in Table 1, but I admittedly provide the p-values because everyone else seems to want them. There are hills I am willing to die on; no p-values in Table 1 is not that hill.

Of course, I can be convinced :wink:

That is essentially how I feel as well.

I don’t want to spoil @f2harrell’s subsequent post, but here’s my brief take since I started thinking about this: an independent-samples t-test is meant to test whether a given set of data was likely to be observed under a null hypothesis that the population means of the two groups are the same. In an RCT, this means we would be testing whether the population mean for patients assigned to Drug A is different from the population mean for patients assigned to Drug B. But…that’s ridiculous, because (in most cases) we should know that they came from the same population (assuming the randomization was carried out correctly) and that any differences are simply due to the randomization, not an underlying difference in the population means. Therefore, the Table 1 p-value is a “probability” of something that we already know the answer to. The descriptive statistics in Table 1 are valuable - they let us see what the patients look like in the randomized treatment arms - but the p-values are rather poppycock once you get down to it.
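The point above can be seen in a toy simulation (all sizes, seed, and the covariate are hypothetical): draw every patient from one population, randomize to two arms, and run a Table 1-style t-test on a baseline covariate. The p-values are uniform by construction, so "significant" baseline differences appear at exactly the nominal rate and tell us nothing we didn't already know.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, n = 2000, 200                    # hypothetical number of trials and trial size
pvals = np.empty(reps)
for i in range(reps):
    age = rng.normal(60, 10, n)        # baseline covariate, one common population
    arm = rng.permutation(n) < n // 2  # pure randomization into two arms
    pvals[i] = stats.ttest_ind(age[arm], age[~arm]).pvalue
frac_sig = np.mean(pvals < 0.05)
print(frac_sig)  # close to 0.05: the test "detects" what randomization already guarantees
```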


Thank you. Looking forward to this. In your future post, would you comment on whether, and why, it should apply to Table 1 of observational studies too?

I’m glad you mentioned observational studies, too. While reading Andrew’s post I realized he and I were talking about different types of studies. For RCTs his explanation makes a great deal of sense, while in observational studies people tend to want to see where differences may lie. I will note that in large studies even minute differences pop up as “significantly” different.

Question: in an RCT, since with appropriate randomization we should see minimal differences between groups and confounding is mitigated, do we really need to show patient characteristics for each arm, or are overall characteristics enough? The overall column should be able to describe the patient population of interest, no?

I confess no experience with RCTs. So forgive two schoolboy questions: 1) how would you know if the randomisation was inadequate, and 2) people adjust for baseline imbalance in RCT outcome analyses (presumably diagnosed from table-1-tests) - is this wrong?

As someone else has just started a new topic on the subject of Table 1 and p-values, I advise taking further discussion there so as to avoid further derailment of this thread.

EDIT: and Frank has just put up a superb and detailed post that covers all of these questions.

Please refer to the other post for this discussion.

Yes, adjusting for observed imbalances is wrong. You are then adjusting for variables that are maximally collinear with treatment, reducing the power of the treatment comparison. And how you know the randomization is adequate comes from concealment of treatment assignment until the last second and from the random process used to create the randomization table.

As with so many tests, the real issue is whether significance testing should be paramount. It should not. The Mann-Whitney U easily turns into an index U/mn by dividing by the product of the two sample sizes. U/mn is a very useful effect-size measure and is equivalent to the area under the ROC curve. It equals 0.5 under H0 and 0 or 1 if the two samples don’t overlap. Report it with a confidence interval; see
Newcombe RG. Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 2: Asymptotic methods and evaluation. Statistics in Medicine 2006;25(4):559-573. DOI: 10.1002/sim.2324
Excel resources for calculating confidence intervals for this and other effect size measures related to proportions are available at
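Separately from those spreadsheet tools, the U/mn = AUC identity is easy to verify numerically. A minimal Python sketch (hypothetical data and group sizes; note scipy's Mann-Whitney U counts ties as 1/2, matching the AUC convention):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, 40)    # group A (hypothetical data)
b = rng.normal(0.8, 1.0, 60)    # group B, shifted upward
m, n = len(a), len(b)

# U for "b beats a"; scipy returns the U statistic of its first argument
U = stats.mannwhitneyu(b, a, alternative="two-sided").statistic
theta = U / (m * n)             # estimate of P(B > A), i.e. the AUC

# brute-force AUC over all m*n pairs, ties counted as 1/2
auc = np.mean(b[:, None] > a[None, :]) + 0.5 * np.mean(b[:, None] == a[None, :])
print(theta, auc)               # identical by construction
```

This only checks the point estimate; Newcombe's asymptotic confidence intervals from the cited paper are not implemented here.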


I wrote SAS code to derive the confidence intervals you described in Stat in Med:

%macro bisect(ci_var);
data temp;
retain lastpos lastneg &ci_var count 0;
set score2;
z=1.959964; * z for a 95 pct CI;
lastpos=1; lastneg=0; * initial values bracketing the root;
%do bisect=1 %to 40;
  &ci_var=(lastpos+lastneg)/2; * bisection midpoint;
  %if &ci_var=lower %then %do; root=&ci_var+z*se-p_index; %end;
  %else %if &ci_var=upper %then %do; root=&ci_var-z*se-p_index; %end;
  if root lt 0 then lastneg=&ci_var;
  else lastpos=&ci_var;
  count+1;
%end;
run;

data &ci_var;
set temp;
if count eq 40 then output;
keep &ci_var se;
run;
%mend bisect;



The R Hmisc package’s rcorr.cens function uses the general U-statistic formula to get a standard error for Somers’ D_{xy}, which can be back-transformed to the c-index (concordance probability).
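For reference, the back-transformation is just c = (Dxy + 1)/2. A small Python sketch with simulated data (hypothetical scores and outcome; not a substitute for rcorr.cens, which also handles censoring and the standard error), using the fact that for a binary outcome the c-index is the scaled Mann-Whitney U:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 300
score = rng.normal(size=n)                                   # hypothetical risk score
y = (rng.random(n) < 1 / (1 + np.exp(-score))).astype(int)   # binary outcome, logistic link

pos, neg = score[y == 1], score[y == 0]
U = stats.mannwhitneyu(pos, neg).statistic
c = U / (len(pos) * len(neg))   # concordance probability (c-index)
dxy = 2 * c - 1                 # Somers' Dxy; back-transform with c = (dxy + 1) / 2
print(c, dxy)
```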
