Are we using the Mann-Whitney test wrong?

Hello everyone,
While preparing some slides for an undergrad course, I stumbled upon a question about the Mann-Whitney U (MU) test that has been tormenting me.

The Mann-Whitney U test compares the distributions of two groups. The null hypothesis is that the two distributions are equal, so rejecting it implies that they differ.

However, rejecting the null hypothesis does not necessarily imply that the two distributions have two different central tendencies (mean or median).

To conclude that the central tendencies differ, an additional assumption must hold: the distributions of the two groups must have the same shape. The test is very sensitive to differences in shape caused by unequal variances or different skewness (Zimmerman, 2003).
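A small simulation makes the point concrete. The sketch below assumes two hypothetical populations chosen purely for illustration (they are not taken from Zimmerman's paper): a standard normal, and an exponential shifted so that its median is also 0. The helper name `mw_z_statistic`, the sample size, and the seed are arbitrary, and the z statistic uses the no-ties normal approximation.

```python
import math
import random
import statistics
from bisect import bisect_left


def mw_z_statistic(x, y):
    """Return (p_hat, z) for continuous samples without ties.

    p_hat estimates Pr(X < Y); z standardizes it under the null
    hypothesis of identical distributions (normal approximation).
    """
    n, m = len(x), len(y)
    xs = sorted(x)
    # U counts the pairs (x_i, y_j) with x_i < y_j.
    u = sum(bisect_left(xs, yj) for yj in y)
    p_hat = u / (n * m)
    se_null = math.sqrt((n + m + 1) / (12 * n * m))
    return p_hat, (p_hat - 0.5) / se_null


rng = random.Random(0)  # arbitrary seed, for reproducibility only
n = 5000                # arbitrary sample size per group

# Group 1: standard normal, median 0.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Group 2: exponential shifted by ln(2), so its median is also 0.
y = [rng.expovariate(1.0) - math.log(2) for _ in range(n)]

p_hat, z = mw_z_statistic(x, y)
print("sample medians:", statistics.median(x), statistics.median(y))
print("estimated Pr(X < Y):", p_hat)
print("z statistic:", z)
```

For these two populations the true value of Pr(X < Y) works out to roughly 0.56 rather than 0.5, so with large samples the test rejects even though the medians coincide.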

So, should the MU test be used to test the null hypothesis that the distributions are the same, rather than that their central tendencies (mean or median) are the same?
Am I using it wrong when I treat it simply as a non-parametric alternative to the t-test?

Thank you for your replies.


If we take the construction of the real number system (a complete Archimedean ordered field, if we want to be extremely precise) seriously, we know a priori that the probability that two real numbers drawn at random from a continuous distribution are exactly equal is zero.

The real question is: does our sample provide enough information to contradict the assumption of no detectable difference? If the distributions are different, QED.

The natural estimator associated with the WMW test is the Hodges-Lehmann estimator. You can think of it as a compromise between the mean and the median: it improves on the efficiency of the median and on the robustness of the mean for a wide variety of distributions. The HL estimator exists even when the mean does not (e.g., for the Cauchy distribution).
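As a minimal sketch of the two-sample version of this estimator: it is the median of all pairwise differences between the groups. The function name `hodges_lehmann_shift` and the toy data are made up for illustration.

```python
import statistics


def hodges_lehmann_shift(x, y):
    """Two-sample Hodges-Lehmann estimate of the shift from x to y:
    the median of all pairwise differences y_j - x_i."""
    return statistics.median(yj - xi for xi in x for yj in y)


x = [1, 2, 3]
y_clean = [4, 6, 8]
y_outlier = [4, 6, 800]  # same data with one wild value

print(hodges_lehmann_shift(x, y_clean))    # 4
print(hodges_lehmann_shift(x, y_outlier))  # 4, unaffected by the outlier

# The difference in means, by contrast, jumps from 4 to 268.
print(statistics.mean(y_clean) - statistics.mean(x))
print(statistics.mean(y_outlier) - statistics.mean(x))
```

This brute-force version needs all n × m differences, which is fine for small samples but not something to run on very large ones.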

Those who complain about variance heterogeneity with the WMW test should think about why they would prefer the t-test result. The real task is to figure out the cause of the heterogeneity and control for it, which is what proportional odds and other regression models do.


The Mann-Whitney U test tests the null hypothesis of stochastic equality, namely $\operatorname{Pr}(X_1<X_2) + \operatorname{Pr}(X_1=X_2)/2=\frac{1}{2}$ (for details, see the papers linked below). Unfortunately, this parameter is known by many different names. Fay and Malinovsky (2018) call it the Mann-Whitney parameter, while others call it the probabilistic index (e.g. Perme and Manevski (2019)).
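For intuition, the sample version of this parameter can be computed by brute force over all pairs, counting ties as one half. This is only a sketch: the function name `probabilistic_index` and the toy samples are invented for illustration.

```python
def probabilistic_index(x1, x2):
    """Sample estimate of Pr(X1 < X2) + Pr(X1 = X2) / 2.

    Counts every pair (a, b) with a from x1 and b from x2;
    tied pairs contribute one half.
    """
    less = sum(1 for a in x1 for b in x2 if a < b)
    ties = sum(1 for a in x1 for b in x2 if a == b)
    return (less + 0.5 * ties) / (len(x1) * len(x2))


x1 = [1, 2, 3]
x2 = [2, 3, 4]

p12 = probabilistic_index(x1, x2)  # 6 pairs with a < b, 2 ties -> 7/9
p21 = probabilistic_index(x2, x1)  # 1 pair with a < b, 2 ties -> 2/9
print(p12, p21)

# Identical samples give exactly 1/2, the value under the null.
print(probabilistic_index(x1, x1))  # 0.5
```

Multiplying the estimate by the number of pairs recovers the Mann-Whitney U statistic (with ties counted as one half), which is why the test and this parameter go together.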

It fails as a test of medians, as Divine et al. (2018) explain in their paper:

The most comprehensive overview of the MWU test I’m aware of is the paper by Fay and Proschan (2010):

The MWU test can be viewed from many different sets of assumptions, which Fay and Proschan call perspectives. Every perspective is associated with a different set of hypotheses and assumptions.

As such, the MWU is never an alternative to the t test in the sense of a direct nonparametric replacement, because it tests different hypotheses. The fact that the MWU test is commonly taught as a nonparametric alternative to the t test is a shame because it (mis)leads students to think that it’s basically the same test (i.e. for means or medians), just nonparametric. If you want to test differences in means nonparametrically, there are better alternatives, such as permutation tests or bootstrapping. On the other hand, the MWU test can be an excellent “replacement” of the t test, because the hypotheses it tests are actually interesting, relevant and easily interpreted.
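As a rough sketch of the permutation approach for a difference in means: pool the data, reshuffle the group labels many times, and see how often the reshuffled difference is at least as large as the observed one. The function name, the number of permutations, and the seed are arbitrary choices. Note that the reshuffling is justified by assuming the two groups are exchangeable under the null.

```python
import random
import statistics


def permutation_test_means(x, y, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        diff = abs(statistics.fmean(pooled[:n]) - statistics.fmean(pooled[n:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Adding 1 to both counts keeps the p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


group_a = list(range(1, 11))   # 1..10
group_b = list(range(11, 21))  # 11..20, clearly shifted upward

p_shifted = permutation_test_means(group_a, group_b)
p_same = permutation_test_means(group_a, group_a)
print(p_shifted)  # small: the groups barely overlap
print(p_same)     # 1.0: the observed difference is zero
```

The test statistic here is the difference in means, so the conclusion is about means, unlike the MWU test, which targets the probability that one group tends to exceed the other.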

Finally, I found the recent paper by Karch (2021) interesting.


Fantastic information! The Wilcoxon-Mann-Whitney test is a nonparametric t-test in one narrow sense: it is essentially equivalent to the ordinary t-test computed on the ranks instead of the raw values.
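The rank connection can be checked numerically. The sketch below computes the pooled-variance t statistic on the midranks and the WMW z statistic (without continuity correction), then verifies that one is a monotone function of the other: t = z·sqrt((N − 2)/(N − 1 − z²)), where N is the total sample size. The function names and the data are made up for illustration.

```python
import math


def midranks(values):
    """Ranks 1..N, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_t_and_wmw_z(x, y):
    """Return (t on ranks, WMW z) for two samples."""
    n, m = len(x), len(y)
    total = n + m
    ranks = midranks(list(x) + list(y))
    rx, ry = ranks[:n], ranks[n:]
    mean_x, mean_y = sum(rx) / n, sum(ry) / m
    diff = mean_y - mean_x

    # Pooled-variance t statistic computed on the ranks.
    ss_within = (sum((r - mean_x) ** 2 for r in rx)
                 + sum((r - mean_y) ** 2 for r in ry))
    t = diff / math.sqrt(ss_within / (total - 2) * (1 / n + 1 / m))

    # WMW z: same numerator, scaled by the variance of all ranks.
    grand = (total + 1) / 2
    var_all = sum((r - grand) ** 2 for r in ranks) / (total - 1)
    z = diff / math.sqrt(var_all * (1 / n + 1 / m))
    return t, z


x = [1.1, 2.3, 2.9, 4.0, 5.2]
y = [3.5, 4.8, 6.1, 7.7, 8.4]
t, z = rank_t_and_wmw_z(x, y)
total = len(x) + len(y)
t_from_z = z * math.sqrt((total - 2) / (total - 1 - z * z))
print(t, z, t_from_z)  # t and t_from_z agree up to rounding
```

The two statistics are not numerically equal, and they are referred to different reference distributions, so p-values can differ slightly in small samples even though the ordering of evidence is the same.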

The concordance probability (probability of stochastic ordering) is in my mind the best way to think about what the test is testing.


I have not used the WMW test in at least a decade. However, I regularly use the MW statistic to estimate a quantity closely related to the probabilistic index.

It can be generalized for right-censored data:


Good stuff. But the last paper omitted the key reference (E Gehan who developed the Gehan-Wilcoxon test for right-censored data) which makes it suspect.

We are well aware of Gehan’s paper but consider his method unusable. A colleague has been actively investigating the properties of the Zhang et al. method. Long story short, it should be taken seriously. Expect results to be presented this fall, if not sooner.

So that translates into not referencing Gehan? How does the new test differ from Gehan’s?

No, I didn’t say that…let’s pump the brakes here. The estimand is the same as Gehan’s, but the estimator is very different. The new estimator is based on the Kaplan-Meier estimator, while Gehan’s is not. I agree that the authors should have cited Gehan, which they failed to do. However, it’s a non sequitur to use that mistake to dismiss the entire work as “suspect”.

On another recent thread I gently noted that an excellent paper had misattributed the random sampling framework of Fisher 1922 to Neyman & Pearson 1928. I called it a “nitpick” because that is exactly what this type of blunder is (and it was a much bigger blunder than the one just discussed).

I understand the temptation to dismiss others’ work. Very few papers are worth our time, and we use heuristics to filter the literature. However, public criticisms of a paper should be justified, and on this forum I thought they usually were.
