Is it legitimate to perform a statistical test on test statistics such as t-values or z-scores?


I wonder whether it is valid to perform a statistical test on test statistics such as t-values or z-scores. I don’t have a specific example, but imagine there are two groups, A and B, each containing 50 individuals, and each individual has 100 measurements. For every individual a t-test is conducted, and the t-value itself is used to represent the “effect size” or standardized effect. One then wants to know whether the effect size in group A is significantly larger than in group B, so a Mann-Whitney test is conducted on these two groups of t-values. Or, in another case, one just wants to test whether the mean of all t-values in group A is significantly larger than 0, so a t-test is conducted on those t-values.

Are these procedures valid? It sounds weird to me to run tests on test statistics, but I don’t have a theoretical backup. The t-value seems to be used as the standardized effect for each individual because there are multiple measurements per individual. The t-value is then taken to be a “better choice” than the mean for the second-step test between groups because it also accounts for variance; in other words, the t-value is viewed as a variance-penalized effect. However, if the question is whether the measurement differs between the two groups, the input should still be the values of the variable of interest, and a mixed-effects or hierarchical model would be more appropriate than a two-step test. Am I right? Thanks a lot.
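To make the question concrete, here is a minimal simulation of the two-step procedure described above. All data, effect sizes, and names are hypothetical illustrations, not taken from any real study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: 50 individuals per group, 100 measurements each.
# Group A is given a small true per-measurement shift; group B has none.
group_a = rng.normal(loc=0.2, scale=1.0, size=(50, 100))
group_b = rng.normal(loc=0.0, scale=1.0, size=(50, 100))

def per_individual_t(x):
    # Step 1: one-sample t statistic from each individual's 100 measurements
    return stats.ttest_1samp(x, popmean=0, axis=1).statistic

t_a = per_individual_t(group_a)  # 50 t-values for group A
t_b = per_individual_t(group_b)  # 50 t-values for group B

# Step 2a: Mann-Whitney test comparing the two groups of t-values
u_stat, u_p = stats.mannwhitneyu(t_a, t_b, alternative="greater")

# Step 2b: one-sample t-test of whether group A's mean t-value exceeds 0
t_stat, t_p = stats.ttest_1samp(t_a, popmean=0, alternative="greater")
```

This is exactly the "test on test statistics" being questioned: the raw measurements are discarded after step 1, and the second test sees only the 50 t-values per group.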


I don’t think it is valid to do so; based on your hypothetical example, I think some kind of multilevel or repeated-measures model would be more appropriate. Such models will also “consider variance”.
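For what the multilevel alternative might look like, here is a hedged sketch using a random-intercept model in `statsmodels` on simulated long-format data (the effect sizes and variable names are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical long-format data: 100 measurements for each of 100
# individuals (50 per group), with a random intercept per individual.
n_ind, n_meas = 100, 100
ind = np.repeat(np.arange(n_ind), n_meas)
group = np.where(ind < 50, "A", "B")
ind_effect = rng.normal(0, 0.5, size=n_ind)[ind]  # between-individual variation
y = 0.3 * (group == "A") + ind_effect + rng.normal(0, 1.0, size=n_ind * n_meas)

df = pd.DataFrame({"y": y, "group": group, "individual": ind})

# Random-intercept model: measurements nested within individuals.
# The fixed effect of group is tested on the ORIGINAL scale of y.
model = smf.mixedlm("y ~ group", data=df, groups=df["individual"])
result = model.fit()
print(result.summary())
```

Unlike the two-step procedure, this uses all raw measurements at once and separates within-individual from between-individual variance rather than folding both into a per-individual t-value.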

While not quite the same issue, it seems to me (maybe only slightly) related to what this paper[1] discusses. Gelman & Stern explicitly address results in which a binary “statistically significant” vs. “not significant” decision is made, and the problems with comparing studies based on that determination.

[1]: Gelman, Andrew, and Hal Stern. “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” The American Statistician 60, no. 4 (2006): 328–331.


I don’t think the procedure will meet the needed statistical assumptions; e.g., the t summaries may not have the same variance. But more importantly, I don’t think the result is interpretable. You could find a difference that is due not to one mean being larger but to one standard deviation being smaller.

It is almost always best to analyze the raw data, taking care to account for the within-individual correlation structure. If you don’t want to do that, a summary-statistic approach can be used, as discussed in BBR’s longitudinal analysis chapter. The summary statistics usually need to be on the original scale, e.g., means, medians, inter-quartile ranges, SDs, or slopes. The summary statistic reduces each individual to one number (occasionally you reduce to more than one number and do a multivariate test instead).
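The summary-statistic approach on the original scale might be sketched as follows (hypothetical data, purely illustrative; each individual is reduced to a per-individual mean rather than a t-value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical data as before: each row is one individual's measurements
group_a = rng.normal(0.3, 1.0, size=(50, 100))
group_b = rng.normal(0.0, 1.0, size=(50, 100))

# Reduce each individual to ONE summary on the ORIGINAL scale of y
mean_a = group_a.mean(axis=1)
mean_b = group_b.mean(axis=1)

# Compare the per-individual summaries between groups
w_stat, w_p = stats.mannwhitneyu(mean_a, mean_b, alternative="two-sided")
```

The second-stage test is structurally the same as before, but its inputs retain the units and interpretation of the measurement itself.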


Thank you @cwatson and @f2harrell for the confirmation.

@f2harrell could you please elaborate a bit on this point? I wonder why the summary statistics usually need to be on the original scale rather than a variance-normalized “effect” such as a z-score. I am on the side of using the original scale, since it preserves the unit or physical meaning of the measurement and is easier to interpret and communicate. However, I often see applied work use the z-score, t-value, or Cohen’s d to represent the effect even when the measurements share a unit with a clear physical meaning, and it is hard to say they are all wrong.

My attempt to rationalize this approach: the variance of each individual’s measurements is not always the same, and from an engineering perspective the variance somehow reflects the SNR of the measurement, so normalizing the effect (e.g., the mean) by the variance is a way to “calibrate” the SNR differences among individuals, especially when there is no way to avoid them through experimental design or hardware calibration. For instance, when comparing 10 ± 3 (units) and 2 ± 0.5 (units), I understand that statistically the two may not be significantly different, but practically one may prefer the latter since it has a larger z-score, i.e., a larger effect after penalizing by variance. Should we conclude that there is no rule for choosing between the mean and the z-score, and that it should be decided by utility? Thank you very much.
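The numerical comparison above works out as follows (a crude mean/SD ratio used purely to illustrate the tension being described, not a recommended analysis):

```python
# Mean / SD as a "variance-penalized effect" for the two hypothetical
# measurements quoted above: 10 +/- 3 units vs. 2 +/- 0.5 units.
z1 = 10 / 3    # larger raw mean, larger SD
z2 = 2 / 0.5   # smaller raw mean, much smaller SD

# The second measurement has the smaller effect on the original scale
# but the larger standardized effect; which one "wins" depends entirely
# on whether you normalize by the dispersion.
print(z1, z2)
```

This is exactly the ambiguity in the question: the two scales can rank the same pair of results in opposite orders.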

I just can’t interpret z-scores or Cohen’s d, and I don’t think many people can. z-scores are based on the premise that the SD is the dispersion measure to use, and it seldom is, due to asymmetry in the raw data.


You might want to follow up on the references in this thread, especially those by Senn and @Sander_Greenland.

The essential critique is that the variance is largely affected by study design. In essence, you are “standardizing” by something that is variable, not constant. That is how I understood Greenland’s numerous articles on standardized effect sizes.