Most reasonably hypothesised effects cannot be exactly zero?

A few authors over the years have suggested that, in most cases, effects (and all effects are causal by definition) cannot be exactly zero, at least in fields like health and the social sciences. For example:

“All we know about the world teaches us that the effects of A and B are always different — in some decimal place — for any A and B. Thus asking “Are the effects different?” is foolish.”
(Tukey (1991); doi:10.1214/ss/1177011945)

“In any hypothesis testing situation, a reason can be found why a small difference might exist or a small effect may have taken place.”
(Nester (1996); doi:10.2307/2986064)

“Who cares about a point null that is never true?”
(Little RJ. Comment. The American Statistician. 2016;70(suppl))

“Also, remember that most effects cannot be zero (at least in social science and public health)”
(Gelman, Carlin (2017); doi:10.1080/01621459.2017.1311263)

“As is often the case, the null hypothesis that these physical changes should make absolutely zero difference to any downstream clinical outcomes seems farfetched. Thus, the sensible question to ask is “How large are the clinical differences observed and are they worth it?” — not “How surprising is the observed mean difference under a [spurious] null hypothesis?””
(Gelman, Carlin, Nallamothu (2018);

While this has obvious relevance for the usual conception of the null hypothesis, and this was the context for each of the above quotes, it may have some additional influence on the inferences researchers make - perhaps framing the question in terms of whether the difference is clinically meaningful, rather than asking if an effect or difference exists, would change how results are sometimes viewed and interpreted?

Clearly, there would be hypothesised effects that really are exactly zero, meaning 0.00000…, or are close enough that they cannot be distinguished from zero. But these cases would not seem to be common outside of the physical sciences.

I’ve posed this topic to see if anyone:
a. disagrees
b. thinks that it doesn’t matter
c. has a different interpretation of the above statements, or
d. has a thought on how such a viewpoint might affect inferences
(or e. anything else)



Tim, under the ‘anything else’ category, I’d add only Paul Meehl’s famous quote [1]:

Since the null hypothesis is quasi-always false, tables summarizing research in terms of patterns of “significant differences” are little more than complex, causally uninterpretable outcomes of statistical power functions.

  1. Meehl, Paul E. “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology.” Journal of Consulting and Clinical Psychology 46 (1978): 806–34.

The whole testing approach is not about estimation. Testing is a fall-back if estimation is unfeasible, either because it’s not quite clear or not relevant what the estimates would mean (at least practically; one might think of statistics like R², Chi², F etc. that may be hard or impossible to interpret in any practically sensible or understandable way), or because the information in the data is insufficient to allow an estimation with a useful accuracy and precision.

If we fall back on testing, the question cannot be whether or not an effect is precisely zero. This would just mean that we treat a test like an approach to estimate the effect. We often do so anyway, and this is a major source of confusion, imho. The test - specifically the significance test [calculating a p-value according to RA Fisher] - is, instead, a procedure to judge the amount of information in the data at hand relative to a particular restriction in a particular (statistical) model. It was never, and should never be, the aim of the test to demonstrate an effect. The test is not about the effect (which is, by the way, necessarily unknown to the test!) but about the data (more precisely about the “relevant” information in the data, w.r.t. a restriction in a model). Thus, a “significant” test is not a demonstration of a “non-zero” effect. It is rather a demonstration that the data (the information provided by the data w.r.t. the restriction in the given model) is sufficient so that the effect is “visible with good confidence”. If the tested hypothesis is a “zero-difference” hypothesis, then a significant result means that we may place confidence in the interpretation of the sign of the effect.

thanks David - another rare example I can add to my collection.

thanks Jochen. My perception is that testing and estimation have generally been seen as part of the same process, but then you say ‘we often do so anyway’. You have an interesting point of view and I probably agree with quite a bit of it, though it’s hard to tell for certain without seeing your philosophy/approach applied to a variety of real problems.

I suspect it may be too rigid and too abstract to be taken up by the majority of researchers, however. Though nothing is certain in the evolution of ideas!

I don’t really see that this is “my philosophy” or “my approach”.

Testing is to find or judge the statistical significance of data under a given model. At no point is an estimate made. The significance is usually determined by comparing the maximum likelihood of the data under the full model to that under the restricted model. The coefficient values under which the likelihoods become maximal are called “maximum likelihood estimates”, MLEs (NB: the least-squares estimates, LSEs, are a special case of the MLEs under the Gaussian error model). These are called “estimates”, but they are not meant to be “good” or plausible values for the effects (coded by the coefficients). They are “just” those values for which the maximum of the likelihood of the observed data is obtained.
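
As an illustration of the comparison described above, here is a minimal Python sketch. It assumes a one-sample Gaussian model in which the restriction fixes the mean at zero, and it uses the large-sample chi-square approximation (1 degree of freedom) for the p-value, which is itself questionable for very small samples. The function names and the data are hypothetical.

```python
import math

def gaussian_max_loglik(residual_ss, n):
    """Maximised Gaussian log-likelihood, with the variance set to its MLE (RSS / n)."""
    sigma2_hat = residual_ss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2_hat) + 1)

def likelihood_ratio_test(y):
    """Compare the full model (free mean) with the restricted model (mean = 0)."""
    n = len(y)
    mean = sum(y) / n
    rss_full = sum((v - mean) ** 2 for v in y)   # full model: mean estimated from the data
    rss_restricted = sum(v ** 2 for v in y)      # restricted model: mean fixed at zero
    stat = 2 * (gaussian_max_loglik(rss_full, n)
                - gaussian_max_loglik(rss_restricted, n))
    stat = max(stat, 0.0)                        # guard against rounding below zero
    # Upper tail of a chi-square distribution with 1 df (large-sample approximation).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical data, chosen only to show the call.
stat, p_value = likelihood_ratio_test([0.8, 1.3, 0.4, 1.1])
print(stat, p_value)
```

Note that the sample mean appears only as the value that maximises the likelihood under the full model; the output is a statement about the data relative to the restriction, not an assessment of which effect sizes are plausible.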

On the other hand, estimation is to find “good” or plausible values of the coefficients, given the data and whatever else we know about the system. Here we look for coefficient values with high probability, given the available information, including the observed data. When the amount of information contributed by the observed data overwhelms the other aspects, the most probable values will be close to the MLEs. So as long as we can rely on large-sample approximations, MLEs are estimates in that sense. However, in my field of research I usually see sample sizes of 3-4, where we can be really wrong when we assume that an MLE would indicate a “good” or plausible estimate (as a value that would be considered probable, given the information we have).

Gauss knew that, and he interpreted his LSEs as estimates because they were seen as large-sample approximations. Bayes and Laplace noted that probability calculations up to their time were only about data, not about coefficients, although the scientific interest was usually in the coefficients (=estimation) and not in predicting data based on known coefficients. However, talking about estimates requires a “frame” specifying our knowledge about the system (today called “priors”), and there is no objective way to specify such a frame.

So that’s all old stuff and well-known, but heavily ignored after over a century of focussing on testing.

Our experience certainly influences our point of view. Mine is in epidemiology and, as such, I have never seen testing that did not also focus on the estimates produced. But I can imagine areas of research where that would be appropriate, such as genetics and high-energy physics.

An additional thought occurred to me today. I’ve been finding and reading various articles, trying to understand the range of viewpoints expressed regarding ‘absence of evidence is not evidence of absence’, and in Ch.10 of Modern Epidemiology by Rothman, Greenland and Lash, in the section on ‘Evidence of Absence of Effect’, they use p-value functions, closely related to confidence intervals, to illustrate problems with significance tests. It seems that confidence intervals might be better interpreted if the idea that an effect might be absent is not considered a possibility.

In other words, we are looking for evidence that a meaningful effect exists, or, if there is sufficient sample size, that the effect is not large enough to be meaningful, with the same implications of an effect that is absent (ignoring, for the moment, potential subgroups for whom this might not be the case).

In the end, it all comes down to what and how people think. And the arguments that statisticians have used to try to influence the way researchers make inferences, seen regularly now for more than 40 years, do not seem to have worked as well as we would have liked.

And this is usually addressed using Neyman/Pearson tests (A/B tests, hypothesis tests). Notably, this is again not about estimation. It’s just to make a decision between two alternatives (like A: effect is not yet relevant vs. B: effect is at least relevant) with some predefined confidence that is chosen based on some hopefully rational loss function.

If one would make that an estimation problem, confidence is the wrong concept. That’s only possible by adjusting a (hopefully rational) prior to the current data. Only for a large sample approximation there is usually not any relevant difference between a confidence and a credible interval (if the prior is not extreme).
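
The large-sample agreement between the two kinds of interval can be sketched numerically. The Python example below assumes the simplest conjugate setting: a Gaussian mean with known standard deviation and a Gaussian prior. The prior (mean 0, SD 1), the observed mean and the sample sizes are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% normal quantile

def confidence_interval(ybar, sigma, n):
    """95% confidence interval for a Gaussian mean with known sigma."""
    half = Z95 * sigma / n ** 0.5
    return ybar - half, ybar + half

def credible_interval(ybar, sigma, n, prior_mean, prior_sd):
    """95% credible interval under a Gaussian prior (conjugate update)."""
    data_precision = n / sigma ** 2
    prior_precision = 1 / prior_sd ** 2
    post_precision = data_precision + prior_precision
    post_mean = (data_precision * ybar + prior_precision * prior_mean) / post_precision
    half = Z95 * post_precision ** -0.5
    return post_mean - half, post_mean + half

# Hypothetical observed mean of 1.0 with sigma = 1, at a small and a large n.
for n in (4, 4000):
    print(n, confidence_interval(1.0, 1.0, n),
          credible_interval(1.0, 1.0, n, prior_mean=0.0, prior_sd=1.0))
```

With n = 4 the prior visibly pulls the credible interval towards zero, while with n = 4000 the two intervals nearly coincide, provided the prior is not extreme.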

Agreed. But what can we do? I don’t think that it’s good as it is. There is too much confusion, too much mess.

It is easy for those of us who have never had to grapple with the logical issues discussed in a mathematical analysis class to get totally confused on this point. I am fortunate enough to have mentors with Ph.D.-level mathematics education to help me get over the hurdle.

These 2 papers also helped me understand the issues much better.

Abstractly, the cardinality of the reals is so vast, that the probability of picking any particular real number is zero. That also makes it difficult to discuss issues of strict equality of 2 real numbers, except in certain cases.

This is all second nature to mathematicians, but it is too easy for the rest of us to treat real numbers as rationals, when we should not.

Realistic nulls are bounds around 0. That bound is context-sensitive. If you look at how equivalence testing is done, you must do two one sided tests (TOST). In this scenario, the null is formally expressed as a bound.
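
A minimal sketch of the TOST idea in Python follows. It assumes the estimate is approximately normal with a known standard error (a z-approximation rather than the t-based version usually used in practice), and the function name, estimate and equivalence bounds are hypothetical.

```python
from statistics import NormalDist

def tost_z(estimate, std_error, lower, upper):
    """Two one-sided z-tests against the equivalence bounds (lower, upper).

    Returns the larger of the two one-sided p-values; a small value supports
    the claim that the effect lies inside the bounds.
    """
    z = NormalDist()
    p_lower = 1 - z.cdf((estimate - lower) / std_error)  # H0: effect <= lower
    p_upper = z.cdf((estimate - upper) / std_error)      # H0: effect >= upper
    return max(p_lower, p_upper)

# Hypothetical example: estimate 0.1, standard error 0.15, bounds of +/- 0.5.
print(tost_z(0.1, 0.15, -0.5, 0.5))
```

The context-sensitive part is entirely in the choice of `lower` and `upper`; the arithmetic is the same whatever the bounds are.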

Good explanation of TOST (commercial site):

In the more common scenario, point nulls are used as approximations of what a skeptic might say the true effect might be (added after Prof. Harrell’s response). If the effect we are looking for is reasonably large, we should be able to detect it in “small” sample studies. How “small” or “large” depends upon what sample size we can collect.

For a realistic look at a field where the point null should be 0, look up the work done in parapsychology by Jessica Utts. IIRC, she was former president of the American Statistical Association not too long ago.

Specifically, her article for Statistical Science on Replication and Meta analysis in Parapsychology is interesting: (PDF in link).

Either there is a small element of truth to the claim that the effects studied by parapsychology exist, or something is wrong with our statistical tests, our experiments, or our understanding of the assumptions.

I wouldn’t say that point nulls are used as approximations. They represent exactly a zero effect, and with a large enough sample size you can have power to detect any effect. Hence if the intervention is actually experienced by the subjects randomized to it in any way (biological, psychological, Hawthorne effect, etc.), it is logical to assume H_{0} is false and save time by not collecting data. As stated in Cohen’s 1994 “The Earth Is Round; P < 0.05”,

The nil hypothesis is always false. Tukey (1991) wrote that “It is foolish to ask ‘Are the effects of A and B different?’ They are always different—for some decimal place”. Schmidt (1992) … reminded researchers that, given the fact that the nil hypothesis is always false, the rate of Type I errors is 0%, not 5%, and that only Type II errors can be made.
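
The claim that a large enough sample gives power to detect any non-zero effect can be checked with a small calculation. The Python sketch below assumes a two-sided, two-sample z-test with equal group sizes and a standardised effect size (difference divided by the common SD); the function name and the effect size of 0.01 are hypothetical.

```python
from statistics import NormalDist

def two_sided_z_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size is the standardised mean difference (an assumption of this sketch).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the z-statistic under the true effect.
    shift = effect_size * (n_per_group / 2) ** 0.5
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# A tiny but non-zero effect: power rises towards 1 as n grows.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(n, round(two_sided_z_power(0.01, n), 3))
```

For an effect of exactly zero the function returns alpha, which is the only case in which a Type I error is possible.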

I agree with everything you said/quoted from a declarative point of view; in the state of nature, the null is always false.

But from a decision perspective, those real differences might not be actionable. Agents do experiments to see if there are better decision rules than the ones they currently follow.

If an agent’s decision remains unchanged in spite of knowledge of a small departure from the simple null model, is it unreasonable to say “the agent acts as if H_{0} is true”?

It seems reasonable to me. That is all Neyman-Pearson theory promises – that we won’t too often go wrong, given our ignorance.

But I can see it is easy to confuse the two points of view. I had them confused for longer than I would care to admit.

Your post on communicating frequentist results, as well as that thread, was helpful in that I have a better idea of how to communicate with others who might not have studied this topic as deeply as participants here.

Thanks. I agree with your decision perspective. I just think that any method that lives by point hypotheses (which in the Bayes world I would never even entertain) will die by point hypotheses. You can’t have it both ways. If you want to abandon point hypotheses for interval hypotheses, or use second-generation p-values, that would improve things IMHO.


I like the perspective of 2nd generation p values (or TOST, etc) as it shifts focus on defining clinically important differences.

I think a lot of the use of point hypothesis in biomedicine pushes us away from thinking and defining meaningful clinical differences in various topics, because that is a difficult discussion (how do you decide when a value is trivial, if it has the potential to affect patient outcomes?). But having these discussions will be better in the long run.

It also fits better with how most physicians think; we intuitively understand there is some degree of measurement error and triviality that is equivalent to ineffective treatment (i.e. effects do not need to be exactly zero).

And with the Bayesian approach, because there is no such thing as multiplicities in most situations, clinical researchers can see and interpret an array of probabilities for one RCT, e.g.

  • P(OR < 1) : probability that the treatment B : A odds ratio < 1 (i.e., any efficacy)
  • P(OR < 0.975): prob. of non-trivial efficacy
  • P(OR < 0.95)
  • P(OR < 0.9)
  • P(OR < 0.8)
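
An array of probabilities like the one above could be computed as in the following Python sketch. It assumes the posterior for the log odds ratio is approximately normal; the posterior mean (log 0.85) and SD (0.10) are hypothetical numbers, not results from any trial, and in practice one would more likely use posterior draws from a fitted model.

```python
import math
from statistics import NormalDist

def prob_or_below(cutoffs, post_mean_log_or, post_sd_log_or):
    """P(OR < cutoff) for each cutoff, under a normal posterior for log(OR)."""
    posterior = NormalDist(post_mean_log_or, post_sd_log_or)
    return {c: posterior.cdf(math.log(c)) for c in cutoffs}

# Hypothetical posterior: centred on OR = 0.85 with SD 0.10 on the log scale.
probs = prob_or_below([1, 0.975, 0.95, 0.9, 0.8], math.log(0.85), 0.10)
for cutoff, p in probs.items():
    print(f"P(OR < {cutoff}) = {p:.3f}")
```

All of these probabilities come from the same posterior, so reporting several cutoffs at once does not require any adjustment.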