Most reasonably hypothesised effects cannot be exactly zero?


#1

A few authors over the years have suggested that, in most cases, effects (and all effects are causal by definition) cannot be exactly zero, at least in fields like health and the social sciences. For example:

“All we know about the world teaches us that the effects of A and B are always different — in some decimal place — for any A and B. Thus asking “Are the effects different?” is foolish.”
(Tukey (1991); doi:10.1214/ss/1177011945)

“In any hypothesis testing situation, a reason can be found why a small difference might exist or a small effect may have taken place.”
(Nester (1996); doi:10.2307/2986064)

“Who cares about a point null that is never true?”
(Little RJ. Comment. The American Statistician. 2016;70(suppl))

“Also, remember that most effects cannot be zero (at least in social science and public health)”
(Gelman, Carlin (2017); doi:10.1080/01621459.2017.1311263)

“As is often the case, the null hypothesis that these physical changes should make absolutely zero difference to any downstream clinical outcomes seems farfetched. Thus, the sensible question to ask is “How large are the clinical differences observed and are they worth it?” — not “How surprising is the observed mean difference under a [spurious] null hypothesis?””
(Gelman, Carlin, Nallamothu (2018); http://www.stat.columbia.edu/~gelman/research/unpublished/Stents_submitted.pdf)

While this has obvious relevance for the usual conception of the null hypothesis, and this was the context for each of the above quotes, it may also influence the inferences researchers make - perhaps framing the question in terms of whether the difference is clinically meaningful, rather than asking whether an effect or difference exists at all, would change how results are sometimes viewed and interpreted?

Clearly, there would be hypothesised effects that really are exactly zero, meaning 0.00000…, or are close enough that they cannot be distinguished from zero. But these cases would not seem to be common outside of the physical sciences.

I’ve posed this topic to see if anyone:
a. disagrees
b. thinks that it doesn’t matter
c. has a different interpretation of the above statements, or
d. has a thought on how such a viewpoint might affect inferences
(or e. anything else)

thanks,
Tim


#2

Tim, under the ‘anything else’ category, I’d add only Paul Meehl’s famous quote [1]:

Since the null hypothesis is quasi-always false, tables summarizing research in terms of patterns of “significant differences” are little more than complex, causally uninterpretable outcomes of statistical power functions.

  1. Meehl, Paul E. “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology.” Journal of Consulting and Clinical Psychology 46 (1978): 806–34. http://www3.nd.edu/~ghaeffel/Meehl(1978).pdf

#3

The whole testing approach is not about estimation. Testing is a fall-back when estimation is infeasible, either because it’s not quite clear or not relevant what the estimates would mean (at least practically; one might think of statistics like R², Chi², F etc. that may be hard or impossible to interpret in any practically sensible or understandable way), or because the information in the data is insufficient to allow an estimation with useful accuracy and precision.

If we fall back on testing, the question cannot be whether or not an effect is precisely zero. That would just mean we treat a test like an approach to estimating the effect. We often do so anyway, and this is a major source of confusion, imho. The test - specifically the significance test [calculating a p-value according to RA Fisher] - is, instead, a procedure to judge the amount of information in the data at hand relative to a particular restriction in a particular (statistical) model. It was never, and should never be, the aim of the test to demonstrate an effect.

The test is not about the effect (which is, by the way, necessarily unknown to the test!) but about the data (more precisely, about the “relevant” information in the data, w.r.t. a restriction in a model). Thus, a “significant” test is not a demonstration of a “non-zero” effect. It is rather a demonstration that the data (the information provided by the data w.r.t. the restriction in the given model) is sufficient so that the effect is “visible with good confidence”. If the tested hypothesis is a “zero-difference” hypothesis, then a significant result means that we may place confidence in the interpretation of the sign of the effect.
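
To make that last point a bit more concrete, here is a quick simulation sketch in Python (the effect size, sample size and alpha are arbitrary numbers I picked for illustration, not from any real study). It checks how often a “significant” result gets the sign of the effect right:

    # Sketch (my own toy numbers): among "significant" results, how often
    # does the estimated sign agree with the true sign of the effect?
    import numpy as np

    rng = np.random.default_rng(1)
    mu_true, sigma, n, alpha = 0.2, 1.0, 20, 0.05   # arbitrary choices
    n_sim = 100_000

    samples = rng.normal(mu_true, sigma, size=(n_sim, n))
    means = samples.mean(axis=1)
    se = sigma / np.sqrt(n)          # known-sigma z-test, for simplicity
    z = means / se
    significant = np.abs(z) > 1.96

    print("share significant:", significant.mean())
    print("P(correct sign | significant):", (means[significant] > 0).mean())

In this particular setting only a minority of simulated studies reach significance, but among those that do, the sign of the estimate almost always matches the sign of the true effect.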


#4

thanks David - another rare example I can add to my collection.


#5

thanks Jochen. My perception is that testing and estimation have generally been seen as part of the same process, but then you say ‘we often do so anyway’. You have an interesting point of view and I probably agree with quite a bit of it, though it’s hard to tell for certain without seeing your philosophy/approach applied to a variety of real problems.

I suspect it may be too rigid and too abstract to be taken up by the majority of researchers, however. Though nothing is certain in the evolution of ideas!


#6

I don’t really see that this is “my philosophy” or “my approach”.

Testing is to find or judge the statistical significance of data under a given model. At no point is an estimation made. The significance is usually determined by comparing the maximum likelihood of the data under the full model to that under the restricted model. The coefficient values under which the likelihoods become maximal are called “maximum likelihood estimates”, MLEs (NB: the least-squares estimates, LSEs, are a special case of the MLEs under the Gaussian error model). These are called “estimates”, but they are not meant to be “good” or plausible values for the effects (coded by the coefficients). They are “just” those values for which the maximum of the likelihood of the observed data is obtained.
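
As a minimal sketch of that comparison (a toy normal-mean model with made-up data; sigma is assumed known and the “restriction” fixes the mean at zero):

    # Sketch (toy example of my own): likelihood-ratio comparison of a full
    # model (mean free) against a restricted model (mean fixed at 0),
    # assuming a normal model with known sigma = 1.
    import numpy as np
    from scipy import stats

    y = np.array([0.8, 1.3, -0.2, 0.9, 1.1])    # made-up data
    sigma = 1.0

    mle = y.mean()                               # maximiser of the likelihood, nothing more
    loglik_full = stats.norm.logpdf(y, loc=mle, scale=sigma).sum()
    loglik_restricted = stats.norm.logpdf(y, loc=0.0, scale=sigma).sum()

    lr = 2 * (loglik_full - loglik_restricted)   # likelihood-ratio statistic
    p = stats.chi2.sf(lr, df=1)                  # one restriction
    print(f"MLE = {mle:.2f}, LR = {lr:.2f}, p = {p:.3f}")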

On the other hand, estimation is to find “good” or plausible values of the coefficients, given the data and whatever else we know about the system. Here we look for coefficient values with high probability, given the available information, including the observed data. When the amount of information contributed by the observed data overwhelms the other aspects, the most probable values will be close to the MLEs. So as long as we can rely on large-sample approximations, MLEs are estimates in that sense. However, in my field of research I usually see sample sizes of 3-4, where we can be really wrong when we assume that an MLE would be a “good” or plausible estimate (as a value that would be considered probable, given the information we have).
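
For instance (the prior and the three observations below are arbitrary choices of mine, just to illustrate): with n = 3 the MLE is simply the sample mean, while the posterior under a not-unreasonable prior can point to a noticeably different “plausible” value.

    # Sketch (arbitrary prior and data of my own): with n = 3 the MLE and a
    # posterior mean under an informative prior can differ substantially.
    import numpy as np

    y = np.array([2.1, 3.4, 2.8])      # three observations
    sigma = 1.0                        # assumed known, for simplicity

    # prior: effect ~ Normal(0, 1), encoding "effects this large are unusual"
    prior_mean, prior_sd = 0.0, 1.0

    n = len(y)
    mle = y.mean()

    # conjugate normal-normal update
    post_prec = 1 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + y.sum() / sigma**2) / post_prec
    post_sd = post_prec**-0.5

    print(f"MLE (sample mean): {mle:.2f}")
    print(f"posterior mean:    {post_mean:.2f} (sd {post_sd:.2f})")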

Gauss knew that, and he interpreted his LSEs as estimates because they were seen as large-sample approximations. Bayes and Laplace noted that probability calculations up to their time were only about data, not about coefficients, although the scientific interest was usually in the coefficients (=estimation) and not in predicting data based on known coefficients. However, talking about estimates requires a “frame” specifying our knowledge about the system (today called “priors”), and there is no objective way to specify such a frame.

So that’s all old stuff and well-known, but heavily ignored after over a century of focussing on testing.


#7

Our experience certainly influences our point of view. Mine is in epidemiology and, as such, I have never seen testing that did not also focus on the estimates produced. But I can imagine areas of research where that would be appropriate, such as genetics and high-energy physics.

An additional thought occurred to me today. I’ve been finding and reading various articles, trying to understand the range of viewpoints expressed regarding ‘absence of evidence is not evidence of absence’, and in Ch.10 of Modern Epidemiology by Rothman, Greenland and Lash, in the section on ‘Evidence of Absence of Effect’, they use p-value functions, closely related to confidence intervals, to illustrate problems with significance tests. It seems that confidence intervals might be better interpreted if the idea that an effect might be absent is not considered a possibility.

In other words, we are looking for evidence that a meaningful effect exists, or, if there is sufficient sample size, that the effect is not large enough to be meaningful, with the same practical implications as an absent effect (ignoring, for the moment, potential subgroups for whom this might not be the case).
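
As a rough illustration of what a p-value function adds here (the observed difference, its standard error and the “meaningful” threshold below are made-up numbers, not taken from any real study):

    # Sketch (made-up numbers): a p-value function for a mean difference,
    # in the spirit of Rothman, Greenland & Lash. The 95% CI is the set of
    # hypothesised values with two-sided p > 0.05.
    import numpy as np
    from scipy import stats

    diff_hat, se = 1.8, 0.9       # observed difference and its standard error
    meaningful = 1.0              # smallest difference considered meaningful (assumed)

    grid = np.linspace(-2, 6, 401)                          # hypothesised true differences
    pvals = 2 * stats.norm.sf(np.abs(diff_hat - grid) / se)

    ci = grid[pvals > 0.05]
    print(f"95% CI approx: [{ci.min():.2f}, {ci.max():.2f}]")
    print(f"p at zero difference:        {2 * stats.norm.sf(abs(diff_hat) / se):.3f}")
    print(f"p at the 'meaningful' value: {2 * stats.norm.sf(abs(diff_hat - meaningful) / se):.3f}")

In this made-up example the interval excludes zero, yet the data remain quite compatible with differences well below the meaningful threshold - exactly the situation where “is there an effect?” and “is the effect meaningful?” come apart.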

In the end, it all comes down to what and how people think. And the arguments that statisticians have used to try to influence the way researchers make inferences, seen regularly now for more than 40 years, do not seem to have worked as well as we would have liked.


#8

And this is usually addressed using Neyman/Pearson tests (A/B tests, hypothesis tests). Notably, this is again not about estimation. It’s just to make a decision between two alternatives (like A: the effect is not yet relevant vs. B: the effect is at least relevant) with some predefined confidence, chosen based on some hopefully rational loss function.
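
One simple way to operationalise such a decision (the relevance threshold and the data below are my own made-up numbers, and this is only one of several possible decision rules):

    # Sketch (my own assumptions): deciding between "effect below the
    # relevance threshold" and "effect at least relevant", rather than
    # testing a point null of exactly zero.
    import numpy as np
    from scipy import stats

    delta = 0.5                    # smallest effect considered relevant (assumed)
    alpha = 0.05

    y = np.array([0.9, 1.4, 0.3, 1.1, 0.8, 1.2, 0.6, 1.0])   # made-up data
    n, mean, se = len(y), y.mean(), y.std(ddof=1) / np.sqrt(len(y))

    t = (mean - delta) / se
    p_relevant = stats.t.sf(t, df=n - 1)        # evidence against "mu <= delta"
    p_not_relevant = stats.t.cdf(t, df=n - 1)   # evidence against "mu >= delta"

    print(f"mean = {mean:.2f}")
    print(f"p(H0: mu <= {delta}) = {p_relevant:.3f}")
    print(f"p(H0: mu >= {delta}) = {p_not_relevant:.3f}")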

If one were to make that an estimation problem, confidence is the wrong concept. That is only possible by adjusting a (hopefully rational) prior to the current data. Only as a large-sample approximation is there usually no relevant difference between a confidence and a credible interval (provided the prior is not extreme).
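
A quick numerical check of that last point (toy numbers of my own: a normal model with known sigma and a Normal(0, 1) prior):

    # Sketch (toy numbers): 95% confidence vs. 95% credible interval under a
    # Normal(0, 1) prior (mean 0) and a normal model with known sigma = 1.
    import numpy as np

    def intervals(ybar, n, sigma=1.0, prior_sd=1.0):
        se = sigma / np.sqrt(n)
        ci = (ybar - 1.96 * se, ybar + 1.96 * se)     # confidence interval
        post_prec = 1 / prior_sd**2 + n / sigma**2    # conjugate update, prior mean 0
        post_mean = (n * ybar / sigma**2) / post_prec
        post_sd = post_prec**-0.5
        cred = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
        return ci, cred

    for n in (4, 1000):
        ci, cred = intervals(ybar=0.5, n=n)
        print(f"n = {n:4d}  CI = ({ci[0]: .3f}, {ci[1]: .3f})  "
              f"credible = ({cred[0]: .3f}, {cred[1]: .3f})")

With n = 4 the two intervals differ visibly; with n = 1000 they essentially coincide.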

Agreed. But what can we do? I don’t think that it’s good as it is. There is too much confusion, too much mess.