That’s got to be the word of the day. It feels like they have been saying these things forever; it’s so perplexing to me, like one of those mad films where everything is repeated anew…
Thanks, Frank, for digging in & offering these perspectives!
I always learn something from your posts on statistical issues.
Question regarding multiplicity adjustments – in what scenario would you insist that they be done from a frequentist POV?
If I understand this (and some of your prior writings on the topic) correctly, it seems like they would be mandatory for post hoc hypotheses formed after data collection, but not necessary for a priori specified hypotheses.
Would it also be accurate to assume that in the circumstances where multiplicity adjustments are made to hypothesis tests, the corresponding confidence/compatibility intervals should also be adjusted?
Thanks for this helpful commentary!
It seems confusing how the guidelines create somewhat of a dichotomy of when to use vs. not use p-values (i.e., use them when adjusted for multiplicity/false discovery rate but not when they aren’t; in hierarchical testing, use them while results remain significant but not thereafter).
While I understand the much-needed movement away from p-values as an arbitrary cutoff for statistical significance, I’m not so sure that creating “rules” for when to use and not use them, as the guidelines do, will solve the problem. Is there not merit in increased attention to how they are taught, used, and interpreted in conjunction with CIs, during both study design and analysis? To me, it seems like the effort of identifying situations where one should (or should not) report p-values detracts from the overall need for a better understanding of what a p-value is and how it should be considered in medical decision making.
Worth emphasising: this recent paper distinguishes between observational data and clinical trials: “There is still a place for significance testing in clinical trials”
“the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse” etc.
P values in clinical trials can always be justified (and calculated) by permutation tests based upon the design. Wasn’t that Fisher’s key justification for the randomization procedure?
The same cannot be said for observational studies. I’m not sure how big of an error this is, however. At least in simulations, the t-test results generally match the permutation test results.
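As a rough illustration of what I mean (a sketch I put together, not code from any of the papers cited), here is a comparison of a parametric t-test p-value with a design-based permutation p-value; the arm sizes, effect size, and number of re-randomizations are arbitrary choices:

```python
# Hedged sketch: parametric t-test p-value vs. a permutation p-value
# obtained by re-randomizing the group labels. All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(0.5, 1.0, size=30)   # hypothetical treated arm
control = rng.normal(0.0, 1.0, size=30)   # hypothetical control arm

# Parametric (approximate) test
t_res = stats.ttest_ind(treated, control)

# Design-based permutation test: reshuffle the labels the randomization
# could have produced and recompute the test statistic each time
pooled = np.concatenate([treated, control])
n_treated = len(treated)
obs_diff = treated.mean() - control.mean()
perm_diffs = []
for _ in range(10_000):
    perm = rng.permutation(pooled)
    perm_diffs.append(perm[:n_treated].mean() - perm[n_treated:].mean())
perm_p = np.mean(np.abs(perm_diffs) >= abs(obs_diff))

print(f"t-test p = {t_res.pvalue:.4f}, permutation p = {perm_p:.4f}")
```

In simulations like this the two p-values are typically very close, which is exactly the point the quote below turns on its head.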
Addendum: I re-read the first paper I posted by Berger, Lunneborg, Ernst, and Levene who argue that RCTs should be examined using permutation tests. I find their reasons for permutation techniques (as well as approximate permutation methods) in this context unassailable and logically correct.
If credibility for the parametric test derives from assurances that its p-value will likely be close to the corresponding exact one, then this is tantamount to an admission that the exact test is the gold standard (or, perhaps, the platinum standard). Approximate tests cannot be any more exact than the exact tests they are trying to approximate and, as approximations to the exact tests, are correct only to the extent that they agree with the exact test. A “heads I win, tails you lose” situation then arises, because if the parametric and exact tests lead to essentially the same inference, then this is as much an argument in favor of the exact test as it is for the parametric test, and there is no benefit to using the parametric test. If they do not agree, then the exact test needs to be used.
It seems to me that these guidelines get into all sorts of problems and inconsistencies that are fundamental issues related to the frequentist approach.
Correcting for multiple testing has never made any sense to me and adjusting confidence intervals for the number of tests seems equally arbitrary and meaningless.
It has been said many times before, but if one wishes to frame a question in the framework of null hypothesis testing, the primary question of interest is “what is the probability that the null is true given the data?” @f2harrell I know you are no fan of thinking in this way because “apart from with homeopathy the null hypothesis is never true”, but I think it a useful way of framing a question.
Approximate Bayes approaches applied to a specified prior at least enable an approximate probabilistic answer to that question, and in my experience can be understood by the non-specialist statistician (at least it makes sense). Wakefield’s Bayesian false discovery probability is easy to calculate from standard frequentist statistics. Such an approach helps with multiple tests and subgroup analyses (by definition, the prior for these hypotheses has to be smaller than that for the primary endpoint, for which there ought to be equipoise). This approach also helps one understand why a p of (say) 0.02 is less likely to replicate if produced by a small study than if produced by a large study.
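To show what I mean by “easy to calculate from standard frequentist statistics”, here is a hedged sketch of a Wakefield-style calculation; the effect estimate, standard error, prior SD, and prior null probabilities are all made-up numbers for illustration:

```python
# Hedged sketch of Wakefield's approximate Bayes factor and Bayesian false
# discovery probability (BFDP), computed from standard frequentist output
# (estimate and standard error). All inputs below are illustrative.
import numpy as np

def bfdp(theta_hat, se, prior_sd, prior_prob_null):
    """Approximate BFDP in the spirit of Wakefield (2007)."""
    V, W = se**2, prior_sd**2
    z = theta_hat / se
    # Approximate Bayes factor for H0 vs H1 (normal prior on the effect under H1)
    abf = np.sqrt((V + W) / V) * np.exp(-z**2 * W / (2 * (V + W)))
    prior_odds_null = prior_prob_null / (1 - prior_prob_null)
    posterior_odds_null = abf * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)

# Same hypothetical frequentist result (estimate 0.60, SE 0.26, p ~ 0.02),
# but a more sceptical prior for a subgroup hypothesis than for the primary endpoint:
print(bfdp(0.60, 0.26, prior_sd=0.5, prior_prob_null=0.5))   # primary endpoint, ~0.21
print(bfdp(0.60, 0.26, prior_sd=0.5, prior_prob_null=0.9))   # subgroup claim, ~0.70
```

The same frequentist summary gives a very different posterior probability of the null once the more sceptical subgroup prior is used, which is the point about subgroup analyses above.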
I would say that if doing frequentist analysis and the questions are not marginal, then multiplicity adjustment is called for, e.g. when making statements such as “there exists an endpoint affected by this treatment” or “there exists a subgroup for which this treatment is effective”.
I’m less sure that multiplicity adjustments should apply equally to compatibility/confidence intervals, especially when users are not falling into the trap of just caring whether or not the interval includes the null.
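For a concrete picture of the kind of adjustment meant for a “there exists an endpoint affected by this treatment” claim, here is a brief sketch (my illustration with made-up p-values, using the Holm method via statsmodels; not something prescribed by the guidelines):

```python
# Hedged sketch: familywise adjustment across several hypothetical endpoints
from statsmodels.stats.multitest import multipletests

endpoint_pvals = [0.012, 0.034, 0.21, 0.47]   # made-up endpoint p-values
reject, adj_pvals, _, _ = multipletests(endpoint_pvals, alpha=0.05, method="holm")
print(adj_pvals)   # Holm-adjusted p-values controlling the familywise error rate
```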
Stephen Senn has a great talk titled ‘70 Years Old and Still Here: The Randomized Clinical Trial and its Critics’. This talk should have much more than 99 views! https://www.youtube.com/watch?v=uTrVhIwy8e8
I’ll definitely watch this later. I always learn something from Stephen Senn, even when I suspect there is room for educated disagreement.
Suffice it to say that the misunderstanding of randomization from a decision theory POV, along with a categorical interpretation of “evidence hierarchies”, has, IMO, done more harm than good for the typical consumer of the research literature, to put it mildly.
I’d hope that so-called meta-science can start looking at things from the point of view of statistical decision theory, and we can have more constructive debates about what decisions are supported by evidence.
Addendum: Everyone should watch that Senn video on randomization! Still, I think he picked on weak critics of randomization.
But perhaps that deserves a separate topic.
Have you any references for tougher critics of randomization? Please share as I’d like to read their arguments.
It is more of a normative/theoretical issue than an applied/practical one. A Bayesian actor does not need randomization from a strict decision theory POV if the experiment is for his/her own learning.
That doesn’t mean it is wrong to randomize, only that the experiment will be less efficient than a Bayesian optimal one. Frequentists will counter that the “optimal” experiment may not be derivable from our current knowledge.
And there are a number of practical Bayesians who defend the use of randomization as a measure to reduce the effect of the prior on the posterior – a type of robustness measure, in case the prior was in error somehow.
Senn has written “Block what you can, randomize what you cannot.” He also is not a fan of minimization, to put it mildly.
Randomization has more justification when the experiment is designed to persuade a skeptical decision maker. (Although there are some strong arguments that it isn’t needed even in this scenario). At least in human experimentation, it is a credible way to control bias.
I’ve studied this issue because, at least in my field, neither I nor my colleagues will be able to run studies the size of even a modest pharmaceutical trial. A larger observational study is much more likely.
Randomization in small samples (i.e., fewer than 200 per arm) leads to excess heterogeneity, making it harder to extract information from the few repeatable small-sample experiments and less likely that they will be included in any research syntheses.
In terms of “evidence hierarchies” there is a common overvaluation of randomization, and undervaluation of observational evidence. There have been very interesting, Bayesian decision theory derived clinical trials that have been conducted without randomization.
There is also a design known as minimization, with algorithms derived by Taves, Pocock, and others in the 1970s, but fewer than 2% of clinical trials use minimization.
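For readers unfamiliar with it, here is a rough sketch of Taves-style deterministic minimization (my own simplification, not code from any actual trial system); the prognostic factors and levels are hypothetical, and Pocock–Simon variants add a random element to the assignment:

```python
# Hedged sketch of Taves-style minimization: each new patient is assigned to
# the arm with the smaller running count of his/her prognostic factor levels.
from collections import defaultdict
import random

ARMS = ("A", "B")
counts = {arm: defaultdict(int) for arm in ARMS}   # (factor, level) -> count per arm

def assign(patient_factors):
    """patient_factors: e.g. {'sex': 'F', 'age_group': '>=65'} (hypothetical factors)."""
    def imbalance(arm):
        return sum(counts[arm][(f, lvl)] for f, lvl in patient_factors.items())
    scores = {arm: imbalance(arm) for arm in ARMS}
    best = min(scores.values())
    arm = random.choice([a for a in ARMS if scores[a] == best])  # random tie-break only
    for f, lvl in patient_factors.items():
        counts[arm][(f, lvl)] += 1
    return arm

print(assign({"sex": "F", "age_group": ">=65"}))
print(assign({"sex": "F", "age_group": ">=65"}))  # a similar patient goes to the other arm
```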
This ultimately comes down to the economic value of the information, which is seldom formally accounted for in a research program at an aggregate/policy level.
That is an informal write-up ATM. I think this is best left to a separate thread that I might post over the weekend.
I would also be interested in your take on https://www.ncbi.nlm.nih.gov/pubmed/30537407 “Guidelines for reporting of statistics for clinical research in urology”
… which seems to me to be much more readable and practical.
Wow. Those guidelines are fantastic. I added a link to them in the Vanderbilt Checklist for Authors. These guidelines are much more actionable than the NEJM guidelines in my view, and do much more to foster good statistical practice (frequentist practice, at least). Here is the full article, which thankfully is open access.
Dear Frank, could you comment on 3.4?
Investigators often interpret confidence intervals in terms of hypotheses. For instance, investigators might claim that there is a statistically significant difference between groups because the 95% confidence interval for the odds ratio excludes 1. Such claims are problematic because confidence intervals are concerned with estimation, not inference.
I don’t get that, if my understanding is right that the construction of confidence intervals is the inverse of hypothesis tests (a (1−α) CI is defined as the set of null hypotheses that are not rejected at p < α for the given data).
I do understand the second argument that, due to using inexact inversions, the limits of the CI may not match the significance level of a test exactly. In my opinion, the given example (CI starts at 1.03 but p = 0.07) is not really dramatic, as it is a borderline case anyway. Here, the test and the CI were calculated with different methods, so it’s not unexpected to get different results. The main concern should be why different methods are employed at all. I think it would be obviously strange to report a p-value for a mean difference based on a t-test but calculate the CI based on the z-distribution (particularly for smaller samples).
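As a small numeric check of that duality (my own sketch with made-up numbers): when the test and the CI come from the same Wald/normal approximation, the two-sided p-value evaluated at a 95% CI limit is exactly 0.05:

```python
# Hedged sketch: Wald CI and Wald test for a log odds ratio built from the
# same normal approximation, so the CI limit corresponds exactly to p = 0.05.
import numpy as np
from scipy import stats

log_or, se = 0.40, 0.18                        # hypothetical estimate and SE
z975 = stats.norm.ppf(0.975)
ci = (log_or - z975 * se, log_or + z975 * se)  # Wald 95% CI on the log scale

def wald_p(null_value):
    z = (log_or - null_value) / se
    return 2 * stats.norm.sf(abs(z))

print(np.exp(ci))      # 95% CI for the odds ratio
print(wald_p(ci[0]))   # exactly 0.05 at the lower CI limit
print(wald_p(0.0))     # p-value for the usual null OR = 1
```

Mismatches of the kind the guideline describes only appear when the interval and the p-value come from different approximations.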
Great question Jochen, and one that I hope those who have read the literature on this better than I will respond to. As you said you can define a confidence interval as the set of all parameter values that if null hypothesized would not be rejected at say the 0.05 level. But since we are moving away from a p < 0.05 world, much has been written about why one should avoid looking to see if the confidence interval includes the null. What exactly we are to do with confidence intervals is left a bit unspecified, especially since they do not have a ready interpretation for the experiment at hand but have only long-run operating characteristics over an infinite sequence of identical experiments.
In addition to avoiding confidence intervals as restated hypothesis tests, there are specific confidence interval fallacies, and I think Sander Greenland wrote a definitive paper on this, besides this one. This may be of interest, and especially this.
A major limitation of confidence intervals is that the limits are not controllable by the analyst. Most researchers would like to know things like the following: what is the probability that the treatment effect is between 0 and 7? A Bayesian posterior probability for [0, 7] is easily calculated, but there is no way to get a “confidence level” for [0, 7]. Instead one specifies the coverage probability and gets different limits as a solution.
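For example, under an assumed normal posterior (hypothetical mean and SD, purely for illustration), the direct statement is one line:

```python
# Hedged sketch: posterior probability that the effect lies in [0, 7],
# assuming a normal posterior with made-up mean and SD.
from scipy import stats

posterior = stats.norm(loc=4.0, scale=2.5)   # assumed posterior for the treatment effect
prob_0_to_7 = posterior.cdf(7) - posterior.cdf(0)
print(f"Pr(0 <= effect <= 7 | data) = {prob_0_to_7:.2f}")
```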
I also found this blog post by Richard Morey to be a very thought-provoking supplement to the article and slides you linked to.
I can’t argue with his logic, but I would say that emphasizing the CI minimizes the lazy approach I see too often in my field: scan the abstract or paper for reports of hypothesis test results, ignore the actual estimate, and then draw grand conclusions (i.e., the absence-of-evidence fallacy).
I agree. I think the take-home message is that (strictly speaking) frequentist confidence intervals show the compatibility between the data and the model-based expectation along the data-space axis in which the statistic used to calculate the confidence intervals varies. If the model is barely compatible with the data then the confidence intervals will be narrow regardless of the precision of the data. Although this can happen, it is an uncommon scenario (at least in oncology) and we can get away in practice by pretending that the confidence interval width only measures precision.
I can see though why some Bayesians would scoff at this inferential incoherence.