I agree on the point regarding the limitations of permutation tests. Another objection is that permutation test methods seem to focus mainly on p-values, and in the last 13 years I’ve reported a p-value on only three occasions. Aside from those three exceptions, p-values are not useful to me or to those whom I serve.
Yes, very insightful indeed. One point that @Stephen Senn states is that balance of group sizes is not important so long as the analysis is done properly, but that depends entirely on the effect measure used. Given that the RR is used, as is usual, then from Bayes’ theorem:
which means that the magnitude of the RR is dependent on the degree of (im)balance of group sizes and for 1:1, 2:1 and 3:1 allocation would vary across the same posterior odds:
So perhaps this needs further consideration for those who use RRs in clinical trials.
This is simply not true. With randomization, you get an unbiased estimate both of the distribution of outcomes under the intervention, and the distribution of outcomes under the control condition. This will be the case regardless of the size of the arm, and therefore, any effect measure (i.e., any functional of these two distributions) is also invariant to the random allocation scheme. The allocation scheme may affect the precision of the estimate (1:1 is more efficient), but it will not affect the true value of any effect measure.
I am not sure I fully understand your graph. It seems like you may be trying to say that the RR varies with different posterior odds. But the “posterior odds” is not something that is expected to be stable for different randomization allocations. If you randomize 3:1, this will of course result in a higher odds of treatment among the cases, simply because you designed your study to have higher probability of treatment. Odds(X|Y=1) therefore varies with allocation schemes, while RR (and OR, and any other effect measure) is invariant to it.
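To make the invariance argument above concrete, here is a minimal sketch using expected counts rather than simulated data. The risks (1/10 in the treated arm, 1/5 in the control arm), the arm size, and the function name `expected_measures` are all hypothetical choices made for illustration, not values from this thread.

```python
from fractions import Fraction

def expected_measures(risk_treated, risk_control, ratio, n_control=1000):
    """Expected RR and Odds(treated | event) under a ratio:1 allocation.

    All inputs are illustrative assumptions; exact fractions are used so
    that no rounding obscures the comparison.
    """
    n_treated = ratio * n_control
    events_treated = risk_treated * n_treated   # expected events, treated arm
    events_control = risk_control * n_control   # expected events, control arm
    rr = (events_treated / n_treated) / (events_control / n_control)
    odds_treated_given_event = events_treated / events_control
    return rr, odds_treated_given_event

for ratio in (1, 2, 3):
    rr, odds = expected_measures(Fraction(1, 10), Fraction(1, 5), ratio)
    print(f"{ratio}:1 allocation  RR = {rr}  Odds(X=1|Y=1) = {odds}")
```

Under these assumptions the RR stays at 1/2 for every allocation ratio, while the odds of treatment among those with the event scale with the ratio, which is the distinction drawn in the post above.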
Moderator input: I don’t want to discuss ORs and RRs on this topic but rather to keep that discussion in the already existing topic that was made for exactly this purpose. In passing I will note that the RR used in clinical trials (e.g. # deaths in B / nB divided by # deaths in A / nA) is independent of allocation ratio. But let’s close that here.
I’m not so sure how defensible this is in reality. I think your comment above about the permutation test being more honest (and accurate) is correct.
Blockquote
What is important to understand is that the probability of a Type I error does not necessarily equal the chosen significance level … [following a discussion of \alpha under strict normality]
if the random sample does not come from a normal distribution, … then Type I error can differ, sometimes substantially from \alpha, so we do not have as much control over Type I error as the chosen \alpha leads us to believe.
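The quoted point can be checked with a small simulation. This is only a sketch under assumed settings that do not come from the paper: a one-sample t-test with n = 10, a lognormal(0, 1) distribution as the non-normal case, and 20,000 replicates per scenario. The helper name `rejection_rate` is hypothetical.

```python
import math
import random

T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t with 9 df

def rejection_rate(draw, true_mean, n=10, n_sim=20000, seed=1):
    """Share of simulated samples in which a true null mean is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        sample = [draw(rng) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((v - mean) ** 2 for v in sample) / (n - 1)
        t_stat = (mean - true_mean) / math.sqrt(var / n)
        if abs(t_stat) > T_CRIT_DF9:
            rejections += 1
    return rejections / n_sim

# Normal data: the t-test assumptions hold, so the rate should be near 0.05.
normal_rate = rejection_rate(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)

# Skewed data: lognormal(0, 1) has mean exp(0.5); the null is still true.
skewed_rate = rejection_rate(
    lambda rng: rng.lognormvariate(0.0, 1.0), true_mean=math.exp(0.5)
)

print(f"normal data:    {normal_rate:.3f}")
print(f"lognormal data: {skewed_rate:.3f}")
```

Both scenarios test a true null hypothesis, so any gap between the two printed rates reflects the distributional assumption alone.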
The following section talks about constructing exact “confidence” (aka. compatibility) intervals from permutation tests. This was not mentioned, but it seems to leave open the possibility of constructing an exact “confidence distribution” (the set of all intervals at any \alpha level).
There was also a brief discussion of the Wilcoxon rank test, which is an elegant example of a permutation test.
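As an illustration of why the rank-sum test counts as a permutation test, the sketch below enumerates every way the ranks could have been assigned to the first group and compares each rank sum with the observed one. The data are made up, the function name `exact_rank_sum_p` is hypothetical, and ties are assumed absent.

```python
from itertools import combinations

def exact_rank_sum_p(x, y):
    """Two-sided exact p-value for the Wilcoxon rank-sum statistic (no ties)."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # assumes no ties
    n_x, n = len(x), len(pooled)
    observed = sum(rank[v] for v in x)
    expected = n_x * (n + 1) / 2  # mean of the rank sum under the null
    count = total = 0
    # Every subset of ranks of size n_x is equally likely under the null.
    for ranks in combinations(range(1, n + 1), n_x):
        total += 1
        if abs(sum(ranks) - expected) >= abs(observed - expected):
            count += 1
    return count / total

x = [1.1, 2.3, 2.9]        # hypothetical group A
y = [3.5, 4.0, 5.2, 6.1]   # hypothetical group B
print(exact_rank_sum_p(x, y))
```

With three and four observations there are only 35 possible assignments, so full enumeration is cheap; for larger samples one would sample assignments at random instead.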
The paper doesn’t address Frank’s criticism regarding adjustment for covariates, however.
Addendum: The following paper does develop a permutation approach that adjusts for covariates. I’m still in the process of reading it.
My guess is that model based adjustment can be more easily applied, while permutation analyses need to be re-derived for each study based upon the design.
Still, I think the logic behind permutation methods is important to understand, even if the modelling approach is preferred.
Tang, L., Duan, N., Klap, R., Asarnow, J.R. and Belin, T.R. (2009), Applying permutation tests with adjustment for covariates and attrition weights to randomized trials of health-services interventions. Statist. Med., 28: 65-74.
This recent dissertation links permutation methods to estimation of confidence distributions and rare event meta-analysis.
Related Threads
Permutation methods are a crucial component of valid adaptive estimation procedures, created by “inverting the test” – i.e., searching for values not rejected. These frequentist techniques can be used in place of robust estimates and the invalid procedure of testing assumptions. Thomas O’Gorman’s books on this should be more widely recognized by those with an applied frequentist philosophy.
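A rough sketch of “inverting the test” for a constant additive shift: for each candidate shift, subtract it from the treated values, run an exact permutation test, and keep the candidates that are not rejected. The data, the grid, and the function names are hypothetical, and this is not O’Gorman’s adaptive procedure, only the basic inversion idea.

```python
from itertools import combinations

def perm_p_value(x, y):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = x + y
    n_x, n = len(x), len(pooled)
    total_sum = sum(pooled)

    def diff(sum_x):
        return sum_x / n_x - (total_sum - sum_x) / (n - n_x)

    observed = abs(diff(sum(x)))
    count = total = 0
    for idx in combinations(range(n), n_x):
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(diff(sum(pooled[i] for i in idx))) >= observed - 1e-12:
            count += 1
    return count / total

def shift_interval(x, y, grid, alpha=0.05):
    """Range of shift values on the grid that the test does not reject."""
    kept = [d for d in grid if perm_p_value([v - d for v in x], y) > alpha]
    return min(kept), max(kept)

treated = [12.1, 14.3, 15.0, 16.8, 18.2]   # hypothetical outcomes
control = [9.5, 10.2, 11.7, 12.9, 13.4]    # hypothetical outcomes
grid = [i / 10 for i in range(-100, 201)]  # candidate shifts from -10 to 20
low, high = shift_interval(treated, control, grid)
print(f"95% interval for the shift: [{low}, {high}]")
```

The interval is only as fine as the grid, and the sketch assumes the retained shifts form one contiguous range. Repeating the search over many values of `alpha` would trace out the corresponding compatibility (“confidence”) distribution.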
I had forgotten about the exact confidence intervals. Some are described in Hirji’s Exact Analysis of Discrete Data (CRC Press, 2006) which is on my bookshelf, but when you spend time with this book it is clear that there are key gaps in the literature. An example is a confidence interval for a stratified risk ratio - Hirji doesn’t offer an exact method for this, and I wonder if anyone has tried to invent one.
I’ve also heard of “exact logistic regression” but I don’t know enough about it to be able to say if it addresses Frank’s point about covariates.
“Exact” logistic regression is to regular logistic regression as Fisher’s “exact” test is to the Pearson \chi^2 test. An exact answer to the wrong question, and conservative.
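For readers who want to see the Fisher versus Pearson comparison on a single table, here is a minimal sketch. The 2×2 counts are hypothetical, the two-sided Fisher p-value uses the common "sum of tables no more probable than the observed one" convention, and the chi-square test is run without continuity correction.

```python
from math import comb, erfc, sqrt

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value: total probability of tables no likelier than observed."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):  # hypergeometric probability of k in the top-left cell
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(a)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

def pearson_chi2_p(a, b, c, d):
    """Pearson chi-square p-value with 1 df and no continuity correction."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df

table = (7, 3, 2, 8)  # hypothetical counts: rows are groups, columns are outcomes
print(f"Fisher:  {fisher_exact_two_sided(*table):.4f}")
print(f"Pearson: {pearson_chi2_p(*table):.4f}")
```

On this made-up table the Fisher p-value is the larger of the two. A single table cannot establish conservatism, which is a statement about rejection rates over repeated samples, but it shows the direction of the difference.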
R^3 just pointed out this thread to me. There’s a lot I could write but I think the most crucial points have been made, so I’ll just list a few items as I like to think of them:
- The random-sampling and randomization models are isomorphic (we can translate from one to the other) as can be seen by considering finite-population sampling (which is all we ever really do): Random sampling is random allocation to be in or out of the sample; randomization is random sampling from the total selected experimental group to determine those in the group who will receive treatment. Sampling methods thus immediately translate into allocation methods and vice-versa, although some methods may look odd or infeasible after translation (which may explain why the isomorphism is often overlooked).
- The evolution of my views on the meaning of P-values and CI when there is no identifiable randomizer may be seen in going from
Randomization, Statistics, and Causal Inference on JSTOR
to
https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529625
- Despite the isomorphism, in practice sampling (selection into the study) and allocation (selection for treatment) are two distinct operations, each of which may or may not involve randomization. When they both do, some interesting divergences between Fisherian and Neyman-Pearson (NP) testing can arise, e.g., see
On the Logical Justification of Conditional Tests for Two-By-Two Contingency Tables on JSTOR
which brings us to…
- Permutation tests, exact tests, resampling etc.: These are general methods for getting P-values and CI when we are worried about the usual, simple asymptotic approximations breaking down in practice (which occurs more often than noticed). For these methods, covariate-adjusted P-values and CI can be obtained by resampling residuals from fitted adjustment models.
- Nonetheless, in my experience, switching from the usual Wald (Z-score) P and CI to likelihood-ratio or, better still, bias-adjusted score P and CI (as in the Firth adjustment) has always been as accurate as could be obtained without going on to use well-informed priors. Those priors translate into a penalized likelihood function, and the P-values and CI from that function are approximate tail and interval areas in the resulting marginal posterior distributions. This use of Bayes will be more frequency-accurate than ordinary frequentist P and CI (including permutation and “exact” P and CI) when the information in the prior is valid in a specific sense. Which brings up…
- Valid objections to Bayes seem to come down to the fact that invalid (misinformed) priors can ruin accuracy, and a fear, which I share, that “expert” priors are often invalid (typically prejudicially biased). In typical practice, the response of using reference priors comes down to using 2nd-order corrected frequentist P and CI, as in the Firth adjustment (which in logistic regression reduces to the Bayes posterior from a Jeffreys invariant prior).
- Finally, an important technical point which seems to have been overlooked in most published discussions (including mine): The Karl Pearson/Fisher observed tail-area P-value (their “value of P”) is not always equal to the realization of the random variable that is the minimum alpha for which rejection would occur (the P-value defined from Neyman-Egon Pearson testing). This is so even for the simplest normal-mean interval-testing problem. It happens when frequentist criteria are imposed that sacrifice single-sample coherence for hypothetical long-run optimization, notably uniformly most powerful unbiasedness (UMPU). Failure to notice this divergence has led to logically incorrect claims that compatibility interpretations of Fisherian P-values are incoherent, when the incoherence applies only to NP P-values. These claims thus flag a failure to actually read and understand Fisher and later discussions of P-values, which reject decision-theoretic foundations (and their criteria such as UMPU) in favor of information-theoretic and causal foundations. The conflict goes unnoticed in part because the P and CI from the Fisherian and NP approaches don’t diverge numerically in most everyday applications.
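The remark in the list above that covariate-adjusted P-values can be obtained by resampling residuals from a fitted adjustment model can be sketched as follows. This is one simple variant chosen for illustration, not necessarily the procedure the author has in mind: fit a straight line of outcome on a single covariate, then re-randomize the treatment labels against the residuals. The data and all function names are hypothetical.

```python
import random

def ols_residuals(y, x):
    """Residuals from a simple least-squares fit of y on one covariate x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

def mean_diff(values, labels):
    """Mean of values with label 1 minus mean of values with label 0."""
    t = [v for v, z in zip(values, labels) if z]
    c = [v for v, z in zip(values, labels) if not z]
    return sum(t) / len(t) - sum(c) / len(c)

def adjusted_perm_p(y, x, treated, n_perm=5000, seed=0):
    """Monte Carlo permutation p-value for treatment after adjusting for x."""
    rng = random.Random(seed)
    resid = ols_residuals(y, x)
    observed = abs(mean_diff(resid, treated))
    labels = list(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # same as permuting residuals across fixed labels
        if abs(mean_diff(resid, labels)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero

# Hypothetical trial: outcome rises with the covariate, plus a treatment effect of 3.
x = [float(i) for i in range(20)]
treated = [i % 2 for i in range(20)]
y = [2 * xi + 3 * zi + 0.1 * ((i * 7) % 5 - 2) for i, (xi, zi) in enumerate(zip(x, treated))]
print(adjusted_perm_p(y, x, treated))
```

More refined schemes fit the treatment and covariate jointly and permute residuals from the reduced model; the sketch only shows the basic mechanics.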
Thank you, Dr. Greenland, for your input to this thread. Since this is a forum for statisticians/epidemiologists, I trust that most other datamethods readers (Masters/PhDs) will be in a better position than I to learn from the points you’re making. Alas, as an MD, I don’t have the training needed to understand them.
I do have a very “high-level” question about point number 1 above, though:
The random-sampling and randomization models are isomorphic (we can translate from one to the other) as can be seen by considering finite-population sampling (which is all we ever really do): Random sampling is random allocation to be in or out of the sample; randomization is random sampling from the total selected experimental group to determine those in the group who will receive treatment. Sampling methods thus immediately translate into allocation methods and vice-versa, although some methods may look odd or infeasible after translation (which may explain why the isomorphism is often overlooked).
Does this mean that you disagree with the authors of the articles linked earlier in this thread (e.g., Hirschauer, Lock Morgan), regarding the ability to extrapolate p value interpretations from a random sampling framework to the random allocation framework that is used in RCTs? Maybe you are, in fact, all agreeing with each other (?) and I just don’t even have enough training to recognize the agreement…
While those with extensive training in statistics might easily understand your statement that “randomization is random sampling from the total selected experimental group to determine those who will receive treatment,” I’m not at all sure that this equivalence would be understood by beginning students [nor am I sure whether this view is aligned with the one being espoused by Dr. Senn in his linked blog above (?)].
I would imagine that it’s very difficult for experts in any field to turn back the clock and try to recall what it’s like to learn their subject starting from no baseline understanding. The potential for confusion among students is great, especially in fields where the same terms are applied slightly differently by different experts and without explanation of how nuanced the interpretation of even a single word can be. My impression is that statistics might have bigger problems than other fields in this regard. Students who encounter a literature rife with apparent disagreement about fundamental concepts (around which much of the rest of the field seems to be built) are left thinking “What hope do the rest of us have?”
I don’t know of a simple, brief answer to your question, other than to say that I find logical and practical shortcomings in almost all discussions of statistical foundations and general applications. Consequently, I have felt forced to conclude that much if not most of the ongoing disputations are driven by psychosocial factors. That includes many passages in mathematical form, a form which is often used to give an unassailable air to claims while obscuring the dubious assumptions and interpretations needed to make the logic go through.
Most writings I see seem biased by the reading and experiences of the authors, the authorities they like, and the body of work they are committed to defend. Lest I be taken as a complete nihilist, however, I will say that I have been impressed with the soundness and depth of some writers on these topics, including George Box, IJ Good, and DR Cox in the last century and Stephen Senn into the present, all of whom recognized the need to avoid dogmatic commitments for or against philosophies and methods (although each had or have their preferences for teaching and practice, as do I). Even some of the more dogmatic writers (like DeFinetti, Lindley, and Royall) provided much worthwhile reading.
Applied experience inevitably must be very limited, e.g., Senn is very experienced in designed experiments whereas I am not. Instead I have had to focus on observational studies, which makes me less inclined to start from randomized designs and more inclined toward using explicit causal models (since randomization is unavailable to rule out uncontrolled causes of treatment). The point is that the applied area of the writer needs to be taken into account when judging why they choose to emphasize what they do.
A danger of authorities is that they are further narrowed by their career commitments to methods (through having developed, taught, or used them). That commitment can make them good at supplying rationales for those methods, but not so reliable as critics of methods, including their own and those they avoid because they don’t like the philosophies they attach to them, or see them as threats to their own methods. They can be especially harsh toward or omit mention altogether of methods that are promoted by those they don’t like personally. Then too, we all tend to overgeneralize from our own inevitably narrow experience and reading, sometimes into embarrassing error (a major hazard for those who act like they are polymaths - or worse, claim to be).
So as a supposed authority subject to all the warnings I just listed and more, I can only advise nonexperts to suspend judgement, avoid commitments about methods and philosophies, and don’t accept any conclusion if you can’t discern and accept all the assumptions and follow the argument that produced it. And of course keep reading, including the papers cited earlier. Among more elementary treatments I’ve coauthored that you may find useful are
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations (on JSTOR).
The need for cognitive science in methodology. https://academic.oup.com/aje/article/186/6/639/3886035
Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise (BMC Medical Research Methodology).
Surprise! https://academic.oup.com/aje/article/190/2/191/5869593
To curb research misreporting, replace significance and confidence by compatibility: A Preventive Medicine Golden Jubilee article (ScienceDirect).
Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. https://journals.sagepub.com/doi/full/10.1177/02683962221105904
Well said! In Medicine, the same applies, as all decision making is inferential and this follows the same argument. This also explains why physicians don’t readily accept any conclusion if they can’t discern and accept all the assumptions and follow the argument that produced it, as in the “individual response” thread recently here at datamethods.
I too am grateful for Dr. Greenland’s comments. It will take me weeks to read everything he linked, but my initial appraisal of his point #1 is that it provides a conceptual explanation for why we should have expected Lock Morgan’s results to be true (that random sampling-based inference is asymptotically equivalent to randomization-based inference). This would also completely defuse the more aggressive statements made by the permutation test people, such as Berger et al’s claim that “We conclude that the frequent preference for parametric analyses over exact analyses is without merit.” It would also negate my use of the term “bait and switch”, which I now see may be unfair. @ESMD’s point remains valid that the textbooks and courses need to actually explain what is going on here, rather than sweeping it all under the rug. This thread, however, has shown how much effort was needed to determine “what is going on here”!
@ESMD is also asking about the interpretation of the p-value. If the two frameworks are indeed isomorphic, then it would seem you get to choose one interpretation or the other depending on the nature of the actual study you’re working on. Anyway that seems to be what most people are actually doing? However, at an operational (study design/execution) level it remains extremely important to be clear about which framework is intended and to follow it strictly; muddling them together would probably deny you access to either interpretation. (I admit this last is speculation, something that I hope is true but can’t prove.)
As for nihilism, I have been accused of this on another thread on this forum. Such accusations are a way to shut down discussion and close minds. It makes no sense for someone who is working on real problems, and tentatively finding and offering solutions that others have found useful, to be called a nihilist. But to return to Dr. Greenland’s point, we can learn a lot from the most thoughtful writers on the foundations of our discipline, even if we are not completely persuaded by any of them.
I fail to see how that follows from what Sander wrote. The quote from Ernst in the Permutation Methods paper (who was also a co-author of the initial paper I linked to) described the problem: parametric assumptions that require large sample properties to hold for our finite, often very small sample, but we don’t know how quickly we converge to the limit.
Philip Good in his book Permutation, Parametric, and Bootstrap Tests of Hypotheses (3rd ed., pp. 153-154) gave a real-world example (categorical data examined via chi-square statistics) where the permutation p-values and the large-sample approximations differed by a factor of 10! The permutation test detected differences that asymptotic approximations did not.
I think it is unfortunate permutation methods are invariably perceived as tests; interval estimates and entire compatibility distributions (more commonly called “confidence” distributions) can be created from them. In the case of RCTs they substitute an unverifiable assumption with a design based proposition that can be taken as correct by construction (if you accept the research report as described).
Related article that @ESMD might appreciate:
I see this as a separate argument in favor of randomization procedures - they behave well for finite sized data sets compared to asymptotic methods (which is not at all surprising), in situations when the former even exist. This is a different argument than the “match your analysis to your design” principle that has consumed much of the oxygen on this thread.
Nonetheless, the Berger et al quote fails on this line of reasoning as well. Consider the example of stratified risk ratio, mentioned earlier. There is no exact confidence interval for it that has been published (at least last I checked). Others have offered at least two solutions for the asymptotic inference including Greenland and Robins (1985) and Gart and Nam (1988). One cannot simply dismiss these as “without merit” when the competing randomization-based inference isn’t even known to exist. This gets back to my point about nihilism - let’s appreciate people who are actually trying to solve these problems, rather than just those who wave around the one-true-ideology.
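For reference, the asymptotic approach mentioned above can be sketched in a few lines. The code computes the Mantel-Haenszel stratified risk ratio with a Wald-type interval on the log scale, using the variance formula as it is usually attributed to Greenland and Robins (1985) in textbook presentations; it is worth checking against the paper before relying on it. The strata counts and the function name are hypothetical.

```python
from math import exp, sqrt

def mh_risk_ratio(strata, z=1.96):
    """Mantel-Haenszel risk ratio with an asymptotic interval.

    strata: list of (a, n1, b, n0), i.e. events and total in the treated arm,
    then events and total in the control arm, one tuple per stratum.
    """
    num = den = var_num = 0.0
    for a, n1, b, n0 in strata:
        n = n1 + n0
        num += a * n0 / n
        den += b * n1 / n
        var_num += ((a + b) * n1 * n0 - a * b * n) / n ** 2
    rr = num / den
    se_log = sqrt(var_num / (num * den))  # standard error of log(RR)
    return rr, rr * exp(-z * se_log), rr * exp(z * se_log)

strata = [(12, 80, 7, 85), (20, 120, 15, 110), (5, 40, 4, 45)]  # hypothetical
rr, lower, upper = mh_risk_ratio(strata)
print(f"RR_MH = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With a single stratum the variance reduces to the familiar 1/a - 1/n1 + 1/b - 1/n0, which gives a quick sanity check on the formula.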
Some of the promoters of permutation “methods” are themselves responsible for this fallacy. The seminal Edgington/Onghena book I cited is titled Randomization Tests. See also the title of the Philip Good book cited by @R_cubed. However, his point remains sound - as alluded to above, I’m familiar with this part of the literature, but failed to make the connection earlier on this thread, since much of the polemics deals with p-values, as we’ve seen. I stand corrected.
The George Cobb paper posted by @R_cubed is very intriguing. Cobb was right in calling out our profession for exactly what @ESMD noticed - failing to justify to students what I called the “bait and switch” between two inferential frameworks. That part of his critique remains sound today. However it is only in retrospect (his paper appeared 10 years before Lock Morgan’s) that we can see that Cobb’s proposed solution need not be the only one. Indeed the Lock family themselves (see their intro textbook I cited above) break free of the traditions Cobb was upset with, by centering their text on both bootstrap and randomization inference concepts, not solely on randomization inference, as Cobb would have preferred. I do not know if the Lock^5 book directly addresses the justification for “bait and switch” but given Lock Morgan’s research on this very topic, I wouldn’t be surprised if they did. I don’t necessarily endorse the Lock^5 book (I’ve never actually inspected a copy) but present it as one possible alternative to Cobb’s proposal; I’m sure others could be imagined.
Finally, on a different thread @Pavlos_Msaouel reminded us that Piantadosi’s Clinical Trials (3d ed., Wiley), sec. 2.2.4, provides additional relevant discussion of the topic of this thread.
I agree about the George Cobb paper linked above; he was a great communicator. He provides a “big picture” view of frequentist inference and explains how and why stats came to be taught the way it has been for so many years. I sure wish that someone had presented this perspective back in 1990 as I suffered through an eye-wateringly confusing undergraduate stats course…
A few excerpts that stood out for me:
“My thesis is that both the content and the structure of our introductory curriculum are shaped by old history. What we teach was developed a little at a time, for reasons that had a lot to do with the need to use available theory to handle problems that were essentially computational… Intellectually, we are asking our students to do the equivalent of working with one of those old 30-pound Burroughs electric calculators with the rows of little wheels that clicked and spun as they churned out sums of squares.
Now that we have computers, I think a large chunk of what is in the introductory statistics course should go… Our curriculum is needlessly complicated because we put the normal distribution, as an approximate sampling distribution for the mean, at the center of our curriculum, instead of putting the core logic of inference at the center…”
“… First, consider obfuscation. A huge chunk of the introductory course, at least a third, and often much more, is devoted to teaching the students sampling distributions…Why is this bad? It all depends on your goal. The sampling distribution is an important idea, as is the fact that the distribution of the mean converges to a normal. Both have their place in the curriculum. But if your goal is to teach the logic of inference in a first statistics course, the current treatment of these topics is an intellectual albatross…”
“…There’s a vital logical connection between randomized data production and inference, but it gets smothered by the heavy, sopping wet blanket of normal-based approximations to sampling distributions…”
“…We may be living in the early twenty-first century, but our curriculum is still preparing students for applied work typical of the first half of the twentieth century…”
“…What I have trouble getting down my persnickety throat is the use of the sampling model for data from randomized experiments. Do we want students to think that as long as there’s any sort of randomization, that’s all that matters? Do we want them to think that if you choose a random sample but don’t randomize the assignment of treatments, it is still valid to conclude that the treatment caused the observed differences? Do we want them to think that because the assignment of treatments was randomized, then they are entitled to regard the seven patients in the example as representative of all patients, regardless of age or whether their operation was a tonsillectomy or a liver transplant? Do we want students to leave their brains behind and pretend, as we ourselves apparently pretend, that choosing at random from a large normal population is a good model for randomly assigning treatments?..”
“… I’ve become convinced that a huge chunk of statistical theory was developed in order to compute things, or approximate things, that were otherwise out of reach. Until very recently, we had no choice but to rely on analytic methods. The computer has offered to free us and our students from that, but our curriculum is at best in the early stages of accepting the offer…”
“…We need a new curriculum, centered not on the normal distribution, but on the logic of inference. When Copernicus threw away the old notion that the earth was at the center of the universe, and replaced it with a system that put the sun at the center, his revolution brought to power a much simpler intellectual regime. We need to throw away the old notion that the normal approximation to a sampling distribution belongs at the center of our curriculum, and create a new curriculum whose center is the core logic of inference.
What is that core logic? I like to think of it as three Rs: randomize, repeat, reject. Randomize data production; repeat by simulation to see what’s typical and what’s not; reject any model that puts your data in its tail.
*The three Rs of inference: randomize, repeat, reject
1. Randomize data production
• To protect against bias
• To provide a basis for inference
- random samples let you generalize to populations
- random assignment supports conclusions about cause and effect
2. Repeat by simulation to see what’s typical
• Randomized data production lets you re-randomize, over and over, to see which outcomes are typical, which are not.
3. Reject any model that puts your data in its tail…”
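The “randomize, repeat, reject” logic quoted above maps directly onto a few lines of code. The sketch below is only an illustration with made-up outcome values (not Cobb’s example) and a hypothetical function name: it re-randomizes the treatment assignment many times and reports how often the re-randomized difference in means is at least as extreme as the observed one.

```python
import random

def rerandomization_p(treated, control, n_rep=10000, seed=0):
    """Monte Carlo randomization p-value for a difference in means."""
    rng = random.Random(seed)
    pooled = treated + control
    k = len(treated)

    def diff(values):
        return sum(values[:k]) / k - sum(values[k:]) / (len(values) - k)

    observed = abs(diff(pooled))
    extreme = 0
    for _ in range(n_rep):
        rng.shuffle(pooled)  # repeat: re-randomize who gets treatment
        if abs(diff(pooled)) >= observed - 1e-12:
            extreme += 1
    return extreme / n_rep  # small value: the data sit in the model's tail

treated = [30, 32, 35, 38]   # hypothetical outcomes, treated group
control = [20, 21, 22, 23]   # hypothetical outcomes, control group
print(rerandomization_p(treated, control))
```

The model being tested here is "treatment makes no difference to anyone", so a small proportion is the "reject" step: the observed assignment would be unusual if that model held.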
Brilliant. Thanks so much for posting this, which I had not seen before.
I agree that this is a good way to understand frequentist inference, but I think it doesn’t do justice to the Bayesian perspective that places conditioning on what is known as fundamental.
It is easy to generate useless procedures with frequency guarantees. Sensible use of frequentist methods relies on the likelihood principle in the background. For example:
I don’t know if it is possible, but I wish there was a course that took the perspective of Herman Chernoff – develop an initial Bayesian decision theoretic perspective, then develop frequentist procedures guided by those constraints.
Could you link the podcast?