Random sampling versus random allocation/randomization: implications for p-value interpretation

The random sampling model of statistical inference is genuinely applicable to major areas of statistics, such as survey methodology, statistical quality control, ecological abundance studies, etc., just not to RCTs and many designed experiments. (There are types of epidemiological studies, such as medical surveys, that are closer to the random sampling model though.) Unfortunately the standard textbooks tend to introduce statistical inference using only the random sampling model, and never get around to explaining the random assignment model and the different interpretation that attends it. Thus it seems to me that the textbooks are carrying out a bait-and-switch. Why? Is it because most statistical methods (except for randomization tests/permutation tests/exact tests) are derived from a random sampling paradigm? Is it even possible to recast all such derivations in a random assignment setting?

In actual RCTs we don’t often use randomization tests, whose interpretation fits the random assignment design of the study; rather, we use conventional frequentist methods that are derived from random sampling concepts. Edgington and Onghena (cited above) seem to suggest that we do this in order to approximate the p-value from the randomization test that we should have done but didn’t, so the accompanying random sampling interpretation of the resulting p-value isn’t relevant. The Berger et al. paper posted by @R_cubed (in his initial post on this thread) seems to be arguing along similar lines.

It seems to me that promoters of randomization/permutation/exact tests (Lehmann’s 1975 nonparametrics book might be a rare exception) are the most willing to even broach this topic. Do those who are not in the permutation test camp agree that our methods are just approximating their methods? And if so, why don’t our courses and books admit it? And if not, is it actually possible to derive all conventional frequentist methods purely from a random assignment paradigm? If yes, why aren’t approximately half our textbooks doing that instead of random sampling? Now maybe I’m the one who is confused :slight_smile:

4 Likes

But it needs to be pointed out again how out of place a sampling frame of mind is for the problems you mentioned other than surveys. The Bayesian attempt to reveal hidden data-generating mechanisms is the more applicable approach and is much, much cleaner in terms of understanding concepts IMHO.

2 Likes

This is an interesting question, because even if one is an ardent supporter of permutation tests for RCTs, for a large enough sample exhaustive enumeration of all permutations is computationally infeasible.

What is done then? Take a random sample of possible permutations!

The statistician EJG Pitman invented these tests long before we had computer power to conduct them.

Getting back to the last question by @ESMD:

You have the right idea. You have gone out and recruited a sufficient number of patients you wish to test an intervention on. You randomly assign them to treatment vs. control, then compute a group level difference.

To do the permutation test, simply use a computer program to take your observed data, switch the labels (tx vs. control) on the observations, compute the group difference, and store the result. If you can compute all possible permutations, you have effectively re-done your experiment under the no-effect hypothesis.

If your sample is too big for exhaustive enumeration, my comment above applies. You might think of your observed sample as a member of a population of all possible permutations of treatment assignments.
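
To make this concrete, here is a minimal Python sketch of both cases: exhaustive enumeration when the number of re-labelings is small enough, otherwise a random sample of re-labelings. The function name, the difference-in-means statistic, and the cutoffs are my own illustrative choices, not a standard recipe.

```python
# Minimal sketch of a two-sample permutation (randomization) test using a
# difference in means. Enumerate every re-labeling when that is feasible;
# otherwise fall back to a random sample of re-labelings (Monte Carlo).
from itertools import combinations
from math import comb
import numpy as np

def permutation_test(treated, control, max_exact=200_000, n_mc=10_000, seed=0):
    x = np.concatenate([treated, control])
    n, n_t = len(x), len(treated)
    observed = np.mean(treated) - np.mean(control)
    total = x.sum()

    def diff(idx):
        # difference in means if the subjects in `idx` had been labeled "treated"
        t_sum = x[np.asarray(idx)].sum()
        return t_sum / n_t - (total - t_sum) / (n - n_t)

    if comb(n, n_t) <= max_exact:
        # exhaustive enumeration of all possible treatment assignments
        diffs = np.array([diff(idx) for idx in combinations(range(n), n_t)])
    else:
        # random sample of possible assignments
        rng = np.random.default_rng(seed)
        diffs = np.array([diff(rng.choice(n, n_t, replace=False))
                          for _ in range(n_mc)])

    # two-sided p-value: how often a re-labeling is at least as extreme as observed
    return observed, np.mean(np.abs(diffs) >= np.abs(observed))
```

With 20 patients allocated 10:10 there are comb(20, 10) = 184,756 re-labelings, so the exhaustive branch runs; with 100 patients there are roughly 10^29, so only the Monte Carlo branch is realistic.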

So, while I think the random sampling vs random assignment distinction is very important, and that it is misleading to link p values exclusively to a sampling frame of reference, sampling theory methods are useful and can show up in surprising places.

4 Likes

Frank, your first sentence seems to be the point that @ESMD is making and that I am supporting. It’s worth thinking about, since the medical literature is predominantly frequentist and we have to engage with it.

The sampling paradigm is used to derive most frequentist statistical methods, but these are then applied to data from randomized trials with no justification. This is a non sequitur. The permutation test people offer a rescue, if only we will admit that we are doing this to approximate the “correct” (to them) approach of using randomization tests, which are derived from a random assignment paradigm. Is this the right way to think about why the medical literature is the way it is? If so, then @ESMD is correct that the textbooks should clearly state it, and by not doing so they cultivate confusion.

My above comment proposes an alternate explanation: that it is possible to derive standard frequentist methods from a random assignment paradigm, but I’ve never seen this done, and I’m not smart enough to attempt it myself.

The third (and most cynical) explanation is that this is all just an insincere bait-and-switch by those who teach and write about frequentist stats. I hope there are other explanations that I didn’t think of.

It is worth quoting the final paragraph of the S. Senn blog that is cited in the OP:

What both methods have in common is that there is a theory to relate the inferences that may be made to the way that allocation (experiments) or selection (surveys) has been conducted. Whether the design should dictate the analysis, whether the intended analysis should guide the design or whether some mixture of the two is appropriate, is a deep and fascinating issue on which I and others more expert than me have various opinions but this blog has gone on long enough, so I shall stop here.

One way to read between the lines here is that current practice mismatches analysis to design, using sampling-based inference to analyze randomization-based experiments. I would welcome clarification from him though.

1 Like

Very good point, and this seems to be a more honest way to approximate a permutation test than what is usually done in practice, no?

2 Likes

Thank you @ESMD for the addendum regarding Lock Morgan’s paper. I should have known that a member of the Lock family has thought deeply about these points (the Lock5 intro stats book is based on both randomization tests and the bootstrap, rather than conventional parametric tests - link below). She seems to be supporting the view that sampling-based methods and randomization-based methods can approximate each other, as they are asymptotically equivalent for large samples. So the “bait and switch” is defensible, if only authors and teachers would take the time to talk about this.

The Lock Morgan paper is excellent and a badly needed source of illumination. I will nitpick one point though: she attributes the random sampling framework to Neyman and Pearson 1928, but Fisher wrote about it in 1922.

2 Likes

To a large extent that is what has happened, and is partially a reflection of a systemic allergy to Bayes.

I reject permutation tests because they don’t extend well to covariate adjustment (hence they don’t take outcome heterogeneity into account) or complex longitudinal problems.

4 Likes

I agree on the point regarding the limitations of permutation tests. Another objection is that the permutation test methods seem to focus mainly on p-values, and in the last 13 years I’ve reported a p-value on only three occasions. Aside from those 3 exceptions, they are not useful to me or to those whom I serve.

3 Likes

Rod Little has a seminar on Nov 22 about advantages of Bayes for surveys.

2 Likes

Yes, very insightful indeed. One point that @Stephen Senn states is that balance of group sizes is not important so long as the analysis is done properly. But that depends entirely on the effect measure used. If the RR is used, as is usual, then from Bayes’ theorem:

RR = \frac{P(Y=1 \mid X=1)}{P(Y=1 \mid X=0)} = \frac{P(X=1 \mid Y=1)\,/\,P(X=0 \mid Y=1)}{P(X=1)\,/\,P(X=0)}

which would mean that the magnitude of the RR depends on the degree of (im)balance of the group sizes, and for 1:1, 2:1 and 3:1 allocation it would vary across the same posterior odds:

[Figure: RR plotted against posterior odds of treatment given an event, for 1:1, 2:1 and 3:1 allocation]

So perhaps this needs further consideration for those who use RRs in clinical trials.

This is simply not true. With randomization, you get an unbiased estimate both of the distribution of outcomes under the intervention and of the distribution of outcomes under the control condition. This will be the case regardless of the size of the arms, and therefore any effect measure (i.e., any functional of these two distributions) is also invariant to the random allocation scheme. The allocation scheme may affect the precision of the estimate (1:1 is more efficient), but it will not affect the true value of any effect measure.

I am not sure I fully understand your graph. It seems like you may be trying to say that the RR varies with different posterior odds. But the “posterior odds” is not something that is expected to be stable for different randomization allocations. If you randomize 3:1, this will of course result in a higher odds of treatment among the cases, simply because you designed your study to have higher probability of treatment. Odds(X|Y=1) therefore varies with allocation schemes, while RR (and OR, and any other effect measure) is invariant to it.
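
For what it’s worth, here is a small simulation of my own that shows the point numerically. The true risks of 0.10 under treatment and 0.20 under control (so true RR = 0.5) are assumed purely for illustration:

```python
# Illustration: the risk ratio is (essentially) unchanged across allocation
# ratios, while the odds of treatment among the cases follows the design.
# The "true" risks below are assumed only for the sake of the example.
import numpy as np

rng = np.random.default_rng(1)
risk_treated, risk_control = 0.10, 0.20      # assumed true risks; true RR = 0.5
n_total = 400_000                            # large n so sampling noise is tiny

for a, b in [(1, 1), (2, 1), (3, 1)]:
    n_t = n_total * a // (a + b)
    n_c = n_total - n_t
    events_t = rng.binomial(n_t, risk_treated)
    events_c = rng.binomial(n_c, risk_control)
    rr = (events_t / n_t) / (events_c / n_c)
    odds_tx_given_case = events_t / events_c   # Odds(X = treated | Y = event)
    print(f"{a}:{b} allocation   RR = {rr:.3f}   "
          f"Odds(treated | case) = {odds_tx_given_case:.2f}")
```

With these assumptions the printed RR should sit close to 0.5 in every row, while Odds(treated | case) should move from roughly 0.5 to 1.0 to 1.5 as the allocation goes from 1:1 to 2:1 to 3:1.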

3 Likes

Moderator input: I don’t want to discuss ORs and RRs on this topic but rather to keep that discussion in the already existing topic that was made for exactly this purpose. In passing I will note that the RR used in clinical trials (e.g. # deaths in B / nB divided by # deaths in A / nA) is independent of allocation ratio. But let’s close that here.

5 Likes

I’m not so sure how defensible this is in reality. I think your comment above about the permutation test being more honest (and accurate) is correct.

What is important to understand is that the probability of a Type I error does not necessarily equal the chosen significance level … [following a discussion of \alpha under strict normality]
if the random sample does not come from a normal distribution, … then Type I error can differ, sometimes substantially from \alpha, so we do not have as much control over Type I error as the chosen \alpha leads us to believe.

The following section talks about constructing exact “confidence” (a.k.a. compatibility) intervals from permutation tests. The paper doesn’t pursue it, but this seems to leave open the possibility of constructing an exact “confidence distribution” (the set of all such intervals across every \alpha level).
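
To make that concrete, here is a rough Python sketch of my own (not taken from the paper) that inverts a Monte Carlo permutation test to get a compatibility interval for a location shift; running it over a grid of \alpha values would trace out the corresponding confidence distribution. The function names, the NumPy-array inputs, and the grid choices are all illustrative assumptions:

```python
# Sketch: a compatibility ("confidence") interval for a location shift obtained
# by inverting a two-sample Monte Carlo permutation test. A shift delta stays in
# the interval if H0: "(treated - delta) and control share one distribution"
# is not rejected at level alpha.
import numpy as np

def perm_pvalue(a, b, n_mc=4000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)        # fixed seed: same re-labelings per call
    x = np.concatenate([a, b])
    n_a = len(a)
    obs = a.mean() - b.mean()
    diffs = np.empty(n_mc)
    for i in range(n_mc):
        perm = rng.permutation(x)
        diffs[i] = perm[:n_a].mean() - perm[n_a:].mean()
    return np.mean(np.abs(diffs) >= abs(obs))

def shift_interval(treated, control, alpha=0.05, n_grid=161):
    """Endpoints of the set of shifts delta not rejected at level alpha."""
    obs = treated.mean() - control.mean()
    spread = np.concatenate([treated, control]).std()
    grid = np.linspace(obs - 4 * spread, obs + 4 * spread, n_grid)
    kept = [d for d in grid if perm_pvalue(treated - d, control) > alpha]
    return min(kept), max(kept)
```

Calling shift_interval(trt, ctl, alpha=0.05) gives a 95% compatibility interval; repeating with alpha = 0.5, 0.25, 0.1, 0.05, 0.01 sketches the confidence distribution mentioned above.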

There was also a brief discussion of the Wilcoxon rank test, which is an elegant example of a permutation test.

The paper doesn’t address Frank’s criticism regarding adjustment for covariates, however.

Addendum: The following paper does develop a permutation approach that adjusts for covariates. I’m still in the process of reading it.

My guess is that model based adjustment can be more easily applied, while permutation analyses need to be re-derived for each study based upon the design.

Still, I think the logic behind permutation methods is important to understand, even if the modelling approach is preferred.

Tang, L., Duan, N., Klap, R., Asarnow, J.R. and Belin, T.R. (2009), Applying permutation tests with adjustment for covariates and attrition weights to randomized trials of health-services interventions. Statist. Med., 28: 65-74.
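
For readers who want the general flavor of covariate-adjusted permutation, here is a generic residual-permutation sketch in the spirit of Freedman and Lane (1983). To be clear, this is not the specific Tang et al. procedure, and the function and variable names are my own:

```python
# Sketch of a covariate-adjusted permutation test via residual permutation
# (Freedman-Lane style): permute the residuals from the covariate-only model,
# not the raw outcomes, then re-estimate the treatment coefficient.
import numpy as np

def freedman_lane(y, treat, covars, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    X_cov = np.column_stack([np.ones(n), covars])   # covariate-only design
    X_full = np.column_stack([X_cov, treat])        # + treatment indicator

    def treat_coef(outcome):
        beta, *_ = np.linalg.lstsq(X_full, outcome, rcond=None)
        return beta[-1]                             # coefficient on treatment

    # Reduced (covariate-only) fit: fitted values and residuals.
    beta_red, *_ = np.linalg.lstsq(X_cov, y, rcond=None)
    fitted = X_cov @ beta_red
    resid = y - fitted

    observed = treat_coef(y)
    perm_stats = np.array([treat_coef(fitted + rng.permutation(resid))
                           for _ in range(n_perm)])
    p = np.mean(np.abs(perm_stats) >= abs(observed))
    return observed, p
```

The observed treatment coefficient is compared with coefficients obtained after permuting only the part of the outcome not explained by the covariates, so the adjustment is respected in both the observed analysis and the permutation reference distribution.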

This recent dissertation links permutation methods to estimation of confidence distributions and rare event meta-analysis.


Permutation methods are a crucial component of valid adaptive estimation procedures, created by “inverting the test”, i.e., searching for parameter values that are not rejected. These frequentist techniques can be used in place of robust estimates and the invalid procedure of testing assumptions. Thomas O’Gorman’s books on this should be more widely recognized by those with an applied frequentist philosophy.

I had forgotten about the exact confidence intervals. Some are described in Hirji’s Exact Analysis of Discrete Data (CRC Press, 2006) which is on my bookshelf, but when you spend time with this book it is clear that there are key gaps in the literature. An example is a confidence interval for a stratified risk ratio - Hirji doesn’t offer an exact method for this, and I wonder if anyone has tried to invent one.

I’ve also heard of “exact logistic regression” but I don’t know enough about it to be able to say if it addresses Frank’s point about covariates.

“Exact” logistic regression is to regular logistic regression as Fisher’s “exact” test is to the Pearson \chi^2 test. An exact answer to the wrong question, and conservative.

1 Like

R^3 just pointed out this thread to me. There’s a lot I could write but I think the most crucial points have been made, so I’ll just list a few items as I like to think of them:

  1. The random-sampling and randomization models are isomorphic (we can translate from one to the other) as can be seen by considering finite-population sampling (which is all we ever really do): Random sampling is random allocation to be in or out of the sample; randomization is random sampling from the total selected experimental group to determine those in the group who will receive treatment. Sampling methods thus immediately translate into allocation methods and vice-versa, although some methods may look odd or infeasible after translation (which may explain why the isomorphism is often overlooked).

  2. The evolution of my views on the meaning of P-values and CI when there is no identifiable randomizer may be seen in going from
    Randomization, Statistics, and Causal Inference on JSTOR
    to
    https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529625

  3. Despite the isomorphism, in practice sampling (selection into the study) and allocation (selection for treatment) are two distinct operations, each of which may or may not involve randomization. When they both do, some interesting divergences between Fisherian and Neyman-Pearson (NP) testing can arise, e.g., see
    On the Logical Justification of Conditional Tests for Two-By-Two Contingency Tables on JSTOR
    which brings us to…

  4. Permutation tests, exact tests, resampling etc.: These are general methods for getting P-values and CI when we are worried about the usual, simple asymptotic approximations breaking down in practice (which occurs more often than noticed). For these methods, covariate-adjusted P-values and CI can be obtained by resampling residuals from fitted adjustment models.

  5. Nonetheless, in my experience switching from the usual Wald (Z-score) P and CI to likelihood-ratio or better still bias-adjusted score P and CI (as in the Firth adjustment) has always been as accurate as could be obtained without going on to use well-informed priors. Those priors translate into a penalized likelihood function, and the P-values and CI from that function are approximate tail and interval areas in the resulting marginal posterior distributions. This use of Bayes will be more frequency-accurate than ordinary frequentist P and CI (including permutation and “exact” P and CI) when the information in the prior is valid in a specific sense. Which brings up…

  6. Valid objections to Bayes seem to come down to the fact that invalid (misinformed) priors can ruin accuracy, and a fear which I share that “expert” priors are often invalid (typically prejudicially biased). In typical practice, the response of using reference priors comes down to using 2nd-order corrected frequentist P and CI, as in the Firth adjustment (which in logistic regression reduces to the Bayes posterior from a Jeffreys invariant prior).
  7. Finally, an important technical point which seems to have been overlooked in most published discussions (including mine): The Karl Pearson/Fisher observed tail-area P-value (their “value of P”) is not always equal to the realization of the random variable that is the minimum alpha for which rejection would occur (the P-value defined from Neyman-Egon Pearson testing). This is so even for the simplest normal-mean interval-testing problem. It happens when frequentist criteria are imposed that sacrifice single-sample coherence for hypothetical long-run optimization, notably uniformly most powerful unbiasedness (UMPU). Failure to notice this divergence has led to logically incorrect claims that compatibility interpretations of Fisherian P-values are incoherent, when the incoherence applies only to NP P-values. These claims thus flag a failure to actually read and understand Fisher and later discussions of P-values, which reject decision-theoretic foundations (and their criteria such as UMPU) in favor of information-theoretic and causal foundations. The conflict goes unnoticed in part because the P and CI from the Fisherian and NP approaches don’t diverge numerically in most everyday applications.
9 Likes

Thank you, Dr. Greenland, for your input to this thread. Since this is a forum for statisticians/epidemiologists, I trust that most other datamethods readers (Masters/PhDs) will be in a better position than I to learn from the points you’re making. Alas, as an MD, I don’t have the training needed to understand them.

I do have a very “high-level” question about point number 1 above, though:

The random-sampling and randomization models are isomorphic (we can translate from one to the other) as can be seen by considering finite-population sampling (which is all we ever really do): Random sampling is random allocation to be in or out of the sample; randomization is random sampling from the total selected experimental group to determine those in the group who will receive treatment. Sampling methods thus immediately translate into allocation methods and vice-versa, although some methods may look odd or infeasible after translation (which may explain why the isomorphism is often overlooked).

Does this mean that you disagree with the authors of the articles linked earlier in this thread (e.g., Hirschauer, Lock Morgan) regarding the ability to extrapolate p-value interpretations from a random sampling framework to the random allocation framework that is used in RCTs? Maybe you are, in fact, all agreeing with each other (?) and I just don’t even have enough training to recognize the agreement…

While those with extensive training in statistics might easily understand your statement that “randomization is random sampling from the total selected experimental group to determine those who will receive treatment,” I’m not at all sure that this equivalence would be understood by beginning students [nor am I sure whether this view is aligned with the one being espoused by Dr. Senn in his linked blog above (?)].

I would imagine that it’s very difficult for experts in any field to turn back the clock and try to recall what it’s like to learn their subject starting from no baseline understanding. The potential for confusion among students is great, especially in fields where the same terms are applied slightly differently by different experts and without explanation of how nuanced the interpretation of even a single word can be. My impression is that statistics might have bigger problems than other fields in this regard. Students who encounter a literature rife with apparent disagreement about fundamental concepts (around which much of the rest of the field seems to be built) are left thinking “What hope do the rest of us have?”

3 Likes

I don’t know of a simple, brief answer to your question, other than to say that I find logical and practical shortcomings in almost all discussions of statistical foundations and general applications. Consequently, I have felt forced to conclude that much if not most of the ongoing disputations are driven by psychosocial factors. That includes many passages in mathematical form, a form which is often used to give an unassailable air to claims while obscuring the dubious assumptions and interpretations needed to make the logic go through.

Most writings I see seem biased by the reading and experiences of the authors, the authorities they like, and the body of work they are committed to defend. Lest I be taken as a complete nihilist, however, I will say that I have been impressed with the soundness and depth of some writers on these topics, including George Box, IJ Good, and DR Cox in the last century and Stephen Senn into the present, all of whom recognized the need to avoid dogmatic commitments for or against philosophies and methods (although each had or have their preferences for teaching and practice, as do I). Even some of the more dogmatic writers (like DeFinetti, Lindley, and Royall) provided much worthwhile reading.

Applied experience inevitably must be very limited, e.g., Senn is very experienced in designed experiments whereas I am not. Instead I have had to focus on observational studies, which makes me less inclined to start from randomized designs and more inclined toward using explicit causal models (since randomization is unavailable to rule out uncontrolled causes of treatment). The point is that the applied area of the writer needs to be taken into account when judging why they choose to emphasize what they do.

A danger of authorities is that they are further narrowed by their career commitments to methods (through having developed, taught, or used them). That commitment can make them good at supplying rationales for those methods, but not so reliable as critics of methods, including their own and those they avoid because they don’t like the philosophies they attach to them, or see them as threats to their own methods. They can be especially harsh toward or omit mention altogether of methods that are promoted by those they don’t like personally. Then too, we all tend to overgeneralize from our own inevitably narrow experience and reading, sometimes into embarrassing error (a major hazard for those who act like they are polymaths - or worse, claim to be).

So as a supposed authority subject to all the warnings I just listed and more, I can only advise nonexperts to suspend judgement, avoid commitments about methods and philosophies, and don’t accept any conclusion if you can’t discern and accept all the assumptions and follow the argument that produced it. And of course keep reading, including the papers cited earlier. Among more elementary treatments I’ve coauthored that you may find useful are
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations (JSTOR)
The need for cognitive science in methodology: https://academic.oup.com/aje/article/186/6/639/3886035
Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise (BMC Medical Research Methodology)
Surprise! https://academic.oup.com/aje/article/190/2/191/5869593
To curb research misreporting, replace significance and confidence by compatibility: a Preventive Medicine Golden Jubilee article (ScienceDirect)
Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values: https://journals.sagepub.com/doi/full/10.1177/02683962221105904

6 Likes

Well said! In medicine the same applies, as all decision making is inferential and follows the same argument. This also explains why physicians don’t readily accept any conclusion if they can’t discern and accept all the assumptions and follow the argument that produced it - as in the “individual response” thread recently here at datamethods.

I too am grateful for Dr. Greenland’s comments. It will take me weeks to read everything he linked, but my initial appraisal of his point #1 is that it provides a conceptual explanation for why we should have expected Lock Morgan’s results to be true (that random sampling-based inference is asymptotically equivalent to randomization-based inference). This would also completely defuse the more aggressive statements made by the permutation test people, such as Berger et al.’s claim that “We conclude that the frequent preference for parametric analyses over exact analyses is without merit.” It would also negate my use of the term “bait and switch”, which I now see may be unfair. @ESMD’s point remains valid that the textbooks and courses need to actually explain what is going on here, rather than sweeping it all under the rug. This thread, however, has shown how much effort was needed to determine “what is going on here”!

@ESMD is also asking about the interpretation of the p-value. If the two frameworks are indeed isomorphic, then it would seem you get to choose one interpretation or the other depending on the nature of the actual study you’re working on. Anyway, that seems to be what most people are actually doing. However, at an operational (study design/execution) level it remains extremely important to be clear about which framework is intended and to follow it strictly - muddling them together would probably deny you access to either interpretation. (I admit this last is speculation - something that I hope is true, but can’t prove.)

As for nihilism, I have been accused of this on another thread on this forum. Such accusations are a way to shut down discussion and close minds. It makes no sense for someone who is working on real problems, and tentatively finding and offering solutions that others have found useful, to be called a nihilist. But to return to Dr. Greenland’s point, we can learn a lot from the most thoughtful writers on the foundations of our discipline, even if we are not completely persuaded by any of them.

2 Likes