Random sampling versus random allocation/randomization: implications for p-value interpretation

R^3 just pointed out this thread to me. There’s a lot I could write but I think the most crucial points have been made, so I’ll just list a few items as I like to think of them:

  1. The random-sampling and randomization models are isomorphic (we can translate from one to the other) as can be seen by considering finite-population sampling (which is all we ever really do): Random sampling is random allocation to be in or out of the sample; randomization is random sampling from the total selected experimental group to determine those in the group who will receive treatment. Sampling methods thus immediately translate into allocation methods and vice-versa, although some methods may look odd or infeasible after translation (which may explain why the isomorphism is often overlooked).

  2. The evolution of my views on the meaning of P-values and CI when there is no identifiable randomizer may be seen in going from
    Randomization, Statistics, and Causal Inference on JSTOR

  3. Despite the isomorphism, in practice sampling (selection into the study) and allocation (selection for treatment) are two distinct operations, each of which may or may not involve randomization. When they both do, some interesting divergences between Fisherian and Neyman-Pearson (NP) testing can arise, e.g., see
    On the Logical Justification of Conditional Tests for Two-By-Two Contingency Tables on JSTOR
    which brings us to…

  4. Permutation tests, exact tests, resampling etc.: These are general methods for getting P-values and CI when we are worried about the usual, simple asymptotic approximations breaking down in practice (which occurs more often than noticed). For these methods, covariate-adjusted P-values and CI can be obtained by resampling residuals from fitted adjustment models.

  5. Nonetheless, in my experience, switching from the usual Wald (Z-score) P and CI to likelihood-ratio or, better still, bias-adjusted score P and CI (as in the Firth adjustment) has always given results as accurate as could be obtained without going on to use well-informed priors. Those priors translate into a penalized likelihood function, and the P-values and CI from that function are approximate tail and interval areas in the resulting marginal posterior distributions. This use of Bayes will be more frequency-accurate than ordinary frequentist P and CI (including permutation and “exact” P and CI) when the information in the prior is valid in a specific sense. Which brings up…

  6. Valid objections to Bayes seem to come down to the fact that invalid (misinformed) priors can ruin accuracy, and a fear which I share that “expert” priors are often invalid (typically prejudicially biased). In typical practice, the response of using reference priors comes down to using 2nd-order corrected frequentist P and CI, as in the Firth adjustment (which in logistic regression reduces to the Bayes posterior from a Jeffreys invariant prior).
  7. Finally, an important technical point which seems to have been overlooked in most published discussions (including mine): The Karl Pearson/Fisher observed tail-area P-value (their “value of P”) is not always equal to the realization of the random variable that is the minimum alpha for which rejection would occur (the P-value defined from Neyman-Egon Pearson testing). This is so even for the simplest normal-mean interval-testing problem. It happens when frequentist criteria are imposed that sacrifice single-sample coherence for hypothetical long-run optimization, notably uniformly most powerful unbiasedness (UMPU). Failure to notice this divergence has led to logically incorrect claims that compatibility interpretations of Fisherian P-values are incoherent, when the incoherence applies only to NP P-values. These claims thus flag a failure to actually read and understand Fisher and later discussions of P-values, which reject decision-theoretic foundations (and their criteria such as UMPU) in favor of information-theoretic and causal foundations. The conflict goes unnoticed in part because the P and CI from the Fisherian and NP approaches don’t diverge numerically in most everyday applications.
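
To make point 1 of the list above concrete, here is a minimal Python sketch of the isomorphism. Everything in it is an assumption for illustration only: the population size, the sample and arm sizes, the seed, and the helper name `random_subset`. The point is simply that one and the same random mechanism serves as "sampling" in the first call and as "allocation" in the second.

```python
import random


def random_subset(units, k, rng):
    """Pick k of the given units uniformly at random, without replacement."""
    return set(rng.sample(units, k))


rng = random.Random(2024)  # hypothetical seed, only for reproducibility

# Random sampling, read as random allocation to "in the sample" or "out of it".
population = list(range(1000))  # hypothetical finite population of unit IDs
sample = random_subset(population, 40, rng)
in_or_out = {unit: (unit in sample) for unit in population}

# Randomization, read as random sampling of the treated from the selected group.
study_group = sorted(sample)
treated = random_subset(study_group, 20, rng)
control = set(study_group) - treated

print(sum(in_or_out.values()), len(treated), len(control))  # prints: 40 20 20
```

Whether a given sampling scheme looks sensible after translation into an allocation scheme (or the reverse) is a separate question, as the list item itself notes.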

Thank you, Dr. Greenland, for your input to this thread. Since this is a forum for statisticians/epidemiologists, I trust that most other datamethods readers (Masters/PhDs) will be in a better position than I to learn from the points you’re making. Alas, as an MD, I don’t have the training needed to understand them.

I do have a very “high-level” question about point number 1 above, though:

The random-sampling and randomization models are isomorphic (we can translate from one to the other) as can be seen by considering finite-population sampling (which is all we ever really do): Random sampling is random allocation to be in or out of the sample; randomization is random sampling from the total selected experimental group to determine those in the group who will receive treatment. Sampling methods thus immediately translate into allocation methods and vice-versa, although some methods may look odd or infeasible after translation (which may explain why the isomorphism is often overlooked).

Does this mean that you disagree with the authors of the articles linked earlier in this thread (e.g., Hirschauer, Lock Morgan), regarding the ability to extrapolate p value interpretations from a random sampling framework to the random allocation framework that is used in RCTs? Maybe you are, in fact, all agreeing with each other (?) and I just don’t even have enough training to recognize the agreement…

While those with extensive training in statistics might easily understand your statement that “randomization is random sampling from the total selected experimental group to determine those who will receive treatment,” I’m not at all sure that this equivalence would be understood by beginning students [nor am I sure whether this view is aligned with the one being espoused by Dr. Senn in his linked blog above (?)].

I would imagine that it’s very difficult for experts in any field to turn back the clock and try to recall what it’s like to learn their subject starting from no baseline understanding. The potential for confusion among students is great, especially in fields where the same terms are applied slightly differently by different experts and without explanation of how nuanced the interpretation of even a single word can be. My impression is that statistics might have bigger problems than other fields in this regard. Students who encounter a literature rife with apparent disagreement about fundamental concepts (around which much of the rest of the field seems to be built) are left thinking “What hope do the rest of us have?”


I don’t know of a simple, brief answer to your question, other than to say that I find logical and practical shortcomings in almost all discussions of statistical foundations and general applications. Consequently, I have felt forced to conclude that much if not most of the ongoing disputations are driven by psychosocial factors. That includes many passages in mathematical form, a form which is often used to give an unassailable air to claims while obscuring the dubious assumptions and interpretations needed to make the logic go through.

Most writings I see seem biased by the reading and experiences of the authors, the authorities they like, and the body of work they are committed to defend. Lest I be taken as a complete nihilist, however, I will say that I have been impressed with the soundness and depth of some writers on these topics, including George Box, IJ Good, and DR Cox in the last century and Stephen Senn into the present, all of whom recognized the need to avoid dogmatic commitments for or against philosophies and methods (although each had or have their preferences for teaching and practice, as do I). Even some of the more dogmatic writers (like DeFinetti, Lindley, and Royall) provided much worthwhile reading.

Applied experience inevitably must be very limited, e.g., Senn is very experienced in designed experiments whereas I am not. Instead I have had to focus on observational studies, which makes me less inclined to start from randomized designs and more inclined toward using explicit causal models (since randomization is unavailable to rule out uncontrolled causes of treatment). The point is that the applied area of the writer needs to be taken into account when judging why they choose to emphasize what they do.

A danger of authorities is that they are further narrowed by their career commitments to methods (through having developed, taught, or used them). That commitment can make them good at supplying rationales for those methods, but not so reliable as critics of methods, including their own and those they avoid because they don’t like the philosophies they attach to them, or see them as threats to their own methods. They can be especially harsh toward or omit mention altogether of methods that are promoted by those they don’t like personally. Then too, we all tend to overgeneralize from our own inevitably narrow experience and reading, sometimes into embarrassing error (a major hazard for those who act like they are polymaths - or worse, claim to be).

So as a supposed authority subject to all the warnings I just listed and more, I can only advise nonexperts to suspend judgement, avoid commitments about methods and philosophies, and don’t accept any conclusion if you can’t discern and accept all the assumptions and follow the argument that produced it. And of course keep reading, including the papers cited earlier. Among more elementary treatments I’ve coauthored that you may find useful are
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations.
The need for cognitive science in methodology. https://academic.oup.com/aje/article/186/6/639/3886035
Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. (BMC Medical Research Methodology)
To curb research misreporting, replace significance and confidence by compatibility. (A Preventive Medicine Golden Jubilee article)
Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. https://journals.sagepub.com/doi/full/10.1177/02683962221105904


Well said! In Medicine, the same applies, as all decision making is inferential and this follows the same argument. This also explains why physicians don’t readily accept any conclusion if they can’t discern and accept all the assumptions and follow the argument that produced it, as in the recent “individual response” thread here at datamethods.

I too am grateful for Dr. Greenland’s comments. It will take me weeks to read everything he linked, but my initial appraisal of his point #1 is that it provides a conceptual explanation for why we should have expected Lock Morgan’s results to be true (that random sampling-based inference is asymptotically equivalent to randomization-based inference). This would also completely defuse the more aggressive statements made by the permutation test people, such as Berger et al’s claim that “We conclude that the frequent preference for parametric analyses over exact analyses is without merit.” It would also negate my use of the term “bait and switch”, which I now see may be unfair. @ESMD’s point remains valid that the textbooks and courses need to actually explain what is going on here, rather than sweeping it all under the rug. This thread, however, has shown how much effort was needed to determine “what is going on here”!

@ESMD is also asking about the interpretation of the p-value. If the two frameworks are indeed isomorphic, then it would seem you get to choose one interpretation or the other depending on the nature of the actual study you’re working on. Anyway that seems to be what most people are actually doing? However, at an operational (study design/execution) level it remains extremely important to be clear about which framework is intended and to follow it strictly; muddling them together would probably deny you access to either interpretation. (I admit this last is speculation: something that I hope is true, but can’t prove.)

As for nihilism, I have been accused of this on another thread on this forum. Such accusations are a way to shut down discussion and close minds. It makes no sense for someone who is working on real problems, and tentatively finding and offering solutions that others have found useful, to be called a nihilist. But to return to Dr. Greenland’s point, we can learn a lot from the most thoughtful writers on the foundations of our discipline, even if we are not completely persuaded by any of them.


I fail to see how that follows from what Sander wrote. The quote from Ernst in the Permutation Methods paper (who was also a co-author of the initial paper I linked to) described the problem: parametric assumptions that require large sample properties to hold for our finite, often very small sample, but we don’t know how quickly we converge to the limit.

Philip Good in his book Permutation, Parametric, and Bootstrap Tests of Hypotheses (3rd ed, p. 153-154) gave a real world example (categorical data examined via Chi-square statistics) where the permutation p values and the large sample approximations differed by a factor of 10! The permutation test detected differences that asymptotic approximations did not.
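
To illustrate how a large-sample approximation and a permutation P-value can part ways in a small table, here is a rough Python sketch. The 2x2 counts are made up for illustration and are not the data from Good's book; the function names are also my own. The permutation P-value is computed exactly by enumerating every re-allocation with the margins held fixed (which, for a 2x2 table, is the same calculation as the Fisher exact test), and the asymptotic P-value is the uncorrected Pearson chi-square with 1 degree of freedom.

```python
from math import comb, erfc, sqrt


def chi_square_p(a, b, c, d):
    """Large-sample Pearson chi-square P-value (1 df, no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(stat / 2))  # upper tail area of chi-square with 1 df


def permutation_p(a, b, c, d):
    """Exact two-sided permutation P-value with both margins held fixed.

    Sums the probabilities of all tables that are no more probable
    than the observed one under random re-allocation.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical counts: 7 of 8 events in one arm, 2 of 8 in the other.
table = (7, 1, 2, 6)
print(permutation_p(*table), chi_square_p(*table))
```

For these made-up counts the exact P-value is 464/11440 (about 0.041), while the chi-square approximation comes out at roughly 0.012, so the two differ by a factor of more than three with only 16 observations. The direction and size of the discrepancy depend on the table; this is one instance, not a general rule.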

I think it is unfortunate that permutation methods are invariably perceived as tests; interval estimates and entire compatibility distributions (more commonly called “confidence” distributions) can be created from them. In the case of RCTs they replace an unverifiable assumption with a design-based proposition that can be taken as correct by construction (if you accept the research report as described).

Related article that @ESMD might appreciate:


I see this as a separate argument in favor of randomization procedures - they behave well for finite sized data sets compared to asymptotic methods (which is not at all surprising), in situations when the former even exist. This is a different argument than the “match your analysis to your design” principle that has consumed much of the oxygen on this thread.

Nonetheless, the Berger et al quote fails on this line of reasoning as well. Consider the example of stratified risk ratio, mentioned earlier. There is no exact confidence interval for it that has been published (at least last I checked). Others have offered at least two solutions for the asymptotic inference including Greenland and Robins (1985) and Gart and Nam (1988). One cannot simply dismiss these as “without merit” when the competing randomization-based inference isn’t even known to exist. This gets back to my point about nihilism - let’s appreciate people who are actually trying to solve these problems, rather than just those who wave around the one-true-ideology.

Some of the promoters of permutation “methods” are themselves responsible for this fallacy. The seminal Edgington/Onghena book I cited is titled Randomization Tests. See also the title of the Philip Good book cited by @R_cubed. However, his point remains sound - as alluded to above, I’m familiar with this part of the literature, but failed to make the connection earlier on this thread, since much of the polemics deals with p-values, as we’ve seen. I stand corrected.


The George Cobb paper posted by @R_cubed is very intriguing. Cobb was right in calling out our profession for exactly what @ESMD noticed - failing to justify to students what I called the “bait and switch” between two inferential frameworks. That part of his critique remains sound today. However it is only in retrospect (his paper appeared 10 years before Lock Morgan’s) that we can see that Cobb’s proposed solution need not be the only one. Indeed the Lock family themselves (see their intro textbook I cited above) break free of the traditions Cobb was upset with, by centering their text on both bootstrap and randomization inference concepts, not solely on randomization inference, as Cobb would have preferred. I do not know if the Lock^5 book directly addresses the justification for “bait and switch” but given Lock Morgan’s research on this very topic, I wouldn’t be surprised if they did. I don’t necessarily endorse the Lock^5 book (I’ve never actually inspected a copy) but present it as one possible alternative to Cobb’s proposal; I’m sure others could be imagined.

Finally, on a different thread @Pavlos_Msaouel reminded us that Piantadosi’s Clinical Trials (3d ed., Wiley), sec. 2.2.4, provides additional relevant discussion of the topic of this thread.


I agree about the George Cobb paper linked above- he was a great communicator. He provides a “big picture” view of frequentist inference and explains how and why stats came to be taught the way it has been for so many years. I sure wish that someone had presented this perspective back in 1990 as I suffered through an eye-wateringly confusing undergraduate stats course…

A few excerpts that stood out for me:

“My thesis is that both the content and the structure of our introductory curriculum are shaped by old history. What we teach was developed a little at a time, for reasons that had a lot to do with the need to use available theory to handle problems that were essentially computational… Intellectually, we are asking our students to do the equivalent of working with one of those old 30-pound Burroughs electric calculators with the rows of little wheels that clicked and spun as they churned out sums of squares.

Now that we have computers, I think a large chunk of what is in the introductory statistics course should go… Our curriculum is needlessly complicated because we put the normal distribution, as an approximate sampling distribution for the mean, at the center of our curriculum, instead of putting the core logic of inference at the center…”

“… First, consider obfuscation. A huge chunk of the introductory course, at least a third, and often much more, is devoted to teaching the students sampling distributions…Why is this bad? It all depends on your goal. The sampling distribution is an important idea, as is the fact that the distribution of the mean converges to a normal. Both have their place in the curriculum. But if your goal is to teach the logic of inference in a first statistics course, the current treatment of these topics is an intellectual albatross…”

“…There’s a vital logical connection between randomized data production and inference, but it gets smothered by the heavy, sopping wet blanket of normal-based approximations to sampling distributions…”

“…We may be living in the early twenty-first century, but our curriculum is still preparing students for applied work typical of the first half of the twentieth century…”

“…What I have trouble getting down my persnickety throat is the use of the sampling model for data from randomized experiments. Do we want students to think that as long as there’s any sort of randomization, that’s all that matters? Do we want them to think that if you choose a random sample but don’t randomize the assignment of treatments, it is still valid to conclude that the treatment caused the observed differences? Do we want them to think that because the assignment of treatments was randomized, then they are entitled to regard the seven patients in the example as representative of all patients, regardless of age or whether their operation was a tonsillectomy or a liver transplant? Do we want students to leave their brains behind and pretend, as we ourselves apparently pretend, that choosing at random from a large normal population is a good model for randomly assigning treatments?..”

“… I’ve become convinced that a huge chunk of statistical theory was developed in order to compute things, or approximate things, that were otherwise out of reach. Until very recently, we had no choice but to rely on analytic methods. The computer has offered to free us and our students from that, but our curriculum is at best in the early stages of accepting the offer…”

“…We need a new curriculum, centered not on the normal distribution, but on the logic of inference. When Copernicus threw away the old notion that the earth was at the center of the universe, and replaced it with a system that put the sun at the center, his revolution brought to power a much simpler intellectual regime. We need to throw away the old notion that the normal approximation to a sampling distribution belongs at the center of our curriculum, and create a new curriculum whose center is the core logic of inference.

What is that core logic? I like to think of it as three Rs: randomize, repeat, reject. Randomize data production; repeat by simulation to see what’s typical and what’s not; reject any model that puts your data in its tail.

*The three Rs of inference: randomize, repeat, reject

1. Randomize data production

• To protect against bias
• To provide a basis for inference

- random samples let you generalize to populations
- random assignment supports conclusions about cause and effect

2. Repeat by simulation to see what’s typical

• Randomized data production lets you re-randomize, over and over, to see which outcomes are typical, which are not.

3. Reject any model that puts your data in its tail…”
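
For readers who would like to see the "randomize, repeat, reject" logic quoted above as a computation, here is a minimal Python sketch of a re-randomization test for a difference in means. The seven outcome values, the seed, the number of re-randomizations, and the function name are all assumptions made for illustration; they are not taken from Cobb's paper.

```python
import random
from statistics import mean


def rerandomization_p(treated, control, n_rerandomizations=10_000, seed=1):
    """Monte Carlo re-randomization P-value for a difference in means.

    Under the model "treatment has no effect on anyone", the outcomes are
    fixed and only the group labels are random, so we re-shuffle the labels.
    """
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    k = len(treated)
    extreme = 0
    for _ in range(n_rerandomizations):          # Repeat
        rng.shuffle(pooled)                      # Randomize (re-allocate labels)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if abs(diff) >= abs(observed) - 1e-12:   # at least as far out in the tails
            extreme += 1
    # Count the observed allocation as one of the possible allocations.
    return (extreme + 1) / (n_rerandomizations + 1)


# Hypothetical outcomes for seven randomized patients (made-up numbers).
new_treatment = [18, 21, 24, 27]
standard = [25, 34, 41]
print(rerandomization_p(new_treatment, standard))  # Reject if this is small
```

With only seven patients there are just 35 possible allocations, so the whole reference distribution could be enumerated instead of simulated; the simulation is used here only because it mirrors the "repeat" step and scales to larger trials.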


Brilliant. Thanks so much posting this, which I had not seen before.

I agree that this is a good way to understand frequentist inference, but I think it doesn’t do justice to the Bayesian perspective that places conditioning on what is known as fundamental.

It is easy to generate useless procedures with frequency guarantees. Sensible use of frequentist methods relies on the likelihood principle in the background. For example:

I don’t know if it is possible, but I wish there was a course that took the perspective of Herman Chernoff – develop an initial Bayesian decision theoretic perspective, then develop frequentist procedures guided by those constraints.

Could you link the podcast?


The original podcast I was referring to in this thread was accidentally not saved. Rumor has it that this topic was so boring that everyone fell asleep and forgot to press “save” :slight_smile:

We thus did a second take here (Titled “Episode 207: Editorial on COSMIC-313”). Thankfully, we recorded this one. The topic is still the same, i.e., to explain some of the concepts described in my editorial here on this trial. But this time the discussion was much less technical and I do not think I even mentioned random sampling vs random allocation anywhere.

However, this more recent video interview does go into more details on this and other topics related to patient care and research. It was motivated by a real-life extremely unusual case we treated that was popularized here. Additional discussion with the actual patient here.


Directly related and linking to this discussion is my just published commentary at the Harvard Data Science Review on the role of sampling in medicine.

This is in response to the recent paper by Michael Bailey on strategies to improve political forecasting. The ideas he discusses (including in his forthcoming book) have implications across science, including medicine where we are similarly tasked to forecast outcomes at the population and individual levels based on best available data and tools.


Hope @Stephen will be able to look at Bailey’s paper.


Tremendously looking forward to his insights!

Hi Pavlos

Thanks- really interesting! The Bailey paper discusses how the “Meng equation” might be useful to improve the accuracy of election polling, accounting more thoroughly for the non-random nature of polls. Many people who answer calls from pollsters will simply hang up. A key problem is that the willingness of people to identify their political affiliation when contacted randomly by pollsters can be correlated with the opinion itself. If a higher proportion of people who hang up will end up voting for political party A versus party B, the poll will be biased.

Bailey notes that true “random sampling” (which occurs when most people who are contacted randomly agree to respond) is the ideal type of sampling for election polling. However, true random sampling virtually never occurs these days because so few people who are randomly contacted are willing to actually respond. Instead, the process that ends up occurring is better described as “random contact.”

Since medical RCTs also deal with “non-representative” samples (specifically, they use “convenience” samples), it’s reasonable to ask whether ideas to improve election polling accuracy could somehow be extrapolated to medical research.

A few questions/comments about your paper, the Bailey paper, and the Meng 2018 paper:

  1. From Box 1 in your article:

“Medical RCTs primarily focus on making comparative causal inferences applicable not to an external population but to all possible replications of the study with similar sampling patterns.”

Can you elaborate on the meaning of the bold phrase?

  2. Also from your paper:

“Bailey (2023) also shows the value of other strategies such as random contact, which produce nonrandom samples but have inferential advantages compared with standard convenience sampling.”

I didn’t see where Bailey discussed convenience sampling (i.e., the type of sampling used in RCTs) in his paper. Your phrasing above might not be implying that he did mention convenience sampling, but rather that you are the one who is contrasting convenience sampling with random contact (?)

In RCTs, we certainly don’t have random sampling (the ideal scenario in the world of election polling), but neither do we have random contact (?) For example, in order to collect a truly random sample of patients with colon cancer, for inclusion in an RCT, we would need a “master list” of all the patients (?in the world) with colon cancer- this would be our overall “population.” We would then need to be able to randomly “pluck” patients off this list, assigning some to the intervention arm of our trial, and others to the control arm. And in this ideal scenario, everybody who was “plucked” would agree to participate in the trial. As we know, this never occurs. In order for the random contact scenario to apply to RCTs, we’d still need the master list of cancer patients that we could use to perform random contact, offering the chance to participate in the trial- in this scenario though, some might decline to participate; only those who agreed to participate would be enrolled. But again, if there’s no master list, there can be no possible random contact (?)

So where does this leave us with the Meng equation? Are you suggesting that the equation could be applied to the results of RCTs, in order to gauge the degree to which the trial’s result (i.e., the between-arm comparison) might be generalizable (?is this the correct term) to the overall population of patients with the disease in question? Or are you wondering if the result could be useful to justify “transporting,” the RCT result to patients who would not have fulfilled inclusion criteria for the trial?

I’m very hazy on what the “R” is supposed to represent in the “data defect correlation” term of the Meng equation. Bailey described it as the “Response Mechanism.” The Meng 2018 paper (which looks like it requires a PhD in statistics to understand) describes it as follows:

“Here the letter “R”, which leads to the R-mechanism, is used to remind ourselves of many possible ways that a sample arrived at our desk or disk, most of which are not of a probabilistic sampling nature. For Random sampling, $R \equiv \{R_1, \ldots, R_N\}$ has a well-specified joint distribution, conditioning on the sample size $\sum_{j=1}^{N} R_j = n$. This is the case when we conduct probabilistic sampling and we are able to record all the intended data, typically unachievable in practice, other than with Monte Carlo simulations (see Section 5.3).

For many Big Data out there, however, they are either self-Reported or administratively Recorded, with no consideration for probabilistic sampling whatsoever. Even in cases where the data collector started with a probabilistic sampling design, as in many social science studies or governmental data projects, in the end we have only observations from those who choose to Respond, a process which again is outside of the probabilistic sampling framework. These “R-mechanisms” therefore are crucial in determining the accuracy of $G_n$ as an estimator of $G_N$; for simplicity, hereafter the phrase “recorded” or “recording” will be used to represent all such R-mechanisms.”

I cringe while saying this (because it’s probably completely off base), but I’m assuming that the process of convenience sampling would somehow have to be represented mathematically by the “R” term (??)

Finally, what would we use for the “N” term and the “n” term in the equation when discussing RCTs?

Apologies in advance if I’ve completely lost the plot here…


Great questions. Appreciate the opportunity to elaborate further. Briefly:

Can you elaborate on the meaning of the bold phrase?

This refers to the features of the hypothetical population that is being sampled, including the centers/locations participating in the RCT, the time period, the willingness of patients to consent, eligibility criteria (especially exclusion criteria) etc. This population will by definition be different than the population that the RCT inference will be used for (for example your patients in clinic). Even if all else is equal, the time period will be different. This is a key argument in favor of concurrent controls in RCTs.

The data defect correlation in the Meng equation ($\rho_{R,Y}$) is useful when thinking about all non-probability samples, including convenience sampling. It is true that my commentary provides a distinct viewpoint that supplements the Bailey article by approaching these methodologies, originally developed for social science applications, from the perspective of experimentalists. The challenges in generalizing and transporting inferences apply to RCTs as much as in other fields.

I’m assuming that the process of convenience sampling would somehow have to be represented mathematically by the “R” term

R denotes whether a patient would be selected into the sample. In the Bailey paper, when R = 1 then a person would be selected into the sample because they would be willing to respond to the poll. When R = 0 then a person would not be willing to respond to the poll and thus, by definition, we cannot know the outcome Y of that person based only on the specific polling study. It will have to be supplemented by external information. When R correlates with Y then the data defect correlation $\rho_{R,Y}$ is different than zero and thus contributes to the error ($\bar{Y}_n - \bar{Y}_N$) between the sample mean and population mean.

Random sampling converts $\rho_{R,Y}$ to approximately zero thus essentially nullifying $\bar{Y}_n - \bar{Y}_N$. But in non-probability sampling, including convenience sampling, $\rho_{R,Y}$ can be different than zero thus increasing $\bar{Y}_n - \bar{Y}_N$. In polling, this can happen when individuals who refuse to answer a poll are more likely to vote for a specific response Y. In medicine, this can happen when individuals who either refuse or cannot enroll in an RCT are more likely to have a good or bad outcome Y.

Finally, what would we use for the “N” term and the “n” term in the equation when discussing RCTs?

While n applies to the sampled population, I propose to choose N depending on the population we wish to make group-specific inferences for.
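
As a rough numerical illustration of the explanation above, the following Python sketch builds a hypothetical finite population and checks the identity from Meng (2018) as I understand it: $\bar{Y}_n - \bar{Y}_N = \rho_{R,Y} \sqrt{(1-f)/f}\, \sigma_Y$, where $f = n/N$ is the sampling fraction and $\sigma_Y$ is the population standard deviation of Y. The population size, the outcome distribution, the seed, and the response rule (people with larger Y are more likely to end up in the sample) are all assumptions chosen only to make $\rho_{R,Y}$ clearly non-zero.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(7)  # hypothetical seed
N = 10_000              # hypothetical population size

# Outcome Y for every unit in the population (never fully observed in practice).
Y = [rng.gauss(0.0, 1.0) for _ in range(N)]

# R_j = 1 if unit j ends up in the sample. Assumed non-random mechanism:
# units with Y > 0 are three times as likely to be recorded.
R = [1 if rng.random() < (0.15 if y > 0 else 0.05) else 0 for y in Y]

n = sum(R)
f = n / N                                        # sampling fraction
mean_N = fmean(Y)                                # population mean
mean_n = sum(y for y, r in zip(Y, R) if r) / n   # sample mean

# Finite-population correlation between R and Y (the data defect correlation).
cov_RY = fmean([(r - f) * (y - mean_N) for r, y in zip(R, Y)])
rho_RY = cov_RY / (pstdev(R) * pstdev(Y))

error = mean_n - mean_N
identity = rho_RY * sqrt((1 - f) / f) * pstdev(Y)
print(error, identity)  # the two numbers agree up to floating-point rounding
```

The catch, of course, is that the sketch can compute $\rho_{R,Y}$ only because it sees Y for the whole population, which a real poll or trial never does. The identity tells us what drives the error, not how to estimate it from the sample alone.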


Thanks for those clarifications. I guess a key question is how we figure out the value of “R” when we’re talking about a convenience sample…


It will not be easy. Some believe it is impossible. But we have to at least try to attack this problem using our best available and emerging tools. I believe the data defect correlation can be one useful framework towards this goal.

My commentary had two key goals:

  1. To highlight that this is a challenge that is certainly not solved by providing standard group-specific estimates from RCTs. This is uncontroversial in the methodology world but not necessarily communicated enough in medical practice. In oncology, we have even reached the point of having high-profile “randomized non-comparative” trials being published. This is an oxymoron to methodologists, similarly to conducting a randomized non-randomized trial, telling a true myth or sharing an original copy.

  2. To suggest exploring the frameworks being developed around the data defect correlation towards the goal of generalizing and transporting RCT inferences.
