Random sampling versus random allocation/randomization- implications for p-value interpretation

ESMD · November 6, 2022, 9:30pm

The questions below are very basic and have probably been asked thousands of times by statistically-naive MDs- apologies in advance. Unfortunately, I keep getting hung up on them and can’t find clear answers in my reading. I’ll understand if there’s no way to explain these concepts in layman’s terms.

1. What exactly does a p value mean in the context of an RCT, given that “random sampling” from an underlying population of interest has not occurred?

This is the definition of a p-value provided in section 5.4 of the BBR text (my emphasis in bold):

“A P-value is something that can be computed without speaking of errors. It is the probability of observing a statistic as or more extreme than the observed one if H0 is true, i.e., if the population from which the sample was randomly chosen had the characteristics posited in the null hypothesis. The P-value to have the usual meaning assumes that the data model is correct.”

The following article discusses inferential problems that arise when random sampling has not occurred:

“When p-values or confidence intervals are displayed, a plausible argument should be given that the studied sample meets the underlying probabilistic assumptions, i.e. that it is or can be treated as a random sample. Otherwise, there are no grounds for using these inferential tools and they become essentially uninterpretable…”

I’m confused…We are never actually using a “random sample” of patients when we conduct an RCT, yet p values are found throughout trial reports. For example, if we want to study the effect of a new chemotherapy drug in patients with colon cancer, we will interview patients with colon cancer as they happen to present for medical care (of their own volition). Only if they meet the inclusion criteria for the trial will we next offer them the chance to be randomized. In turn, only those who agree to participate in the trial will be randomized, either to the new therapy (whose intrinsic efficacy is being tested) or to placebo.

As discussed in this blog, “random sampling” does not occur in the conduct of human experiments. Rather, what occurs is “randomization”- a very different process:

Using the clinical example above, random sampling would require that doctors randomly “pluck” a sample of patients with colon cancer from a master list of ALL patients with colon cancer and then randomly allocate them to one treatment or another. But this isn’t remotely how clinical research works, for the following reasons:

there is no “master list” of all patients in the country (or world) with colon cancer;
even if there were such a master list, we wouldn’t be able to just pluck patients randomly from the list and force them to enter a clinical trial of a new therapy;
many patients with colon cancer live in parts of the world where clinical trials are not conducted.

Some non-clinical people/non-statisticians seem to be under the mistaken impression that random sampling occurs in the design/conduct of RCTs. Could this misunderstanding be rooted in the fact that 1) p values are used widely in the interpretation of RCT results, and 2) the concept of “random sampling” seems to be built into the definition of a p-value?

Given that “random sampling” never actually occurs in the design and conduct of an RCT, how should we define/interpret p values in the RCT context?

2. On a somewhat related note, how exactly is the concept of repetition defined in frequentist statistics?

The following sentence is an excerpt from this piece: Statistical Thinking - My Journey from Frequentist to Bayesian Statistics

“I came to not believe in the possibility of infinitely many repetitions of identical experiments, as required to be envisioned in the frequentist paradigm.”

Given the importance of distinguishing random sampling from random allocation, how should the concept of “multiple hypothetical repetitions” in the frequentist paradigm be viewed? Should we view hypothetical repetitions as involving:

repeat draws of random samples from an underlying population, followed by random allocation of subjects in each sample to one treatment or another? (i.e., multiple experiments conducted on multiple random samples of subjects)?; OR
single non-random recruitment of a group of subjects, followed by repeated randomization of these same subjects to one treatment or another (i.e., multiple experiments conducted on the same group of subjects)?; OR
repeated non-random recruitment of different groups of subjects, followed by random allocation of each group of subjects to one treatment or another (i.e., multiple experiments conducted on multiple non-random samples of subjects)?

R_cubed · November 6, 2022, 9:55pm

I remember posting a link that discusses this in a very old thread:

The textbooks discuss the distinction between model based inference (ie. 2 sample t-test), vs design based inference (again either 2 sample t test or permutation test). The design gives justification to the calculation of a p value in an RCT.

Pavlos_Msaouel · November 6, 2022, 10:03pm

Just did a podcast on the topic aimed towards clinicians, which should come out soon. Briefly, random sampling physically licenses the use of measures of uncertainty such as standard errors (SEs), confidence intervals (CIs) and p-values for (sub)groups of sampled patients. Conversely, random allocation physically licenses the use of measures of uncertainty for the differences between the allocated groups. Measures of uncertainty are used when we are making inferences. When those are not expected to be valid then we can use descriptive measures and not do inferences.

For example, in an RCT the random treatment allocation licenses the use of measures of uncertainty for hazard ratios, odds ratios, risk ratios, median/mean survival difference, absolute risk reduction etc that measure differences between groups. Because there is no random sampling, measures of uncertainty are not licensed by the randomization procedure for cohort-specific estimates such as the median survival observed in each treatment cohort. For those, we can use descriptive measures such as standard deviation (SD), interquartile range etc. Measures of uncertainty will require further assumptions to be considered valid. Further discussion here.

ESMD · November 6, 2022, 11:33pm

Thanks Pavlos. I like the way you have explained things above- very clear as usual. I’d be interested to listen to your podcast when it comes out- maybe you can link to it here (?)

My difficulty grasping these concepts surely stems from my limited formal training in stats/epi. Having pleaded ignorance though, I will say that formal definitions of statistical terms leave a lot to be desired, often confusing rather than enlightening beginning students. That idea of “random sampling” seems pretty engrained in the p value definition- is it any wonder that nonclinical people sometimes seem confused about how RCTs work?

Pavlos_Msaouel · November 6, 2022, 11:42pm

Yup, will link to the podcast when it is published.

It is randomization and not only random sampling that was a key component of traditional frequentist inferences in the 20th century. Randomization may be random sampling or random allocation. The latter can be either natural (as in Mendelian randomization) or experimental as in RCTs. There is a very nice recent overview of the frequentist, Bayesian and fiducial paradigms here. Also, controversies regarding how much we can rely solely on the randomization procedure to license statistical inferences existed from the very beginning between Fisher and Neyman as nicely recounted here.

ChristopherTong · November 7, 2022, 12:05am

A useful lens I have used in thinking about this topic is the distinction between internal and external validity (Campbell, 1957 and later publications with coauthors). Internal validity refers to the ability to attribute a causal link to the intervention, by managing sources of confounding (random allocation, blinding, and concurrent control being some of the tools to accomplish this). External validity refers to the ability to generalize from a sample to the population. Random sampling is a mechanism to achieve this. In the clinical trials literature, similar distinctions have been made (eg, Schwartz & Lellouch 1967, who refer to explanatory vs pragmatic, respectively). See also

Turning to the question at hand:

Mead, et al. (2012) write that “In designed experiments, the very careful control which is exercised often makes it difficult to identify a population for which the sample is relevant.” They raise the question “of whether the conclusions from such rigidly controlled experiments have any validity for future practical situations. This is a difficult question, which extends beyond statistics” (p. 233).

Similarly, Simon (2006) writes that “Researchers, trying to minimize variation, will use exclusion criteria to create more homogeneous groups…If it is difficult to extrapolate results from a very tightly controlled and homogeneous clinical trial to the variation of patients seen in your practice, then the research has limited value to you” (p. 39).

Senn (2007) seems to echo @Pavlos_Msaouel . He writes, “In experiments, provided that we obtain suitable material, we rarely worry about its being representative. Inference is comparative and…this is generally the appropriate attitude for clinical trials” (p. 40).

I credit the randomization test community for thinking carefully about this issue, as @R_cubed alluded to already. They seem to have provided the most uncompromising answer I’ve seen so far. Edgington and Onghena (2007) state that “statistical inferences about treatment effects must be restricted to the subjects (or other experimental units) used in an experiment” (p. 8) due to the inability to randomly sample subjects from the population. “Inferences about treatment effects for other subjects must be nonstatistical inferences-inferences without a basis in probability. Nonstatistical generalization is a standard scientific procedure. We generalize from our experimental subjects to individuals who are quite similar in those characteristics that we consider relevant.” Finally, “the main burden of generalizing from experiments always has been, and must continue to be, carried by nonstatistical rather than statistical logic.”

Finally, one could push the questionable logic of statistical inference all the way off the cliff. Berk & Freedman (2005) wrote that “inferences to imaginary populations are also imaginary.”

Historically, Fisher’s century-old classic (Fisher 1922) founded statistical inference on random sampling from a population model, but he was also an advocate of random assignment in designed experiments (see his 1935 book Design of Experiments and other writings). I think Austin Bradford Hill was the first to take these ideas into the clinical trials arena.

References:

R. Berk and D. A. Freedman, 2005: Statistical assumptions as empirical commitments. In Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger, 2d ed., edited by T. G. Blomberg and S. Cohen (Aldine, 235-254).

D. T. Campbell, 1957: Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54: 297-312.

E. S. Edgington and P. Onghena, 2007: Randomization Tests. Fourth edition. Chapman & Hall/CRC.

R. A. Fisher, 1922: On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London A, 222: 309-368.

R. Mead, S. G. Gilmour, and A. Mead, 2012: Statistical Principles for the Design of Experiments: Applications to Real Experiments. Cambridge University Press.

D. Schwartz and J. Lellouch, 1967: Explanatory and pragmatic attitudes in therapeutic trials. Journal of Chronic Diseases, 20: 637-648. Reprinted in Journal of Clinical Epidemiology, 62: 499-505.

S. Senn, 2007: Statistical Issues in Drug Development. Second edition. Wiley.

S. D. Simon, 2006: Statistical Evidence in Medical Trials: What do the Data Really Tell Us? Oxford University Press.

f2harrell · November 7, 2022, 1:20pm

What an amazing discussion. I hope @Stephen Senn will join it. I believe he’s stated that p-values are appropriate for non-random samples when you are making relative comparisons. Historically, agricultural field experiments provided much of the motivation for frequentist statistics, and plots of land are not selected at random from the face of the earth.

Once again Bayesian thinking cuts through a lot of complexity. With Bayes one considers the uncertainty distribution of the treatment effect that generated the dataset at hand. The study design is what makes you possibly able to assume that the data-generating-effect is a generalizable treatment effect for other samples or not.

Harry_TB · November 7, 2022, 2:27pm

I have found @Stephen Senn’s talks on randomisation in clinical trials to be especially insightful:

ESMD · November 7, 2022, 2:27pm

Randomization may be random sampling or random allocation.

The phrase in block quotes above serves to justify the fact that we use p values in RCTs. But Dr.Senn’s blog piece linked above emphasizes how important it is to not confuse random sampling with random allocation. In the first instance (random sampling) we seem to have access to the entire population of potential interest, whereas in the second instance (randomization that occurs in the context of an RCT), we are using a convenience sample that has presented for medical care and is willing to participate in a trial, and then assigning treatments “at random” to this sample.

If it’s kosher to use p values in the second instance (i.e., convenience sample/RCT context), even though the formal p value definition invokes the idea of random “sampling,” then shouldn’t this be stated clearly and with justification in introductory stats texts?

I suspect that much of the confusion among physicians learning critical appraisal stems from the fact that our teachers are often (?usually) practising MDs, who might not understand the historical foundations of statistics well enough to address questions like this. As a result, these types of questions get glossed over.

R_cubed · November 7, 2022, 3:51pm

I’ve found @Sander_Greenland’s many papers on this issue helpful. I’ve posted a number of open access links to his papers in this thread. For example:

A p-value, from a pure logical perspective, is a continuous, quantitative measure of compatibility or surprise from an asserted, conjectural value. After data collection, it is a percentile in the hypothetical distribution. Where this hypothetical distribution comes from depends on the context. It can be from a population model, or as you mention in 2, second bullet point above

single non-random recruitment of a group of subjects, followed by repeated randomization of these same subjects to one treatment or another (i.e., multiple experiments conducted on the same group of subjects)?

This is exactly how permutation tests work. Under a strong default assumption of no effect (and an assumption the groups are exchangeable), re-label the data to compute an empirical no effect (null) distribution, and compare your observed results to this distribution. There are no modelling assumptions here (the assumptions are accepted as true because they are constructed that way by design) which makes it a powerful method of detecting effects.
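The re-labelling idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming a difference in means as the test statistic and a sample small enough to enumerate every possible assignment; the function name `exact_permutation_pvalue` and the HbA1c numbers are made up for the example and do not come from any trial.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(treated, control):
    """Two-sided exact p-value for a difference in means under re-labelling."""
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    n_control = len(pooled) - n_treated
    total = sum(pooled)
    observed = mean(treated) - mean(control)

    as_extreme = 0
    n_assignments = 0
    # Enumerate every way the "treated" label could have been assigned.
    for idx in combinations(range(len(pooled)), n_treated):
        t_sum = sum(pooled[i] for i in idx)
        diff = t_sum / n_treated - (total - t_sum) / n_control
        n_assignments += 1
        # Small tolerance so floating-point noise does not drop exact ties.
        if abs(diff) >= abs(observed) - 1e-12:
            as_extreme += 1
    return observed, as_extreme / n_assignments


# Hypothetical changes in HbA1c (percentage points); invented numbers.
drug_a = [-1.2, -0.8, -1.5, -0.3, -0.9]
drug_b = [-0.2, 0.1, -0.4, 0.3, -0.1]

diff, p = exact_permutation_pvalue(drug_a, drug_b)
print(f"observed difference = {diff:.2f}, exact p = {p:.4f}")
```

Nothing here refers to a population: the reference distribution is built entirely from the 252 ways these ten subjects could have been split five and five.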

Pavlos_Msaouel · November 7, 2022, 5:13pm

Random sampling and random allocation are randomization procedures that respectively demarcate two distinct great statistical fields: sampling theory and experimental design. Valid p values can in theory be estimated in both these fields. As @ChristopherTong pointed out, experimental design focuses primarily on internal validity, whereas sampling theory is more interested in external validity.

Say we want to know the average hemoglobin A1c (HbA1c) levels in the population of people living in Houston. It is technically not feasible to measure HbA1c in all 2.3 million people living there. Instead we can randomly sample a representative group of individuals (called a “sample”) and measure their HbA1c levels. We can describe the values for the sample using descriptive measures (standard deviation, interquartile range, min-max range etc) but can also make inferences from this sample about the whole population. For example, we can test the hypothesis from this sample that the average HbA1c of people living in Houston is greater than 7.0% and generate a p-value and frequentist confidence intervals.
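As a rough sketch of the random sampling case, the following Python simulates a population, draws a simple random sample, and computes a one-sided p-value and a 95% confidence interval for the mean. Everything numeric is an assumption made for illustration: the population is simulated rather than real Houston data, it is smaller than 2.3 million to keep the run fast, the sample size of 400 is arbitrary, and a normal approximation is used in place of a t distribution.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(2022)

# Hypothetical population of HbA1c values (%); simulated, not real data.
population = [rng.gauss(7.2, 1.1) for _ in range(200_000)]

# Simple random sampling: every member has the same chance of selection.
sample = rng.sample(population, 400)

n = len(sample)
xbar = mean(sample)
se = stdev(sample) / n ** 0.5  # standard error of the sample mean

# One-sided test of H0: population mean <= 7.0 against mean > 7.0,
# using a normal approximation to the sampling distribution.
z = (xbar - 7.0) / se
p_value = 1 - NormalDist().cdf(z)

# Approximate 95% confidence interval for the population mean.
z95 = NormalDist().inv_cdf(0.975)
ci = (xbar - z95 * se, xbar + z95 * se)

print(f"sample mean = {xbar:.2f}, SE = {se:.3f}")
print(f"one-sided p = {p_value:.4g}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Here the p-value and interval are statements about the whole simulated population, and it is the random draw in `rng.sample` that justifies them.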

Say now that we want to know whether a new drug A changes HbA1c levels compared with a control drug B. Our tested hypothesis here can be the null hypothesis that drug A and drug B do not differ in their effect on HbA1c. We take a group of patients and randomly assign them to either drug A or drug B, thus creating two patient cohorts: those assigned to drug A and those assigned to drug B. We can use here the term “cohorts” instead of patient “sample” to distinguish random sampling from random allocation. The random allocation (random treatment assignment) licenses us to generate valid p-values for inferences regarding differences between cohort A and B. This is how p-values testing the null hypothesis of no difference are justified in RCTs.

In the frequentist paradigm, we have a fixed estimand θ which can generate a distribution of data Xn (X1, X2… Xn). The random procedure (random sampling or random allocation) allows us to assume a random distribution of the data Xn in the long run. We can then test how compatible the observed data Xn = x are with this expected distribution. This is where the concept of likelihood comes from.

In the Bayesian paradigm, it is the observed data X that are fixed and the distribution θn (θ1, θ2…θn) is stochastic. To assign a probability distribution to θn = θ (i.e., some value of θ that is of interest) we need to first start with a prior distribution for θn.

In a third paradigm called fiducial inference, largely abandoned but still potentially useful in developing data analysis intuition, neither θn nor Xn are fixed but rather it is assumed that each dataset Xn corresponds to the estimand θn and thus θ1 generates X1, θ2 generates X2 and so on.

ChristopherTong · November 8, 2022, 7:39am

I share @ESMD’s worry that this type of discussion is “glossed over” when users and consumers of statistical inference are taught. I strongly suspect it is even glossed over in the training of many professional statisticians - it certainly was when I received my statistics education! So far, none of us on this thread has yet cited an intro stats book discussion of these seemingly important points…?

f2harrell · November 8, 2022, 11:42am

Perhaps the main point is easier to see in a crossover study. Take the case of a randomized A-B crossover. Within-subject B-A differences are assumed to be exchangeable with the differences one would observe had a random sample of subjects from a population been studied.

ESMD · November 9, 2022, 2:50am

Thanks to everyone for their responses. I really want to be sure that I understand what you’re saying here. Are you saying that the phrase “infinitely many repetitions of identical experiments” refers, when we use frequentist statistics to interpret RCT results, to the hypothetical distribution of between-arm differences in outcome that we would see if we were to simply re-randomize the same group of subjects (i.e., the single group of patients who agreed to be part of our trial) over and over again to one treatment arm or the other?

My understanding (maybe incorrect?) is that the concept of “repetition” is invoked because we are trying to somehow make sense of the results of the single experiment/RCT that we have just conducted. Specifically, we want to know how “surprised” we should be when looking at the result of our experiment. We start by saying to ourselves: “If this treatment that we are testing is a dud, then, if we had the ability to repeat this experiment (with these same subjects) many times, we would see a certain distribution of between-arm differences in outcome.” Once we agree on the general characteristics of this hypothetical “dud” distribution, we now have a frame of reference for interpreting the result of the single experiment that we have just conducted (?) If the between-arm difference in outcome that we have just observed in our single experiment/RCT is of a sufficiently large magnitude, then it will fall in one of the “tail ends” of our hypothetical “dud” distribution- the p value for the RCT reflects where in the hypothetical “dud” distribution our particular experimental result lies.

I hope I haven’t mangled this too much…But if this is a reasonable description of what a p value means in the context of interpreting an RCT result, then why does the concept of “random sampling” even have to be part of the formal definition of a p value? Doesn’t attaching the term “sampling” just create confusion among beginning students, given that it seems kosher to use and interpret p values even in the absence of random sampling?

Addendum- I just came across this article, which seems to speak directly to the questions in the original post. [1708.02102] Reallocating and Resampling: A Comparison for Inference

Here are a couple of excerpts:

While reallocating originated for testing after random allocation and resampling originated for estimation after random sampling, both methods are widely used often irrespective of the data collection method or inferential goal. Moreover, while authors are in general agreement about the use of reallocating for tests based on experiments and the use of resampling for estimation on random samples, there is discrepancy as to when (or whether) it matters, and whether the data collection method or the inferential goal should take priority if the two don’t align. To the best of our knowledge, this appears to be a void in the literature that deserves to be addressed…

…Many note that random allocation and random sampling lead to two fundamentally different modes of inference, given different names by different authors: experimental versus sampling inference (Kempthorne, 1979), randomization versus population inference (Ludbrook, 1995), finite sample versus super population inference (Imbens and Rubin, 2015), and permutation versus population model (Berry et al., 2014). The former, stemming from random allocation and addressing only the sample at hand, originated with Fisher (1935, 1936). The latter, stemming from random sampling and generalizing to a larger population, originated with Neyman and Pearson (1928). Kempthorne (1979) argues that it is misleading to refer to both using the same single word “inference”. Here we are explicit about this distinction in our notation.

As usual, the mathy bits of stats papers are incomprehensible to me. But my hazy impression, after reading the narrative sections, is that the concept of “random sampling” maybe shouldn’t be presented in introductory stats texts as intrinsic to the definition of a p value (?)…And maybe somebody really smart should someday translate the historical references and key messages from this paper into a language that beginning students can understand?

ChristopherTong · November 9, 2022, 7:06am

The random sampling model of statistical inference is actually applicable to major areas of statistics, such as survey methodology, statistical quality control, ecological abundance studies, etc…just not to RCTs and many designed experiments. (There are types of epidemiological studies such as medical surveys that are closer to the random sampling model though.) Unfortunately the standard textbooks tend to introduce statistical inference using only the random sampling model, and never get around to explaining the random assignment model and the different interpretation that attends it. Thus it seems to me that the textbooks are carrying out a bait-and-switch. Why? Is it because most stat methods (except for randomization tests/permutation tests/exact tests) are derived from a random sampling paradigm? Is it even possible to recast all such derivations in a random assignment setting?

In actual RCTs people don’t often use randomization tests, whose interpretation fits the random assignment design of the study, but rather we use conventional frequentist methods that are derived from random sampling concepts. Edgington and Onghena (cited above) seem to suggest that we do this in order to approximate the p-value from the randomization test that we should have done but didn’t, thus the accompanying random sampling interpretation of the resulting p-value isn’t relevant. The Berger et al paper posted by @R_cubed (in his initial post on this thread) seems to be arguing along similar lines.

It seems to me that promoters of randomization/permutation/exact tests (Lehmann’s 1975 nonparametrics book might be a rare exception) are the most willing to even broach this topic. Do those who are not in the permutation test camp agree that our methods are just approximating their methods? And if so, why don’t our courses and books admit it? And if not, is it actually possible to derive all conventional frequentist methods purely from a random assignment paradigm? If yes, why aren’t approximately half our textbooks doing that instead of random sampling? Now maybe I’m the one who is confused.

f2harrell · November 9, 2022, 12:06pm

But it needs to be pointed out again how out of place a sampling frame of mind is for the problems you mentioned other than surveys. Bayes’ attempt to reveal hidden data generating mechanisms is the more applicable approach and is much, much cleaner in terms of understanding concepts IMHO.

R_cubed · November 9, 2022, 3:41pm

This is an interesting question, because even if one is an ardent supporter of permutation tests for RCTs, for a large enough sample, exhaustive enumeration of all permutations is too computationally complex.

What is done then? Take a random sample of possible permutations!

The statistician EJG Pitman invented these tests long before we had computer power to conduct them.

Getting back to the last question by @ESMD:

You have the right idea. You have gone out and recruited a sufficient number of patients you wish to test an intervention on. You randomly assign them to treatment vs. control, then compute a group level difference.

To do the permutation test, simply use a computer program to take your observed data, switch the labels (tx vs control) on the observations, compute a difference, and store the result. If you can compute all possible permutations, you have effectively re-done your experiment under the no effect hypothesis.

If your sample is too big for exhaustive enumeration, my comment above applies. You might think of your observed sample as a member of a population of all possible permutations of treatment assignments.
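When enumeration is out of reach, sampling the re-labellings might look like the Python sketch below. It is an illustration under stated assumptions rather than a reference implementation: the function name `monte_carlo_permutation_pvalue`, the default of 10,000 shuffles, and the simulated trial data are all hypothetical, and the add-one correction in the p-value is one common convention among several.

```python
import random


def monte_carlo_permutation_pvalue(treated, control, n_shuffles=10_000, seed=0):
    """Two-sided p-value from a random sample of treatment re-labellings."""
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    n_c = len(pooled) - n_t
    observed = abs(sum(treated) / n_t - sum(control) / n_c)

    hits = 0
    for _ in range(n_shuffles):
        # One random re-assignment of the tx/control labels.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c)
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction: the assignment actually observed counts as one
    # draw from the null distribution, so the estimate is never exactly 0.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical trial: 30 patients per arm, outcomes simulated for illustration.
data_rng = random.Random(42)
tx = [data_rng.gauss(-0.9, 0.6) for _ in range(30)]
ctl = [data_rng.gauss(-0.1, 0.6) for _ in range(30)]
print(f"Monte Carlo permutation p = {monte_carlo_permutation_pvalue(tx, ctl):.4f}")
```

The only sampling in this procedure is the sampling of label shuffles; the 60 patients themselves are still treated as a fixed, non-randomly recruited group.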

So, while I think the random sampling vs random assignment distinction is very important, and that it is misleading to link p values exclusively to a sampling frame of reference, sampling theory methods are useful and can show up in surprising places.

ChristopherTong · November 10, 2022, 3:35am

Frank, your first sentence seems to be the point that @ESMD is making and that I am supporting. It’s worth thinking about since the medical literature is predominantly frequentist, so we have to engage with it.

The sampling paradigm is used to derive most frequentist stat methods, but these are then applied to data from randomized trials with no justification. This is a non sequitur. The permutation test people offer a rescue - if we can only admit that we are doing this to approximate the “correct” (to them) approach of using randomization tests, which are derived from a random assignment paradigm. Is this the right way to think about why the medical literature is the way it is? If so then @ESMD is correct that the textbooks should clearly state it, and by not doing so, they cultivate confusion.

My above comment proposes an alternate explanation- that it is possible to derive standard frequentist methods from a random assignment paradigm, but I’ve never seen this done, and I’m not smart enough to attempt it myself.

The third (and most cynical) explanation is that this is all just an insincere bait-and-switch by those who teach and write about frequentist stats. I hope there are other explanations that I didn’t think of.

It is worth quoting the final paragraph of the S. Senn blog that is cited in the OP:

What both methods have in common is that there is a theory to relate the inferences that may be made to the way that allocation (experiments) or selection (surveys) has been conducted. Whether the design should dictate the analysis, whether the intended analysis should guide the design or whether some mixture of the two is appropriate, is a deep and fascinating issue on which I and others more expert than me have various opinions but this blog has gone on long enough, so I shall stop here.

One way to read between the lines here is that current practice mismatches analysis to design - using sampling-based inference to analyze randomization-based experiments. I would welcome clarification from him though.

ChristopherTong · November 10, 2022, 3:38am

Very good point, and this seems to be a more honest way to approximate a permutation test than what is usually done in practice, no?

ChristopherTong · November 10, 2022, 4:05am

Thank you @ESMD for the addendum regarding Lock Morgan’s paper. I should have known that a member of the Lock family has thought deeply about these points (the Lock^5 intro stats book is based on both randomization tests and bootstrap, rather than conventional parametric tests - link below). She seems to be supporting the view that sampling-based methods and randomization-based methods can approximate each other, as they are asymptotically equivalent for large samples. So the “bait and switch” is defensible, if only authors and teachers would take the time to talk about this.
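The large-sample agreement can be checked informally with a small simulation. The Python below is a sketch on one simulated dataset, not a demonstration of the asymptotic result: the arm size of 100, the effect size, the seed, and the use of a two-sample z test as the stand-in for a sampling-based method are all assumptions chosen for the example.

```python
import random
from statistics import NormalDist, mean, variance

rng = random.Random(1)
n = 100  # patients per arm (hypothetical)
arm_a = [rng.gauss(0.3, 1.0) for _ in range(n)]
arm_b = [rng.gauss(0.0, 1.0) for _ in range(n)]

observed = mean(arm_a) - mean(arm_b)

# Sampling-based p-value: two-sample z test, derived as if each arm
# were a random sample from its own population.
se = (variance(arm_a) / n + variance(arm_b) / n) ** 0.5
p_sampling = 2 * (1 - NormalDist().cdf(abs(observed) / se))

# Randomization-based p-value: re-shuffle the treatment labels on the
# same 200 subjects and see how often the difference is as extreme.
pooled = arm_a + arm_b
n_shuffles = 5000
hits = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = sum(pooled[:n]) / n - sum(pooled[n:]) / n
    if abs(diff) >= abs(observed):
        hits += 1
p_randomization = (hits + 1) / (n_shuffles + 1)

print(f"sampling-based p      = {p_sampling:.4f}")
print(f"randomization-based p = {p_randomization:.4f}")
```

With arms of this size the two numbers should typically land close together, which is the sense in which one method can stand in for the other; with small or very skewed samples they can diverge more.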

The Lock Morgan paper is excellent and a badly needed source of illumination. I will nitpick one point though: she attributes the random sampling framework to Neyman and Pearson 1928, but Fisher wrote about it in 1922.