Gelman argues there that even if we have data on every single individual in a current population (e.g., all 50 U.S. states or every patient in a specific hospital), we should still use the Bayesian framework by treating that population as a sample from a hypothetical superpopulation. The idea is that we are rarely interested only in the N individuals currently in front of us. Instead, we are usually interested in the underlying data-generating process. The current population is just one realization of that process. By using a Bayesian model, we are making inferences about the parameters of the process that produced that population.
In prediction tasks, the Bayesian framework remains relevant because we are often predicting future units (potential outcomes) or counterfactual scenarios. Even if we have all the data from 2023, we are using it to predict 2024. The 2023 "population" is thus a sample of the possible "years" or "conditions" the system could produce.
An additional point is that Bayesian hierarchical models allow for partial pooling. For example, in a study of all 50 states, the "random effects" for each state aren't just for generalizing to a 51st state. They serve to provide more stable, regularized estimates for the 50 states we actually have. The idea is to use Bayesian models to understand the weaknesses of a fitted model even when the data seems complete.
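The stabilizing effect of partial pooling can be seen even in a minimal empirical-Bayes sketch. Everything below is illustrative: the 50 "state" effects, the noise level, and the shrinkage rule are hypothetical choices, not taken from Gelman's examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 "state" effects drawn from a common process,
# observed with sampling noise (all numbers are illustrative).
n_states = 50
true_state_effects = rng.normal(loc=0.0, scale=1.0, size=n_states)
obs_noise_sd = 2.0
observed = true_state_effects + rng.normal(0.0, obs_noise_sd, size=n_states)

# Empirical-Bayes partial pooling: shrink each state's raw estimate
# toward the grand mean by a factor based on the estimated
# between-state variance relative to the sampling noise.
grand_mean = observed.mean()
between_var = max(observed.var(ddof=1) - obs_noise_sd**2, 0.0)
shrinkage = between_var / (between_var + obs_noise_sd**2)
pooled = grand_mean + shrinkage * (observed - grand_mean)

# The partially pooled estimates are typically closer to the truth
# (lower mean squared error) than the raw per-state estimates.
raw_mse = np.mean((observed - true_state_effects) ** 2)
pooled_mse = np.mean((pooled - true_state_effects) ** 2)
print(raw_mse, pooled_mse)
```

The point is not the particular shrinkage formula (a full hierarchical model in Stan would estimate it jointly) but that the regularization benefits the 50 units we already have, with no appeal to a 51st.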
Since a primary aim of Bayes is to uncover hidden data-generating processes, why does this setting require consideration of a superpopulation? Why are we not content with trying to uncover the data-generating process for the population at hand?
Conversely, if one writes down a DGP in the form of a probability distribution, how could this not be taken to refer implicitly to the sample space underlying that distribution?
That's a very easy question, David. The sample space does not even need to exist for the Bayesian uncertainty distribution to have a complete meaning. Bayes is largely based on a non-frequency interpretation of probability as a measure of information, belief, or veracity.
Oh yes - that's the whole school of subjective probability; e.g., P(World War III will break out) is a one-time event for which frequencies over trials are not meaningful. I like I. J. Good's notions the best. He viewed probabilities as information measures. Consider a single poker game, which is a one-time event. You may have a different probability of getting a certain hand based on your knowledge that a certain card was scuffed when no one else noticed. Your probability is then different from theirs because you have information they don't have. (This was Good's example.) See Glossary of Statistical Terms.
Ah, but I was addressing the probability distribution "under the hood" of the DGP. Setting aside subjective beliefs over the parameters of that distribution, doesn't writing down a probabilistic DGP involve positing a probability triple?
Good question. The argument is that the "data-generating process" is, by definition, a probability model. If we believe the N observations we have were generated by a process, then those N points are just one of an infinite number of possible realizations that the process could have produced. If we only care about the N units, then the "data-generating process" is trivial: it is simply the list of those N values. There is no "hidden" process to uncover because there is no noise or uncertainty; there is only the observed state.
To uncover a generative process, we must assume the N units are a sample from the set of all things that could have been generated. This set of all possible realizations is the superpopulation.
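A tiny simulation makes the point concrete. The DGP below is entirely hypothetical (Normal with mean 10, sd 3); the "population" we observe is just one of many realizations the same process could have produced, each with a slightly different sample mean, while the superpopulation parameter stays fixed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical DGP: outcomes are Normal(mu=10, sd=3). The "population"
# we happen to observe is one realization of n = 1000 draws.
mu, sd, n = 10.0, 3.0, 1000

# Five alternative realizations the same process could have produced:
realizations = rng.normal(mu, sd, size=(5, n))
sample_means = realizations.mean(axis=1)

# Each realization has a slightly different mean; mu is the property
# of the process, not of any single realized population.
print(sample_means)
```

Conditioning on one realization and calling its summary statistics "the parameters" conflates a draw from the process with the process itself, which is exactly the overconfidence risk discussed below.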
Even if we have the whole population (e.g., every person in a country), we rarely have the "whole population" of conditions. There is, for example, temporal uncertainty, whereby we may have the whole population in 2024 but want to understand the process so that we can predict the population in 2025. And there is also counterfactual uncertainty, where we want to know what would have happened, e.g., if a different policy had been implemented affecting the 2024 population. We use the superpopulation to quantify how much the data-generating process might vary across these unobserved states. The superpopulation is the mathematical bridge that allows us to talk about parameters as something that exists independently of the specific, noisy realization we happened to observe.
It is true, though, that instead of combining this frequentism-as-model with Bayesian inference, we could use an "epistemic-probability-as-model" approach that does not need a superpopulation. Gelman's counterargument is that by ignoring the superpopulation, an epistemic model risks being overconfident by mistaking a single realization of a process for the process itself. To call something a "process" implies it could happen again, and "happening again" is the definition of a superpopulation.
Glenn Shafer and others discuss the measure-theoretic and game-theoretic approaches to probability in this paper:
"How to base probability theory on perfect-information games", by Glenn Shafer, Vladimir Vovk, and Roman Chychyla (December 2009). (pdf)
Fermat's method is to count the cases where an event A happens and the cases where it fails; the ratio of the number where it happens to the total is the event's probability. This has been called the classical definition of probability… Pascal's method, in contrast, treats a probability as a price…
I'd be the last one to dispute the subjective theory of probability, but frequency models can help compute the value of this information. This private information might be viewed as a deviation from the fair-value computations based upon either the frequency or game-theoretic interpretation.
In "How the game-theoretic foundation for probability resolves the Bayesian vs. frequentist standoff", by Glenn Shafer (first posted August 2020) (pdf), Shafer writes:
When we make an all-or-nothing bet on an event, it is natural to focus our minds on the frequency of the event in repeated trials, even if these repeated trials are entirely imaginary. But when we make bets that are not all-or-nothing, we are taking a first step beyond frequentism.
More game theoretic probability papers can be found here:
@pavlos I don't find any of that convincing. We are putting probabilities on parameters to encode uncertainties about whatever particular single values they may have. I wish to uncover the parameters that capture the generating process of the one dataset. If I want to reason that these same parameters apply to another set of data that may be assembled, e.g. another health system, all I have to argue is that they have the same data-generating process as the one I analyzed.
That is a huge and quite divisive topic, starting with the first major division between Bayesians (objective versus subjective). Jeffreys goes under the "objective" classification, so the subjectivists will not like it to begin with. Then within the objective-Bayes world there are arguments that Jeffreys priors can often have bad frequentist properties versus matching priors, invariance priors, and reference priors. And so on…
Some of its distinguishing characteristics are that it is derived from the Fisher information matrix (and thus has more direct information-theoretic interpretation), it is invariant under reparameterization, and it is meant to be a non-informative default prior (although it is debatable how non-informative it truly is given its tendency to pull things towards zero). I am personally not against Jeffreys for simple models where we need an invariant default and can easily verify whether the posterior is proper.
But were I to continue wearing my Gelman hat in this thread, these would be some arguments against Jeffreys:
They are often improper (don't integrate to 1), which can lead to improper posteriors, particularly in more complex hierarchical models. That is a fatal problem, because in hierarchical models we want to generate reasonable posterior inferences across the full parameter space.
They are not really "non-informative": they put substantial mass near zero, which can pull variance estimates toward zero in hierarchical models. That is exactly the wrong direction when we have few groups. Instead, Gelman prefers half-Cauchy or (more recently) half-normal priors that gently regularize without being strongly opinionated.
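The contrast near zero can be made numerically explicit. This is a minimal sketch with an arbitrary unit scale for the group-level standard deviation; the densities are written out directly so nothing beyond NumPy is assumed.

```python
import numpy as np

# Grid of candidate group-level standard deviations (illustrative range).
sigma = np.linspace(0.01, 10.0, 1000)

# Proper weakly informative priors with scale 1 (an arbitrary choice):
half_normal = np.sqrt(2.0 / np.pi) * np.exp(-sigma**2 / 2.0)
half_cauchy = (2.0 / np.pi) / (1.0 + sigma**2)

# The improper Jeffreys-style prior p(sigma) proportional to 1/sigma
# blows up as sigma -> 0, piling mass on tiny group-level variation:
jeffreys_unnorm = 1.0 / sigma

# Both proper priors stay bounded near zero...
print(half_normal[0], half_cauchy[0], jeffreys_unnorm[0])
# ...while the half-Cauchy keeps much heavier tails than the
# half-normal, letting the data dominate when variation is large.
print(half_cauchy[-1], half_normal[-1])
```

With few groups, the unbounded spike of the 1/sigma prior at zero is precisely what drags the variance estimate toward zero; the bounded half-normal and half-Cauchy densities avoid that failure mode.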
This is the point I learned best from his work: the classical argument that priors should reflect ignorance (including parameterization invariance) should be flipped for many practical applications, including much of my clinical medicine work (albeit less so for our very high-dimensional correlative analyses etc). The ability to directly encode information through priors is one of the strongest potential advantages of Bayes versus standard frequentist or fiducial inferences (and modern derivatives like conformal prediction etc). Similarly, we should choose a parameterization that is meaningful for our problem and then put sensible priors on those meaningful quantities.
Well said, Pavlos. Instead of specifying a non-informative prior, I like to think of omitting the prior entirely so that Stan doesn't try to add any extra-data information and just normalizes the likelihood. To your excellent point about subjective Bayesians disliking Jeffreys priors: priors need to be subjective, just like the whole model specification does, and the most frequently used prior needs to be one that places constraints on an effect, because that's often all we know. For example, we may know that a treatment was designed to be incremental and not curative, so the prior shouldn't have mass near an odds ratio of zero. A "non-informative" prior will allow arbitrarily ridiculous odds ratios to be considered plausible.
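That constraint can be quantified directly. Below is a minimal sketch, assuming a hypothetical Normal(0, 0.5) prior on the log odds ratio (the scale 0.5 is an illustrative choice, not a recommendation from this thread): it compares how much prior mass a skeptical versus a very diffuse prior places on near-curative effects.

```python
import math

def normal_cdf(x, mu, sd):
    # Standard normal CDF via the error function (no external deps).
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

# Hypothetical skeptical prior on the treatment log odds ratio:
# Normal(0, 0.5), under which very large effects are implausible.
prior_mean, prior_sd = 0.0, 0.5

# Prior probability of a near-curative effect, OR < 0.25
# (i.e., log OR < log 0.25, about -1.386):
p_curative = normal_cdf(math.log(0.25), prior_mean, prior_sd)

# Under a very diffuse "non-informative" prior (sd = 100), nearly half
# the prior mass sits on OR < 0.25 - treating a ridiculous effect size
# as just as plausible as an incremental one:
p_curative_flat = normal_cdf(math.log(0.25), prior_mean, 100.0)

print(p_curative, p_curative_flat)
```

The skeptical prior assigns well under 1% prior probability to OR < 0.25, while the diffuse prior assigns close to 50%, which is exactly the "arbitrarily ridiculous odds ratios" problem.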
I hope that my earlier "did not find that convincing" post above cajoles you into some further thoughts.
Ah yes. My very preliminary thoughts were that your (Bayesian prediction) view appears to align more with the Epistemic-Probability-as-Model (EPaM) view in the distinction I linked above. The "superpopulation" use is a way to provide an aleatory interpretation, consistent with the Frequentism-as-Model (FaM) view. Your statement, "We are putting probabilities on parameters to encode uncertainties about whatever particular single values they may have," is almost a textbook epistemic reading of parameter probability, whereby no superpopulation is needed for the inferential framework to be coherent. The key point, of course, is that both EPaM and FaM are idealizations that do not map perfectly onto reality. Cognitively I have thought much more in terms of FaM (including in my Bayesian modeling) but can see the value of further developing my EPaM muscle. What comes below is an embryonic attempt to do that.
Were I to summarize your position in EPaM terms (let me know if this is correct):
- Parameters are fixed but unknown
- Posteriors quantify subjective uncertainty
- Transportability is a separate scientific judgment about mechanism equivalence
This gives up the automatic license to generalize that comes with the superpopulation framing. Under FaM, if your data is a sample from a superpopulation, you're automatically making inferences about that superpopulation. Under EPaM, generalization requires a separate argument about mechanism invariance. Whether this is a bug or a feature depends on your philosophy of science. Interestingly, the EPaM view may align with our core thesis here (also touched upon in this thread) that we should be explicit about how we think mechanisms transfer across contexts, rather than having this baked into the probability framework via a superpopulation framing.
So perhaps I may end up continuing to use FaM mainly for model building (it just comes naturally to me when making causal inferences from our experimental data etc.) but lean more on EPaM for transportability tasks, certainly when making patient-specific inferences in the clinic (that may be my most epistemic domain currently). Or maybe I will somehow learn to use both simultaneously for all tasks.