Bayesian prediction when you have the whole population?

Hi everyone,

I was watching a lecture from Statistical Rethinking and got stuck on one particular slide about the Bayesian workflow:

In some prediction tasks we effectively have data on the whole population. How should the Bayesian framework be interpreted in this case?

2 Likes

My go-to answer comes from this post in another great Bayesian reference: How does statistical analysis differ when analyzing the entire population rather than a sample?

Gelman argues there that even if we have data on every single individual in a current population (e.g., all 50 U.S. states or every patient in a specific hospital), we should still use the Bayesian framework by treating that population as a sample from a hypothetical superpopulation. The idea is that we are rarely interested only in the N individuals currently in front of us. Instead, we are usually interested in the underlying data-generating process. The current population is just one realization of that process. By using a Bayesian model, we are making inferences about the parameters of the process that produced that population.

In prediction tasks, the Bayesian framework remains relevant because we are often predicting future units (potential outcomes) or counterfactual scenarios. Even if we have all the data from 2023, we are using it to predict 2024. The 2023 “population” is thus a sample of the possible “years” or “conditions” the system could produce.

An additional point is that Bayesian hierarchical models allow for partial pooling. For example, in a study of all 50 states, the “random effects” for each state aren’t just for generalizing to a 51st state. They serve to provide more stable, regularized estimates for the 50 states we actually have. The idea is to use Bayesian models to understand the weaknesses of a fitted model even when the data seems complete.
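To make the partial-pooling point concrete, here is a minimal empirical-Bayes sketch (all numbers invented; a fully Bayesian version would put a prior on the between-state variance rather than plugging in a moment estimate):

```python
import numpy as np

# Toy illustration: partial pooling for 50 "states" under a
# normal-normal model with known within-state sampling noise.
rng = np.random.default_rng(0)
n_states = 50
sigma = 2.0                                    # known sampling sd per state
true_effects = rng.normal(0.0, 1.0, n_states)  # latent state-level effects
y = true_effects + rng.normal(0.0, sigma, n_states)  # raw (no-pooling) estimates

# Empirical-Bayes plug-in: estimate the between-state variance, then shrink
# each raw estimate toward the grand mean in proportion to its noisiness.
mu_hat = y.mean()
tau2_hat = max(y.var(ddof=1) - sigma**2, 1e-6)  # between-state variance estimate
weight = tau2_hat / (tau2_hat + sigma**2)       # shrinkage weight in (0, 1)
pooled = mu_hat + weight * (y - mu_hat)

# The partially pooled estimates are regularized: closer to the truth on
# average than the raw ones, for the 50 states we actually have.
raw_mse = np.mean((y - true_effects) ** 2)
pooled_mse = np.mean((pooled - true_effects) ** 2)
print(raw_mse, pooled_mse)
```

The point of the demo is exactly the one above: the shrinkage is not about a hypothetical 51st state, it stabilizes the estimates for the 50 states in hand.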

5 Likes

How is this not frequentism?

1 Like

Bayesians are frequentists…

My currently preferred method of classification here, with related Gelman post here. :slight_smile:

3 Likes

Since a primary aim of Bayes is to uncover hidden data generating processes why does this setting require consideration of a super population? Why are we not content with trying to uncover the data generating process for the population at hand?

1 Like


Conversely, if one writes down a DGP in the form of a probability distribution, how could this not be taken to refer implicitly to the sample space underlying that distribution?

2 Likes

That’s a very easy question, David. The sample space does not even need to exist for the Bayesian uncertainty distribution to have a complete meaning. Bayes is largely based on a non-frequency interpretation of probability as a measure of information, belief, or veracity.

1 Like

How does this work? Are there formulations of probability theory that don’t invoke the idea of a sample space?

1 Like

Oh yes - that’s the whole school of subjective probability, e.g., P(World War III will break out) concerns a one-time event for which frequencies over trials are not meaningful. I like IJ Good’s notions the best. He viewed probabilities as information measures. Consider a single poker game, which is a one-time event. You may have a different probability of getting a certain hand based on your knowledge that a certain card was scuffed when no one else noticed. Your probability is then different from theirs because you have information they don’t have. (This was Good’s example.) See Glossary of Statistical Terms.

1 Like

Ah, but I was addressing the probability distribution ‘under the hood’ of the DGP. Setting aside subjective beliefs over the parameters of that distribution, doesn’t writing down a probabilistic DGP involve positing a probability triple?

1 Like

Good question. The argument is that the “data-generating process” is, by definition, a probability model. If we believe the N observations we have were generated by a process, then those N points are just one of an infinite number of possible realizations that the process could have produced. If we only care about the N units then the “data-generating process” is trivial: it is simply the list of those N values. There is no “hidden” process to uncover because there is no noise or uncertainty; there is only the observed state.

To uncover a generative process, we must assume the N units are a sample from the set of all things that could have been generated. This set of all possible realizations is the superpopulation.

Even if we have the whole population (e.g., every person in a country), we rarely have the “whole population” of conditions. There is, for example, temporal uncertainty whereby we may have the whole population in 2024 but want to understand the process so that we can predict the population in 2025. And there is also counterfactual uncertainty where we want to know what would have happened, e.g., if a different policy had been implemented affecting the 2024 population. We use the superpopulation to quantify how much the data-generating process might vary across these unobserved states. The superpopulation is the mathematical bridge that allows us to talk about parameters as something that exists independently of the specific, noisy realization we happened to observe.
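A toy simulation of the temporal point (my own numbers, purely illustrative): even when the entire 2024 population is observed, the posterior for the process rate keeps nonzero width, and that residual width is what carries into a 2025 prediction.

```python
import numpy as np
from scipy import stats

# A hypothetical process with true rate 0.3 generates the *entire* 2024
# population of N = 2000 individuals (all numbers invented).
rng = np.random.default_rng(1)
true_rate = 0.3
N = 2000
population_2024 = rng.random(N) < true_rate
k = int(population_2024.sum())

# The 2024 rate is known exactly -- no uncertainty about *these* N people.
rate_2024 = k / N

# But the posterior for the process rate (Jeffreys Beta(1/2, 1/2) prior,
# say) is still a distribution: 2024 is one realization of the process.
posterior = stats.beta(k + 0.5, N - k + 0.5)
lo, hi = posterior.ppf([0.025, 0.975])
print(rate_2024, (lo, hi))  # the interval has nonzero width despite "full" data
```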

It is true though that instead of combining this frequentism-as-model with Bayesian inference, we could use an “epistemic-probability-as-model” approach that does not need a superpopulation. Gelman’s counterargument is that by ignoring the superpopulation, an epistemic model risks being overconfident by mistaking a single realization of a process for the process itself. To call something a “process” implies it could happen again, and “happening again” is the definition of a superpopulation.

4 Likes

process = potentiality

1 Like

Glenn Shafer and others discuss the measure-theoretic and game-theoretic approaches to probability in this paper:

“How to base probability theory on perfect-information games”, by Glenn Shafer, Vladimir Vovk, and Roman Chychyla (December 2009). (pdf)

Fermat’s method is to count the cases where an event A happens and the cases where it fails; the ratio of the number where it happens to the total is the event’s probability. This has been called the classical definition of probability…
Pascal’s method, in contrast, treats a probability as a price…

I’d be the last one to dispute the subjective theory of probability, but frequency models can help compute the value of this information. This private information might be viewed as a deviation from the fair value computations based upon either the frequency or game theoretic interpretation.

In “How the game-theoretic foundation for probability resolves the Bayesian vs. frequentist standoff”, by Glenn Shafer (first posted August 2020). (pdf) Shafer writes:

When we make an all-or-nothing bet on an event, it is natural to focus our minds on the frequency of the event in repeated trials, even if these repeated trials are entirely imaginary. But when we make bets that are not all-or-nothing, we are taking a first step beyond frequentism.

More game theoretic probability papers can be found here:

@pavlos I don’t find any of that convincing. We are putting probabilities on parameters to encode uncertainties about whatever particular single values they may have. I wish to uncover the parameters that capture the generating process of the one dataset. If I want to reason that these same parameters apply to another set of data that may be assembled, e.g. another health system, all I have to argue is that they have the same data generating process as the one I analyzed.

1 Like

Kind of related I guess - what do you all think about Jeffreys Priors?

1 Like

That is a huge and quite divisive topic. Starting with the first major division among Bayesians (objective versus subjective): Jeffreys falls under the “objective” classification, so the subjectivists will not like it to begin with :slight_smile: Then within the objective-Bayes world there are arguments that Jeffreys can often have bad frequentist properties versus matching priors, invariance priors, and reference priors. And so on…

Some of its distinguishing characteristics are that it is derived from the Fisher information matrix (and thus has more direct information-theoretic interpretation), it is invariant under reparameterization, and it is meant to be a non-informative default prior (although it is debatable how non-informative it truly is given its tendency to pull things towards zero). I am personally not against Jeffreys for simple models where we need an invariant default and can easily verify whether the posterior is proper.
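As a quick numerical check of the Fisher-information construction in the Bernoulli case (a sketch of the standard derivation, not from the thread): the Fisher information is I(p) = 1/(p(1−p)), so the Jeffreys prior ∝ √I(p) is exactly Beta(1/2, 1/2).

```python
import numpy as np
from scipy import stats

# Jeffreys construction for a Bernoulli likelihood: the prior is
# proportional to sqrt(I(p)) with I(p) = 1 / (p (1 - p)).
# Its normalizing constant is the Beta function B(1/2, 1/2) = pi.
def jeffreys_pdf(p):
    return np.sqrt(1.0 / (p * (1.0 - p))) / np.pi

grid = np.linspace(0.05, 0.95, 19)
diff = np.abs(jeffreys_pdf(grid) - stats.beta(0.5, 0.5).pdf(grid))
print(diff.max())  # ~0: the sqrt-Fisher-information prior *is* Beta(1/2, 1/2)
```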

But were I to continue wearing my Gelman hat in this thread, these would be some arguments against Jeffreys:

  1. They are often improper (they don’t integrate to 1), which can lead to improper posteriors, particularly in more complex hierarchical models. That is a fatal problem, since in hierarchical models we want to generate reasonable posterior inferences across the full parameter space.

  2. They are not really “non-informative”: they put substantial mass near zero, which can pull variance estimates toward zero in hierarchical models, exactly the wrong direction when we have few groups. Instead, Gelman likes half-Cauchy or (more recently) half-normal priors that gently regularize without being strongly opinionated.

  3. This is the point I learned best from his work: the classical argument that priors should reflect ignorance (including parameterization invariance) should be flipped for many practical applications, including much of my clinical medicine work (albeit less so for our very high-dimensional correlative analyses etc). The ability to directly encode information through priors is one of the strongest potential advantages of Bayes versus standard frequentist or fiducial inferences (and modern derivatives like conformal prediction etc). Similarly, we should choose a parameterization that is meaningful for our problem and then put sensible priors on those meaningful quantities.

3 Likes

Well said Pavlos. Instead of specifying a non-informative prior I like to think of omitting the prior entirely so that Stan doesn’t try to add any extra-data information and just normalizes the likelihood. To your excellent point about subjective Bayesians disliking Jeffreys’ priors, priors need to be subjective just like the whole model specification does, and the most frequently used prior needs to be one that places constraints on an effect, because that’s often all we know. For example we may know that a treatment was designed to be incremental and not curative so the prior shouldn’t have mass near an odds ratio of zero. A “non-informative” prior will allow arbitrarily ridiculous odds ratios to be considered plausible.
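To illustrate that last point with invented numbers (a sketch, not anyone’s actual analysis): compare what a very wide prior and a constraint prior on the log odds ratio say about implausible effects such as OR > 10 or OR < 1/10.

```python
import numpy as np
from scipy import stats

# Two hypothetical priors on a log odds ratio (both sds invented):
wide = stats.norm(0.0, 10.0)      # a "non-informative"-style wide prior
skeptical = stats.norm(0.0, 0.5)  # constraint prior: effect is incremental

# Prior probability of a "ridiculous" effect, OR > 10 or OR < 1/10:
threshold = np.log(10.0)
p_wide = 2 * wide.sf(threshold)
p_skeptical = 2 * skeptical.sf(threshold)
print(p_wide, p_skeptical)  # the wide prior treats absurd ORs as quite plausible
```

The wide prior assigns most of its mass to odds ratios no treatment could plausibly produce, which is the sense in which “non-informative” priors are in fact strongly (and wrongly) opinionated.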

I hope that my earlier “did not find that convincing” post above cajoles you into some further thoughts.

2 Likes

Ah yes. My very preliminary thoughts were that your (Bayesian prediction) view appears to align more with the Epistemic-Probability-as-Model (EPaM) view in the distinction I linked above. The “superpopulation” use is a way to provide an aleatory interpretation, consistent with the Frequentism-as-Model (FaM) view. Your statement: “We are putting probabilities on parameters to encode uncertainties about whatever particular single values they may have.” is almost a textbook epistemic reading of parameter probability whereby no superpopulation is needed for the inferential framework to be coherent. The key point of course is that both EPaM and FaM are idealizations that do not map perfectly onto reality. Cognitively I have thought much more in terms of FaM (including in my Bayesian modeling) but can see the value of further developing my EPaM muscle. What comes below is an embryonic attempt to do that.

Were I to summarize your position in EPaM terms (let me know if this is correct):
-Parameters are fixed but unknown
-Posteriors quantify subjective uncertainty
-Transportability is a separate scientific judgment about mechanism equivalence

This gives up the automatic license to generalize that comes with the superpopulation framing. Under FaM, if your data is a sample from a superpopulation, you’re automatically making inferences about that superpopulation. Under EPaM, generalization requires a separate argument about mechanism invariance. Whether this is a bug or a feature depends on your philosophy of science. Interestingly, the EPaM view may align with our core thesis here (also touched upon in this thread) that we should be explicit about how we think mechanisms transfer across contexts, rather than having this baked into the probability framework via a superpopulation framing.

So perhaps I may end up continuing using FaM mainly for model building (it just comes naturally to me when making causal inferences from our experimental data etc) but lean more on EPaM for transportability tasks, certainly when making patient-specific inferences in the clinic (that may be my most epistemic domain currently). Or maybe somehow learn to use both simultaneously for all tasks :slight_smile:

5 Likes

Thank you both! I’ve learned so much from this thread!

My motivation behind the question is trying to view a frequentist test as a special case of a Bayesian analysis with specific priors.

Let’s say that we are trying to compare the proportions of COPD cases for smokers and non-smokers in an observational setting:

Non-Smokers: 7/10

Smokers: 2/6

Frequentist approach → Fisher Exact Test.

Bayesian Jeffreys Priors approach → Priors Beta(1/2, 1/2)

Bayesian Flat Priors approach → Priors Beta(1, 1)

Does that sound right?
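For concreteness, here is how I would code the two approaches (a sketch; the Monte Carlo comparison of the two Beta posteriors is my own choice of summary, and note the correspondence is not exact, since Fisher’s test conditions on the table margins while the Bayesian analysis answers a different question):

```python
import numpy as np
from scipy import stats

# The toy data above: COPD cases among non-smokers (7/10) and smokers (2/6).
k1, n1 = 7, 10   # non-smokers
k2, n2 = 2, 6    # smokers

# Frequentist: Fisher's exact test on the 2x2 table.
table = [[k1, n1 - k1], [k2, n2 - k2]]
_, fisher_p = stats.fisher_exact(table)

# Bayesian with Jeffreys Beta(1/2, 1/2) priors: independent Beta posteriors.
# For flat Beta(1, 1) priors, replace the 0.5s with 1.0.
rng = np.random.default_rng(0)
p1 = rng.beta(k1 + 0.5, n1 - k1 + 0.5, 100_000)
p2 = rng.beta(k2 + 0.5, n2 - k2 + 0.5, 100_000)
prob_higher = (p1 > p2).mean()  # posterior P(non-smoker rate > smoker rate)

print(fisher_p, prob_higher)
```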

1 Like