Illogic of Weighting While Estimating Model Parameters

Consider a situation where some types of subjects are over- or under-sampled from a population where the sampling probabilities are known. For example, to increase efficiency we might sample a greater proportion of symptomatic patients than we sample symptom-free patients.

There has always been a division between survey research statisticians and non-survey statisticians. Survey statisticians demand that sampling weights be used when analyzing data. In a totally different situation, advocates of propensity scores often want to incorporate inverse propensity weights into regression coefficient estimation, which is sure to be inefficient. In the sample survey sphere, incorporating sampling weights allows one to make estimates targeting an unweighted population with its own covariate distribution (but how does one interpret estimates when a weighting variable is also a covariate?). Weighted maximum likelihood estimation (MLE) is not MLE, however. It does not respect the sample sizes of the different strata when the weights vary. If females were oversampled, weighted estimation downweights females in the data and upweights males. The lack of respect for the original sample size for each sex is problematic; e.g., it increases the variances of the estimates.
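The variance penalty from unequal weights can be quantified with Kish’s effective sample size, n_eff = (Σw)² / Σw², which is always ≤ n and equals n only when the weights are constant. A minimal sketch, with hypothetical oversampling fractions chosen for illustration:

```python
# Kish's effective sample size: (sum of weights)^2 / (sum of squared weights).
# With unequal weights n_eff < n, i.e., the weighted estimate behaves as if
# it came from a smaller sample than the one actually collected.
def effective_sample_size(weights):
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Hypothetical sample: 60 females and 40 males drawn from a 50/50 population.
# Weight for each subject = (population share) / (sample share).
weights = [0.5 / 0.6] * 60 + [0.5 / 0.4] * 40

n_eff = effective_sample_size(weights)  # about 96, versus an actual n of 100
```

With equal weights n_eff equals n exactly, so the efficiency loss appears only when the weights vary.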

Non-survey statisticians, especially biostatisticians and Bayesian modelers, tend to be conditionalists, i.e., they target the estimation of Y for a specific set of covariate values. For that purpose, MLEs are derived from unweighted analyses, and the estimand corresponding to predicted values is extremely easy to define. In a model containing only sex, where the reference cell is females, the intercept estimates mean Y for females in the population, and the intercept plus the sex regression coefficient estimates E(Y) for males in the population. To estimate the unconditional E(Y) in the population, a conditionalist weights the two predicted values by the population sex proportions and sums them. The regression coefficient estimates are fully efficient MLEs. (This is not so interesting for a single-covariate model, but the argument becomes more relevant when there are multiple covariates.)
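The sex-only recipe above can be sketched in a few lines of Python; the per-sex sample means play the role of the fitted intercept and intercept-plus-coefficient, and all numbers are hypothetical:

```python
# Unweighted MLE in a sex-only model: the fitted values are the per-sex
# sample means (intercept = female mean; intercept + sex coefficient
# = male mean). All numbers are hypothetical.
female_y = [4.0, 5.0, 6.0]   # oversampled stratum
male_y = [7.0, 9.0]

mean_f = sum(female_y) / len(female_y)   # estimate of E(Y | female)
mean_m = sum(male_y) / len(male_y)       # estimate of E(Y | male)

# For the unconditional E(Y), weight the *predicted values* by the known
# population sex proportions -- not the raw observations.
pop_female, pop_male = 0.5, 0.5
marginal_mean = pop_female * mean_f + pop_male * mean_m  # 6.5
```

The weighting happens after the fully efficient unweighted fit, at the prediction stage.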

How does one interpret the intercept in a weighted model? It’s not easy.

I urge analysts to use maximum likelihood estimation, i.e., to use fully conditional models that include the weighting factors as additional covariates. Then the parameter estimates will be fully efficient, and predicted values can be used to obtain any quantity of interest, including population-weighted marginal estimates.

I would appreciate some discussion.


I know that not all survey statisticians demand that weights be used. Two decades ago, I remember an NCSU sampling class where the professor suggested the model-based approach “was better.” I believe he was speaking about the approach where you throw away the weights, model the data, and then predict the values that weren’t in the sample. In a different context, a Bayesian remarked that sampling was an area where Bayesian and Frequentist ideas intersect. I wonder if making inferences about unseen values of the data brings us all together (or if he was talking about something else).

An archetype of a person who might insist on using weights is the Automated Survey Analysis Maintainer, say the one over at data.census.gov. No need to model the thousands of variables and update those models every year as new data arrive. Plug in the common set of weights, do a dot product, and you have unbiased estimation of a finite population total. It’s high-throughput “valid” estimation. I think this archetype might be onto something.
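That dot product is the design-based (Horvitz–Thompson) estimator of a finite-population total: multiply each observation by the inverse of its sampling probability and sum. A minimal sketch with made-up numbers:

```python
# Design-based estimate of a finite-population total: a dot product of the
# survey weights (inverse sampling probabilities) with the observed values.
# All numbers are made up.
y = [12.0, 7.0, 30.0, 5.0]       # observed values for the sampled units
w = [100.0, 100.0, 50.0, 50.0]   # w_i = 1 / Pr(unit i is sampled)

total_hat = sum(wi * yi for wi, yi in zip(w, y))  # 3650.0
```

No model, no yearly refit: the same weight vector serves every variable in the file, which is exactly the Maintainer’s appeal.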

Another is what I call The Rule Follower. “No, you have to use the weights, or estimates will be biased. They come with the survey. You can’t just throw them away!” It’s impossible to argue with this archetype.

There’s The Tinkerer. “I’m gonna weight by population size. See, there’s a weights argument in lm I can plug it into.” It’s difficult to argue with this archetype, because you don’t share a common framework.

Finally, there’s The Calibrator. “I’m gonna weight so my estimates match this auxiliary information.” I’ve been struggling to find satisfying theory here, so it feels like a cross between The Tinkerer and The Survey Statistician.


Welcome to datamethods Ben!

Those are great insights and cover a lot of angles. The only part I take issue with is someone saying “No, you have to use the weights, or estimates will be biased.” The idea that estimates not using weights are biased is not justified by any statistical theory I’ve heard of. So I would find it very possible to argue with that archetype. And I’d like to ask the perpetrator to interpret the intercept in a weighted model fit …


Thanks @f2harrell! Cool space you’ve set up here!

I’m with you on the bias properties. I was more trying to illustrate the compulsion of otherwise competent people to use the weights, lest something bad happen. It’s like they just got a new board game, you threw away half the pieces, and now it’s time to play.

On design-unbiasedness, I often think about Basu’s circus elephant example. Things can get really bad under that umbrella.

I asked Gemini about what that Bayesian might have meant about the Frequentist and Bayesian perspectives aligning in finite population sampling. Curious whether you agree.

In this predictive framework, the distinction between estimating a fixed parameter and predicting a random variable becomes blurred. The unobserved values in the finite population can be treated as random variables from a Bayesian perspective, and the models used to predict them often lead to estimators that are very similar, or even identical, to those derived from frequentist methods.

In a predictive setting, the data you have provides direct evidence about what future data will look like. This evidence creates a strong likelihood function[, overwhelming priors]


Very interesting … and too deep for me early in the morning 🙂


I think you are advocating the “model first, standardize/weight later” approach to maximize statistical efficiency with survey data?

This is clear with well-set-up experiments where the intention to sample is made consciously. What about claims databases that are repurposed for research, where the sampling scheme is not explicit or consciously made?

I’m assuming that studies that do not use probability sampling do not apply here, i.e., weighting is not relevant.


I see. Do you have any thoughts on using weights to approximate the data generating process?

If a variable affected the data-generating process and you don’t measure it, you can’t do anything anyway. If you measure it, then you condition on it, not weight by it.

I might have been imprecise with my wording; I did not mean unmeasured confounding.

I want to ask your opinion on weighting of samples from administrative databases. These databases usually lack explicit sampling probabilities but contain vast information that can be used to estimate the probability of, or propensity for, an exposure or a drug.

I’ll take a toy example of measuring the central tendency to set the context:

Say I want to estimate the average income of alumni of a university. I walked around the campus, asked 10 random alumni, and collected these numbers: 8 people with incomes of $5,000/month working junior-level jobs, 1 earning $12,000/month at the manager level, and 1 earning $28,000/month at the director level. What would be the average income?

The naive answer would be the sample mean or median. However, if I knew from other sources that around 20% of alumni are at the entry level, 60% at the manager level, and 20% at the director level, then I can use these “frequency weights” to approximate the data generating process and get a more valid inference for the central tendency of the population: 0.2 × 5000 + 0.6 × 12000 + 0.2 × 28000 = 13,800.
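In code, the contrast between the naive sample mean and this post-stratified estimate looks like the following (numbers taken from the toy example above):

```python
# Toy income example: 10 sampled alumni, with junior-level alumni oversampled
# relative to their population share.
incomes = [5000] * 8 + [12000, 28000]
naive_mean = sum(incomes) / len(incomes)  # 8000.0

# Post-stratify using the known population shares by job level.
pop_share = {"entry": 0.2, "manager": 0.6, "director": 0.2}
stratum_mean = {"entry": 5000, "manager": 12000, "director": 28000}
weighted_mean = sum(pop_share[k] * stratum_mean[k] for k in pop_share)  # 13800
```

The naive mean of 8,000 is pulled down by the oversampled junior stratum; the population shares correct for that.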

Extrapolating this toy example to claims/commercial databases, do you think it’s conceivable that we can use the large number of variables within the database to compute the propensity for, or probability of, the exposure and then use this as weights?

I see the main concerns as:

  • This depends on the accuracy of the recorded data: garbage in, garbage out
  • Even with the right data, analysts need to specify the models correctly. Needing to get both of these right lowers the chance of success significantly
  • Unmeasured confounders, which is a bit of a cliché (aren’t most things correlated with one another? Given a large enough number of variables, would this be solved, like an LLM with billions of parameters?)

I’ve been thinking about weighting and conditioning. For estimating the central tendency, aren’t these two different words for the same mathematical process?
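For a simple mean with discrete strata the two routes do coincide: conditioning (estimate each stratum mean, then standardize the predictions by population shares) gives the same number as weighting each observation by population share over sample share. A small check reusing the toy income numbers; note they agree here only because the grouping is fully saturated, and with continuous covariates or nonlinear models the two generally differ:

```python
# Two routes to the same population mean in the toy income example.
incomes = {"entry": [5000] * 8, "manager": [12000], "director": [28000]}
pop_share = {"entry": 0.2, "manager": 0.6, "director": 0.2}
n = sum(len(v) for v in incomes.values())  # 10 sampled alumni

# Route 1 (conditioning): estimate each stratum mean, then standardize
# the predicted means by the population shares.
conditional = sum(pop_share[k] * (sum(v) / len(v)) for k, v in incomes.items())

# Route 2 (weighting): attach weight (pop share / sample share) to each
# observation, then take the weighted mean.
weighted = sum(
    (pop_share[k] / (len(v) / n)) * yi
    for k, v in incomes.items()
    for yi in v
) / n

# Both routes give 13800 for this saturated, one-variable setup.
```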


This is a great example for thinking about conditioning. Is it helpful to know the average income when the consumer of this average doesn’t have a feel for the mixture of persons who went into it? Even if they do understand that distribution, I think not. I would rather see the estimates conditional on the categories you listed.


Of course, I concur with you on the importance of knowing treatment heterogeneity.

Most of the time when I read the literature, I only see reports of the average treatment effect. I’m trying to understand where it comes from, and I discovered the material above relating to weighting and claims databases. I think the setup for the average treatment effect is much easier to do and explain than the setup for heterogeneous effects.

Anyways, I hope this has been helpful for the discussion on weighting.

There are two issues that one shouldn’t mix. In an RCT the primary goal is to estimate relative treatment benefit, and this quantity is capable of being constant over a wide variety of patients. When one is not estimating relative effects but instead estimating absolutes in a one-sample problem, such as the income example, there will be massive differences in mean income by age and other factors.

Average treatment effects on say a risk or life expectancy scale are very hard to interpret. People think the concepts are easy only because they don’t understand the nuances.


Ah, this is a very nice comment! I can now see and put into words another dimension of the average treatment effect. I always felt something was off about the average treatment effect for risk measurement, because it does not translate into actionable insight for individuals.

I’ll chew on the ramifications of this for a while.
