Survey Non-Response Weighting Stata

Hello Everyone,

My question is quite specific: how do I adjust for non-response in a survey that has no design weight (or any weight, for that matter)? I need help solving this problem in Stata and was wondering if any of you could kindly paste an example from your own work where you used Stata to adjust for unit non-response.

The dataset I have is from a government training program and has around 127000 observations in total. These 127000 trainees were then sampled using an SRS on gender, whereby we drew around 6700 observations. When the team was sent out to survey these 6700 people, around 4200 responded, while the remainder were either not available or refused to respond.

I am really stuck and get more confused each time I browse the internet for help, as there are hardly any worked examples in Stata that I can use to understand how someone correctly solved this problem. I know this sounds desperate, but I really need help with this.

Hope to hear from the community.

PS: I know this is not health related; however, the method I am looking for probably works in every case, which is why I thought of asking the question here.

SRS = stratified random sample?

And the core of your question is to adjust for non-response, right? So you are assuming your non-responders have different characteristics than the responders (which is likely true). As you don’t have any weights (yet), do you have any information on characteristics (e.g. sex, age, education, anything?) for both the responders and non-responders? Or was the goal of the survey to collect various characteristics, so that you now only have them for the responders?

Hello scboone (not sure about your real name)

Yes, the sample was a stratified random sample (by gender).

The core is definitely to adjust for non-response. I do have information for both responders and non-responders, which includes age, sex, etc., but not marital status. I am trying to use Stata to form these weights and I have been searching online for a concrete method for going about this. If you want, I can send you the data file so you can have a look, but at the moment I need to understand exactly how to go about adjusting for non-response.

I have read about post-stratification, which is a calibration method for non-response weights, but the method I found talks about using a logit model to estimate a propensity score and then taking the inverse of the predicted probability. However, clear steps are not provided, and it would be wonderful if you could kindly help me out.

Thank you

Sorry for bumping this post, I just thought I'd ask whether you have figured something out.

I guess post-stratification is a good option. All you have to do is use response (No=0, Yes=1) as the outcome in a logistic regression model. The model should include all the variables you have for both the responders and non-responders (age, sex, etc.). After fitting the model, predict the probability of response (P) for each individual. Then take 1/P as the weight for responders and 1/(1-P) as the weight for non-responders. Once you have the weights, verify that they sum to the total number of observations (6700). To do the analysis, svyset your data and use svy: [whatever command you need here] [var: whatever variable you need here]. I haven’t done this in a long time, so I suggest you check my definition of the weights.

If this seems too complicated, and you have the distribution of age, gender, or whatever other variable(s) for the whole population (the 6700 or the 127000), you can just do a direct standardization. Actually, that’s what you are doing above.
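
A rough sketch of that simpler cell-based adjustment, in case it helps. It assumes the data contain all 6700 sampled people, a 0/1 variable response, and two categorical variables that I am calling agegrp and sex (hypothetical names; cell, n_sampled, n_responded and wt_class are also made up for this illustration). Each responder is weighted by the number sampled in their cell divided by the number who responded in that cell:

```stata
* Assumed variables: agegrp, sex (categorical), response (0 = no, 1 = yes)
* Build adjustment cells from the variables known for everyone sampled
egen cell = group(agegrp sex)

* Number sampled and number responding within each cell
bysort cell: egen n_sampled = count(response)
bysort cell: egen n_responded = total(response)

* Responders carry the weight of the non-responders in their cell
generate wt_class = n_sampled / n_responded if response == 1
```

Cells with very few responders give unstable weights, so it is probably worth tabulating cell by response before using them.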

I hope this helps.


Apologies for not following up on your posts, but what @lbautista describes is, I think, the right way to go.

Assuming you have a variable ‘response’ with 0 = no response and 1 = response on the survey, the Stata syntax would look similar to this:

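* Model the probability of responding (response: 0 = no, 1 = yes)
logistic response var1 var2 var3
* Predicted probability of response for every sampled person
predict respons_prob, pr
* Inverse-probability weight for responders
generate weight_response = 1 / respons_prob
* Weight for non-responders, as defined in the post above
replace weight_response = 1 / (1 - respons_prob) if response == 0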
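
A quick way to do the check suggested earlier in the thread is to sum the weights among responders, which should come out close to the number sampled (about 6700). This is only a sketch and assumes the variable names from the syntax above:

```stata
* Sum of the responder weights; expect a value near the sample size
quietly summarize weight_response if response == 1
display "Sum of responder weights: " r(sum)

* Inspect the distribution for extreme weights, which inflate the variance
summarize weight_response if response == 1, detail
```

If a handful of weights are very large, it may be worth revisiting the response model before running the analysis.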

Then you can use either the svyset command or pweight for the command you like. For svyset, the syntax would look like this:

svyset _n [pweight = weight_response] 
svy: mean var1
svy: regress var1 var2


This sets a simple survey design with no clustering etc. I’m not completely sure whether this applies to your situation, as you used stratified random sampling, and my knowledge of survey analysis is limited too. Alternatively, you can use the pweight option with many commands, which gives more or less the same results:

mean var1 [pweight = weight_response]
regress var1 var2 [pweight = weight_response]
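
Because the sample was stratified by gender, the strata can also be declared in svyset. A minimal sketch, assuming the stratification variable is called gender (a hypothetical name; use whatever it is called in your data):

```stata
* gender is an assumed name for the stratification variable
svyset _n [pweight = weight_response], strata(gender)
svy: mean var1
```

Declaring the strata should leave the point estimates unchanged and affect only the standard errors.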

Thank you ibautista, I appreciate your help. I will look into this and get back to you if I have further queries.

Thank you scboone, I will definitely look into this and see how it goes for me. Thank you for taking the time to reply!

Hi, I found this thread very useful for my analysis. Thank you to all of you. I was wondering if you have a reference for the formula mentioned, i.e. for responders the weight is 1/P and for non-responders the weight is 1/(1-P). Thanks.

Look up propensity score analysis for non-response. In the longitudinal setting, Phil Lavori showed that propensity weighting to deal with dropout over time results in hugely inefficient estimates.

Thank you! I have been reading about propensity scores. Mine is a cross-sectional survey.

Some of the major inefficiency of weighted analysis will also apply there.
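
To get a feel for how much precision the weights cost, one common rule of thumb is Kish's approximate effective sample size, (sum of weights)^2 / (sum of squared weights). A minimal sketch, assuming the weight_response and response variables from the syntax earlier in the thread; w_sq, sum_w and sum_w2 are names made up for this illustration:

```stata
* Sum of the weights among responders
quietly summarize weight_response if response == 1
scalar sum_w = r(sum)

* Sum of the squared weights among responders
generate double w_sq = weight_response^2 if response == 1
quietly summarize w_sq
scalar sum_w2 = r(sum)

* Kish approximation: n_eff = (sum w)^2 / (sum w^2); deff = n / n_eff
display "Effective sample size: " scalar(sum_w)^2 / scalar(sum_w2)
display "Approximate design effect: " r(N) * scalar(sum_w2) / scalar(sum_w)^2
```

An effective sample size well below the number of responders means the weights vary a lot and the weighted estimates will be noticeably less precise than unweighted ones.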