Questions regarding survey analysis in Stata


Hi All,

I am doing an analysis using the National Inpatient Sample (NIS), which uses a complex survey design and provides weights to project national estimates. My unweighted study sample is about 1000 and the weighted sample is about 5000. The recommendation is to use “svy” commands when working with survey data. My questions are:

  1. While performing a logistic regression analysis, I was told that I can do it in 2 ways:
    a.) Using the “svy: logit dep(var) indep(var), or” command. If I do this, the number of observations remains 1000 and the population size is 5000.
    b.) Expanding the dataset using the “expand weight” command and then running “logit dep(var) indep(var), or” on the expanded sample. Here the number of observations is 5000.
    Both make sense and I have tried both, but they give slightly different results in terms of p-values.

  2. I can’t calculate “median (IQR)” using “svy” commands. What is the alternative for summarizing skewed (non-normally distributed) continuous variables in such samples? The way I did it: I expanded the sample using the weights and then ran the command “sum variable, d” on the 5000 patients. Is that appropriate? If I don’t expand by the weights, my sample size remains 1000.

  3. What is the correct method to determine p-values for trends in categorical variables in the survey sample (for example, % of males admitted with myocardial infarction over the last 10 years) and in continuous variables (length of stay over the years)? The recommended official command for a trend p-value in a categorical variable is “nptrend”. For continuous variables, linear regression is suggested. Again, I am not sure how to take the survey design into account while doing this, so I am running these commands after expanding the dataset. Is there a more appropriate way?

I would really appreciate any help.

Thank you.



While you await a real answer to your question, keep in mind that if you condition on the variables that caused the weighted sampling (that is, adjust for them as covariates), you can get conditional estimates of effects without needing to weight the analysis.
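
As a rough illustration of the conditioning idea, a minimal sketch might look like the following. All variable names are placeholders invented for the example (`died`, `age`, and `stratum` as the design variable that drove the sampling), not names from the dataset discussed in this thread.

```stata
* Placeholder names (assumptions): died = binary outcome, age = covariate,
* stratum = categorical design variable that determined the sampling

* Unweighted model that adjusts for the design variable as a covariate,
* giving estimates conditional on stratum rather than population-weighted ones
logit died age i.stratum, or
```

The odds ratio for `age` here is conditional on `stratum`, so it answers a somewhat different question than a weighted, population-level estimate.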



Perhaps a bit late, and I am not completely sure about all your items, but here goes.

  1. I think the safest thing, even though the p-values etc. might be quite similar, is to keep using the svy commands, for the following reason. The weights that come with your dataset are sampling weights, which are inverse probability weights (that is, the inverse of the probability of being sampled into the study). Once you have set the survey design, Stata uses these weights to reweight your sample, which makes it look to the analysis software as though you have a (much) larger sample that is representative of the entire country. However, this larger sample is in a sense false, because you obviously didn’t measure your variables of interest in the entire country. The Stata svy commands account for this when estimating confidence intervals, p-values etc. This is required, because otherwise your software would estimate all these quantities as if you really had a much larger sample. So the svy commands contain estimators for the variance etc. that account for the fact that your weighted sample is created somewhat artificially and that you don’t actually have such a large sample to work with.
    From what you described, the expand weight approach ignores these concerns and makes it look to your analysis software as though you actually have such a large sample. Essentially, the expand weight method seems to create a bigger dataset containing clones of the participants in the smaller dataset and then runs the analyses on them, without accounting for the fact that individuals in the general population are like your sampled participants but not identical to them. Although the difference might be small, as you noticed (probably due to your relatively large sample), I think it can still be wrong in some situations, and it does not hurt to be on the safe side and use the svy commands. Also see the following resource for some more info on weights in Stata:

  2. I’d like to refer you to the following page from Stata itself that goes into calculating percentiles (including the median) in survey data: In brief, use the pctile or _pctile command with pweights to calculate the median and any other percentiles (10th, 90th, the 25th and 75th for the interquartile range, etc.) that you’d like.

  3. I am not completely sure I understand what issue you are running into and what you are trying to determine. What exactly do you mean by a trend in categorical variables? If you are just referring to comparing a dichotomous outcome between two groups, you can use logistic regression. For your continuous variables, you say linear regression is recommended. Both of these commands can be run with the svy prefix, which will automatically take your survey design into account. If you could clarify a bit what you would like to do, maybe we can help you some more!
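
The three points above could be sketched roughly as follows. This is only an illustration under stated assumptions: every variable name (`hosp_id`, `stratum`, `wt`, `died`, `age`, `male`, `los`, `year`) is a placeholder made up for the example, and the actual design variables should be taken from the dataset's documentation.

```stata
* Placeholder names (assumptions, not from the thread):
*   hosp_id = cluster / primary sampling unit, stratum = sampling stratum,
*   wt = sampling weight, died/age/male/los/year = analysis variables

* Declare the survey design once; later svy: commands reuse it
svyset hosp_id [pweight = wt], strata(stratum)

* 1. Design-based logistic regression, reported as odds ratios
svy: logit died age i.male, or

* 2. Weighted 25th, 50th and 75th percentiles of a skewed variable
_pctile los [pweight = wt], percentiles(25 50 75)
display "Median LOS = " r(r2) "  (IQR " r(r1) " to " r(r3) ")"

* 3. Trend over calendar year, using the svy prefix
svy: logit male year, or    // binary characteristic: test on year is the trend test
svy: regress los year       // continuous outcome: slope on year is the trend
```

Note that `_pctile` only uses the weights to produce point estimates; it does not report design-based standard errors. In the trend models, `year` is entered as a continuous term, which tests a linear trend (on the log-odds scale for the logistic model).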

Good luck!




I agree with the previous reply. “Expanding your dataset” is an inappropriate use of sampling weights.

I would be surprised if your dataset did not have strata or cluster variables; make sure you include them in your svyset command. Using svy is close to using pweights with cluster-robust standard errors (the point estimates are the same, and svy additionally accounts for stratification), and the pweight approach can be used in situations where Stata does not allow the svy prefix.
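
For a command that does not accept the svy prefix, the fallback described above might look like this sketch. The variable names (`died`, `age`, `wt`, `hosp_id`) are placeholders assumed for illustration, and `logit` is used only to show the syntax pattern.

```stata
* Placeholder names (assumptions): died = outcome, age = covariate,
* wt = sampling weight, hosp_id = cluster / primary sampling unit

* Sampling weights plus cluster-robust standard errors.
* Point estimates match svy: logit; stratification is not used here,
* so standard errors can differ somewhat from the svy results.
logit died age [pweight = wt], vce(cluster hosp_id) or
```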

Here is a useful link:

If you still have doubts, I would suggest you contact the managers of your dataset, as they could provide advice more specific to your data.

Hope this helps!
