What should be the denominator for an online national survey?

Dear members,

I had planned to conduct a prevalence survey (of a rare disease) in the general adult population of the UK. My initial plan was to distribute the survey to 6,000 randomly selected addresses and also make it available online. Unfortunately, due to a lack of funding, I could only go ahead with the online version. I am now wondering what the appropriate denominator should be for a national online survey.

Any insights or evidence would be greatly appreciated.

Kind regards,


Please clarify: did the survey ask a yes/no question about whether the respondent has the disease of interest, or did you ask only people who had the disease to respond? For the latter, there is no way to do what you want. At best, if the survey included a question about a second condition, the design would allow estimation of the ratio of the prevalence of the disease of interest to the prevalence of that second condition.

On the other hand, if you asked people to respond yes/no to a disease question, the denominator is the number of respondents, but the survey has unknown bias and should not be trusted. You have no way of knowing whether those with the disease are more or less likely to answer an internet survey than those without it, so the prevalence estimate can be badly wrong.
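The size of this distortion is easy to demonstrate with a toy calculation. All numbers below are made up for illustration; none come from the actual survey:

```python
# Hypothetical scenario: true prevalence is 0.1%, but people with the
# disease are assumed to be 20x more likely to answer an online survey
# about their condition than people without it.
true_prev = 0.001
pop = 1_000_000
response_rate_disease = 0.20   # assumed response rate among the affected
response_rate_healthy = 0.01   # assumed response rate among everyone else

n_disease = pop * true_prev
n_healthy = pop - n_disease

yes_respondents = n_disease * response_rate_disease
no_respondents = n_healthy * response_rate_healthy

apparent_prev = yes_respondents / (yes_respondents + no_respondents)
print(f"true prevalence:     {true_prev:.4%}")
print(f"apparent prevalence: {apparent_prev:.4%}")
```

Under these assumed response rates, a disease affecting 0.1% of the population appears to affect roughly 2% of respondents, purely because affected people respond more often.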

An instructive example is the mess Google made when trying to estimate the time-varying prevalence of flu (Google Flu Trends). As more people heard about the flu, more of them ran Google searches on “flu”, which inflated the estimated prevalence and made it unreliable. Internet surveys can be useful for correlative analyses but not for estimating absolute quantities. For example, you may be able to estimate the association between a respondent’s sex and disease status.


Dear @f2harrell,
Thank you for sharing your feedback. Just to clarify: the survey includes a yes/no question about whether the respondent has the specified health condition. Based on the data, a surprisingly large number of respondents report having been diagnosed with the condition, which would produce a misleadingly high prevalence estimate, since the condition is actually rare. I was considering using the current adult population of the UK as the denominator instead. Do you think that would be a more appropriate approach?

So your proposed estimate of the prevalence is \frac{\text{# of people who responded Yes on survey}}{\text{Total adult pop. of UK}} ?

If that’s the case, there’s no way that’s going to give you a valid estimate of the disease prevalence. That numerator and denominator are not related in any meaningful way.

The fact that you’re seeing many respondents reporting what should be a very rare disease is indicative of what @f2harrell already mentioned: your internet survey is almost certainly biased and is not providing good data.
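One way to see why that ratio is meaningless: it scales with how many people happen to respond, not with how common the disease is. A toy calculation with made-up numbers:

```python
# (# Yes respondents) / (UK adult population) is driven by response
# volume, not by disease frequency. All figures below are illustrative.
uk_adults = 53_000_000
estimates = []
for n_respondents in (1_000, 10_000, 100_000):
    n_yes = n_respondents * 0.02   # assumed: 2% of respondents say Yes
    estimates.append(n_yes / uk_adults)
print(estimates)
```

The “prevalence” estimate changes 100-fold across these scenarios even though nothing about the disease has changed; only the survey's reach changed.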


Ideally, if you asked additional questions in the questionnaire, and you can assume those questions capture the variables related to response propensity, then you could standardize over them or reweight to (at least partially) cancel out the bias due to differential response.
You would not assume your \frac{\# \text{people responding yes}}{\# \text{total people responding}} generalizes to the UK population as is, but you could adjust it - say, using known census data - so that it generalizes better.

For example, if you asked for gender and discovered that the proportion of women in your sample is 70% instead of the 51% it should be in the UK, you could post-stratify/reweight your sample so that it effectively contains 51% women, and your disease prevalence estimate would be adjusted accordingly.
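This reweighting step can be sketched in a few lines. The counts below are invented for illustration (they are not from the survey), and the only auxiliary variable is sex; a real adjustment would use more strata:

```python
# Post-stratification sketch: weight each respondent so the weighted
# sample matches known census proportions, then compute a weighted
# prevalence. All counts are assumed for illustration.
sample = {                                  # (sex, has_disease) -> count
    ("F", True): 35, ("F", False): 665,    # 70% women in the sample
    ("M", True): 5,  ("M", False): 295,
}
census = {"F": 0.51, "M": 0.49}            # known UK population proportions

n_total = sum(sample.values())
sample_prop = {
    sex: sum(c for (s, _), c in sample.items() if s == sex) / n_total
    for sex in census
}
weights = {sex: census[sex] / sample_prop[sex] for sex in census}

weighted_yes = sum(c * weights[s] for (s, d), c in sample.items() if d)
weighted_n = sum(c * weights[s] for (s, _), c in sample.items())
adjusted_prev = weighted_yes / weighted_n

raw_prev = sum(c for (_, d), c in sample.items() if d) / n_total
print(f"raw prevalence:      {raw_prev:.2%}")
print(f"adjusted prevalence: {adjusted_prev:.2%}")
```

Note this only corrects for imbalance on the variables you adjust for; as pointed out below, it cannot fix bias from unmeasured drivers of who chose to respond.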

I’m not an expert in survey methodology (though the methods correspond to those from causal inference), but it may be worth consulting the literature on methods for adjusting for nonresponse in surveys.

That assumes non-informative non-response, i.e., missing at random: the reasons for the missingness are captured in the observed data or are completely random. I don’t think that’s a reasonable assumption here.