Imbalance in pre-post studies

If I were to anticipate that some people will drop out after the baseline assessment in an observational pre-post study of treatment, would I be wiser to limit recruitment to those most likely to return for the second assessment (e.g., people who live locally), or is there value in ‘orphaned’ first visit data?

Pre-post designs do not allow statistical inference in general, and the big medical journals simply don’t accept them. Under a set of amazingly restrictive conditions you can do inference, but the design assumes that no one drops out. Dropouts in pre-post studies are fatal. If you can limit recruitment to those who have a probability of 1.0 of getting both pre and post measurements, you can proceed.


Thanks, Frank.

What sorts of alternatives do you find yourself recommending when people come to you thinking of a pre-post design? Let’s say the pre-post design involves assessing biology prior to and after the cure of a disease.

Sorry I jumped to the conclusion that this was an intervention assessment. With a fairly low amount of missing data, pre-post may be fine for your purpose. There is sometimes an issue of regression to the mean, requiring multiple baselines, if the response variable is also used at baseline to select patients. For example, if you enroll patients with SBP > 140 and study their SBP, a good many patients will show a lower SBP at follow-up simply because their original reading was high owing to measurement error.


Understood. Thank you.

Is it most proper to censor the data from the person who drops out? Or do their data contribute to the analysis?

How should this affect whether we consider distance from our site as an enrollment criterion since we anticipate this distance will impact the likelihood of accruing data in the ‘post’ period?

Necessary caveat: as Frank said, a simple PRE vs. POST assessment of people getting a treatment, without a concurrent control group that does not receive it, is not very good for assessing the size of a treatment effect, in part because of the potential for regression to the mean. It’s hard to know how much of the observed “change” is due to the treatment versus the simple passage of time. As a hypertension researcher, you’re very well aware of this: e.g., if you enroll people with SBP above a certain cutoff (SBP > 140) and just measure them again a week later (with no treatment), it’s likely that the mean of the second measurement will be lower than the mean of the first.
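To make the SBP example concrete, here is a minimal simulation sketch. Everything numeric in it is an assumption chosen for illustration, not data from any study: the population mean, the between-person SD, the measurement-error SD, and the function name `simulate_rtm` are all hypothetical. Each person has a stable underlying SBP, gets two noisy readings with no treatment in between, and is enrolled only if the first reading exceeds 140.

```python
import random
import statistics


def simulate_rtm(n=100_000, true_mean=135.0, between_sd=15.0,
                 error_sd=10.0, cutoff=140.0, seed=0):
    """Two untreated SBP readings per person; enroll on the first one.

    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n):
        true_sbp = rng.gauss(true_mean, between_sd)   # stable underlying SBP
        visit1 = true_sbp + rng.gauss(0.0, error_sd)  # screening reading
        visit2 = true_sbp + rng.gauss(0.0, error_sd)  # repeat reading, no treatment
        if visit1 > cutoff:                           # enrollment rule uses visit 1 only
            first.append(visit1)
            second.append(visit2)
    return statistics.mean(first), statistics.mean(second)


mean_pre, mean_post = simulate_rtm()
print(f"enrolled mean at screening: {mean_pre:.1f}")
print(f"enrolled mean at repeat:    {mean_post:.1f}")
```

Under these assumptions the repeat mean comes out lower than the screening mean even though nobody's underlying SBP changed, which is the "change" a pre-post comparison would wrongly credit to treatment. Setting `error_sd=0.0` should make the two means coincide, since the selection then no longer favors readings that were high by chance.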

That said, if there is a good case where a simple PRE-POST without a comparison group is useful, you’ve hit on another critical issue: missing POST data will typically reflect some sort of informative or non-ignorable missingness (the most obviously problematic being if the patients missing your follow-up assessment have died, or have gotten sicker and are unable to come in for the assessment). It’s a little murkier with something like “distance from site”: people who live farther away may be less likely to return for the POST assessment, but is there a reason to expect distance to also influence the outcome itself? That is, it would only seem to introduce bias if the reason for missing (greater distance) was also related to the study outcome.
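A rough way to see the distinction is to simulate it. The sketch below is illustrative only: the true mean change of -5, the dropout probabilities, and the helper names (`completer_mean_change`, `drop_by_distance`, `drop_if_worse`) are all made up for the example. It compares the mean change among completers when dropout depends only on distance (assumed here to be unrelated to the outcome) versus when patients who got worse are less likely to return.

```python
import random
import statistics

TRUE_MEAN_CHANGE = -5.0  # assumed true post-minus-pre change


def completer_mean_change(dropout_rule, n=50_000, seed=1):
    """Mean change among people who return for the POST visit."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n):
        change = rng.gauss(TRUE_MEAN_CHANGE, 10.0)  # this person's true change
        distance_km = rng.uniform(0.0, 100.0)       # hypothetical distance from site
        if not dropout_rule(change, distance_km, rng):
            changes.append(change)                  # only completers are observed
    return statistics.mean(changes)


def drop_by_distance(change, distance_km, rng):
    # Dropout rises with distance (up to 50%) but ignores the outcome.
    return rng.random() < distance_km / 200.0


def drop_if_worse(change, distance_km, rng):
    # Patients whose measure went up (got worse) are far more likely to drop out.
    return rng.random() < (0.6 if change > 0 else 0.1)


unrelated = completer_mean_change(drop_by_distance)
related = completer_mean_change(drop_if_worse)
print(f"true mean change:              {TRUE_MEAN_CHANGE:.2f}")
print(f"completers, distance dropout:  {unrelated:.2f}")
print(f"completers, outcome dropout:   {related:.2f}")
```

In this setup the distance-driven dropout costs sample size but leaves the completers' mean close to the truth, while the outcome-driven dropout makes the completers look like they improved more than the full group did. The caveat is the assumption baked into `drop_by_distance`: if distance were in fact tied to the outcome (say, through access to care), it would behave more like the second rule.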

If you are able to collect data on the reasons for missing (e.g. at least differentiate between “patient died” or “patient was too sick” versus “patient just did not show up”), you might be able to use that as part of the analysis.


That’s very helpful to know!! Thank you