I’ve been playing with the synthpop package in R and it occurred to me that I might be able to use synthetic data sets generated from my real data set in several circumstances. However, I also suspect there may be some issues with these, and I would appreciate the wisdom of others in thinking them through.

Are there circumstances in which synthetic data sets could be better than bootstrapping when developing and internally validating regression models?

One possibility I thought of is where calibration in the very low probability range (e.g. <1%) is particularly important. Unless one has massive data sets, calibration in this range is determined by the very few events that occur despite the low prediction. For example, in risk prediction models for myocardial infarction (MI) the troponin biomarker dominates the risk prediction. In some rare circumstances a very low troponin concentration may be measured, yet the person is later diagnosed with an MI. Bootstrapping may help, but it still reflects the specific concentrations of these exceptions. A synthetic data set, however, may still contain some MIs with an initially low troponin, but the actual troponin values are likely to be spread a little. Could this improve calibration? How do you think I could test this?
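One way to test it might be a simulation where the truth is known: generate data from a known risk model, fit on a development sample, and check calibration in the <1% stratum against a very large holdout. A minimal base-R sketch (the true model, the exponential biomarker distribution, and the sample sizes are all assumptions for illustration):

```r
set.seed(1)

truth_p <- function(x) plogis(-6 + 0.04 * x)   # assumed true risk model

draw <- function(n) {
  x <- rexp(n, rate = 1 / 50)                  # skewed biomarker, troponin-like
  data.frame(x = x, y = rbinom(n, 1, truth_p(x)))
}

dev     <- draw(2000)                          # development sample
holdout <- draw(2e5)                           # large holdout for calibration

fit <- glm(y ~ x, family = binomial, data = dev)
p   <- predict(fit, newdata = holdout, type = "response")

# Calibration in the low-risk stratum: observed vs mean predicted where p < 1%
low <- p < 0.01
c(observed = mean(holdout$y[low]), predicted = mean(p[low]))
```

The same loop could then be repeated with a bootstrapped copy of `dev`, and again with a synthetic copy (e.g. from synthpop), comparing the observed-vs-predicted gap in the low-risk stratum over many replicates to see whether the synthetic version actually helps.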

Could synthetic data sets be used to develop optimal models?

For example, I may have a data set with a couple of thousand subjects and, say, a 10% event rate. If I only have 5 or 6 variables to consider, but want to test a few interactions, would it be viable to generate a number of synthetic data sets of the same size as the real data set and use them to determine the optimal model form? That form could then be fitted to the original data set to obtain the beta coefficients for validation elsewhere. Viable?
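A minimal sketch of that workflow, with toy data: pick the model form on several synthetic copies, then refit that form on the real data for the betas. The crude jitter-and-resample synthesizer here is only a stand-in for synthpop::syn, and the candidate forms and data-generating model are assumptions:

```r
set.seed(2)
n <- 2000
real <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
real$y <- rbinom(n, 1, plogis(-2.2 + 0.8 * real$x1 + 0.4 * real$x1 * real$x2))

# Crude stand-in for synthpop::syn: resample and jitter the covariates, then
# draw outcomes from a flexible model fitted to the real data
gen <- glm(y ~ x1 * x2, family = binomial, data = real)
simple_syn <- function(d) {
  s <- d[sample(nrow(d), replace = TRUE), ]
  s$x1 <- s$x1 + rnorm(nrow(s), sd = 0.1)
  s$x2 <- s$x2 + rnorm(nrow(s), sd = 0.1)
  s$y  <- rbinom(nrow(s), 1, predict(gen, newdata = s, type = "response"))
  s
}

forms <- list(y ~ x1 + x2, y ~ x1 * x2)        # candidate model forms

# Average AIC of each candidate form across 10 synthetic copies
avg_aic <- sapply(forms, function(f)
  mean(replicate(10, AIC(glm(f, family = binomial, data = simple_syn(real))))))

best  <- forms[[which.min(avg_aic)]]
final <- glm(best, family = binomial, data = real)  # betas from the real data
coef(final)
```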

A second possibility would be to create a synthetic data set much larger than the real data set and then develop a model with, say, 30 degrees of freedom, i.e. a number that, if used with the real data set, would likely result in overfitting. In this scenario the beta coefficients would have to be taken from the model developed in the synthetic data set. I expect this is sub-optimal, but it may have its uses. Thoughts?

Not entirely what you’re asking, but we sometimes do something similar to this using plasmode simulation. The idea being that you bootstrap your sample and then re-assign exposure/outcome using known treatment/outcome models. Then you test your proposed methods to see how they perform in your dataset when you know the truth.
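A minimal sketch of the plasmode idea, with toy data: bootstrap the covariates/exposure, re-assign the outcome from a known model, and check how well an analysis method recovers the built-in truth. The data, the logistic outcome model, and the target log odds ratio are all illustrative assumptions:

```r
set.seed(3)
n <- 500
obs <- data.frame(x = rnorm(n), a = rbinom(n, 1, 0.5))  # covariate and exposure

true_logOR <- 0.5                               # the "truth" we build in

one_rep <- function() {
  bs <- obs[sample(n, replace = TRUE), ]        # bootstrap the sample
  # re-assign the outcome from a known outcome model
  bs$y <- rbinom(n, 1, plogis(-1 + true_logOR * bs$a + 0.8 * bs$x))
  coef(glm(y ~ a + x, family = binomial, data = bs))["a"]
}

est <- replicate(200, one_rep())
c(bias = mean(est) - true_logOR, emp_se = sd(est))  # how the method performs
```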

I might be totally misunderstanding what you’re asking though!

Thank you, Tim, for the paper. It’s interesting. Bootstrapping (with replacement) is something I have utilised in the past. Maybe a very simple illustration will give you an idea of why I am thinking about alternatives to a bootstrapped data set.

Consider a data set with one variable X and outcome Y.
When Y = 1, the X values are: 5, 50, 75, 77, 80, 120, 200, 230, 400
When Y = 0, the X values are: 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 9, 10, 12, 12, 25, 30, 34, 45, 55, 60

I’m interested in the extremes of X, particularly the “5” when Y = 1. If I were to bootstrap data sets of the same dimension, then about two-thirds of them would contain the “5” with Y = 1 at least once. However, there would be none with, say, X = 20 and Y = 1. Yet, intuitively, in the population I’m trying to understand there may well be a measurement of X = 20 when Y = 1. What I am interested in is simulating data sets that could contain such a combination.
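The two-thirds figure can be checked directly: with 30 rows in total, the chance that a same-size bootstrap includes any one specific row at least once is 1 − (29/30)^30. A quick sketch, both analytically and by simulation:

```r
# Probability a specific row (e.g. the X = 5, Y = 1 case) appears at least
# once in a same-size bootstrap of 30 rows
p_analytic <- 1 - (29 / 30)^30
round(p_analytic, 3)                            # about 0.638, i.e. ~2/3

# Simulation check: does row 5 appear in a resample of the 30 row indices?
set.seed(4)
hits <- replicate(1e4, 5 %in% sample(30, 30, replace = TRUE))
mean(hits)
```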

Ah, OK, this makes more sense. There is an indirect comparison method sometimes used to support HTA submissions, called simulated treatment comparison, that might help. The purpose of the approach is to use your own IPD to make comparisons between your technology and another in the population that is more relevant for decision making. One approach is simply to take the varcov matrix of your dataset and then simulate a profile of patients from a multivariate normal (mvnorm) with mean-shifted means/SDs to get the population profile you’re looking for.
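A minimal sketch of that idea using MASS::mvrnorm: estimate the covariance of your IPD, shift the means toward the target population, and simulate patient profiles. The toy IPD and the target means are assumptions for illustration:

```r
library(MASS)
set.seed(5)

# Toy individual patient data (stand-in for your real IPD)
ipd <- data.frame(age = rnorm(300, 60, 10), bmi = rnorm(300, 27, 4))

Sigma <- cov(ipd)                               # varcov matrix of your dataset

# Assumed profile of the decision-relevant population (mean-shifted)
target_mu <- c(age = 70, bmi = 30)

sim <- mvrnorm(5000, mu = target_mu, Sigma = Sigma)
colMeans(sim)                                   # close to the shifted means
```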