I’ve been playing with the synthpop package in R, and it occurred to me that synthetic data sets generated from my real data set might be useful in several circumstances. However, I can also see potential problems with this, and I would appreciate the wisdom of others in thinking them through.
Are there circumstances in which synthetic data sets could be better than bootstrapping when developing and internally validating regression models?
- One possibility I thought of is where calibration in the very low probability range (e.g. <1%) is particularly important. Unless one has a massive data set, calibration in this range is determined by the very few events that occur despite a low predicted risk. For example, in risk prediction models for myocardial infarction (MI), the troponin biomarker dominates the prediction. In rare circumstances a very low troponin concentration is measured, yet the person is later diagnosed with an MI. Bootstrapping may help, but every resample still contains only the specific troponin concentrations of these few exceptions. A synthetic data set, by contrast, may still contain some MIs with an initially low troponin, but the actual troponin values are likely to be spread out a little. Could this improve calibration? How do you think I could test this?
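One way this might be tested is by simulation, where the true risk is known and calibration in the low range can be measured directly. Below is a minimal sketch in base R, not a definitive method. The helper name `low_risk_oe`, the 1% threshold, and the data-generating model are all assumptions made for illustration. The helper returns the observed-to-expected event ratio among subjects whose predicted risk is below the threshold, so a value near 1 suggests good calibration in that range.

```r
# Hypothetical helper: observed/expected event ratio among subjects
# whose predicted risk is below `threshold` (default 1%).
low_risk_oe <- function(pred, y, threshold = 0.01) {
  stopifnot(length(pred) == length(y))
  low <- pred < threshold
  if (!any(low)) return(NA_real_)  # nobody falls in the low-risk range
  observed <- sum(y[low])          # events that occurred despite a low prediction
  expected <- sum(pred[low])       # events the model expected in that range
  observed / expected
}

# Toy simulation with a known truth (all values are illustrative assumptions)
set.seed(42)
n <- 20000
x <- rnorm(n)                      # stand-in for a log-transformed biomarker
true_risk <- plogis(-4 + 2 * x)    # known data-generating model
y <- rbinom(n, 1, true_risk)

fit <- glm(y ~ x, family = binomial)
oe <- low_risk_oe(fitted(fit), y)  # apparent O/E ratio below 1% predicted risk
oe
```

The same helper could be applied to models developed with bootstrapping and with synthetic data, each evaluated on a large independent sample drawn from the known model, to see which comes closer to 1 in the low-risk range.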
Could synthetic data sets be used to develop optimal models?
- For example, I may have a data set with a couple of thousand subjects and, say, a 10% event rate. If I have only 5 or 6 variables to consider but want to test a few interactions, would it be viable to generate a number of synthetic data sets of the same size as the real data set and use them to try to determine the optimal model? The form of this model could then be fitted to the original data set to obtain the beta coefficients for validation elsewhere. Viable?
- A second possibility would be to create a synthetic data set much larger than the real data set and then develop a model with, say, 30 degrees of freedom, i.e. a number that would likely result in overfitting if the model were developed in the real data set. In this scenario the beta coefficients would have to be taken from the model developed in the synthetic data set. I expect this is sub-optimal, but it may have its uses. Thoughts?
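To make the first of these two workflows concrete (choose the model form across several synthetic data sets, then refit that form on the real data), here is a rough sketch. It assumes synthpop's `syn()` function with its `m`, `seed` and `print.flag` arguments. The data frame `real`, its variables, the two candidate formulas, and the use of mean AIC as the selection criterion are all made up for illustration and are not a recommendation.

```r
library(synthpop)

# Made-up stand-in for the real data: 2000 subjects, event rate on the order of 10%
set.seed(1)
n <- 2000
real <- data.frame(age = rnorm(n, 60, 10), biomarker = rexp(n))
lp <- -3 + 0.02 * (real$age - 60) + 0.8 * real$biomarker
real$event <- factor(rbinom(n, 1, plogis(lp)))

# Five synthetic data sets, each the same size as the real one
synth <- syn(real, m = 5, seed = 2024, print.flag = FALSE)

# Candidate model forms (illustrative): with and without an interaction
candidates <- list(
  main        = event ~ age + biomarker,
  interaction = event ~ age * biomarker
)

# AIC of each candidate in each synthetic data set (rows = candidates)
aic <- sapply(synth$syn, function(d) {
  sapply(candidates, function(f) AIC(glm(f, family = binomial, data = d)))
})

# Pick the form with the lowest mean AIC across the synthetic data sets
best <- names(which.min(rowMeans(aic)))

# Refit the chosen form on the real data to obtain the beta coefficients
final_fit <- glm(candidates[[best]], family = binomial, data = real)
coef(final_fit)
```

For the second workflow, `syn()` also takes a `k` argument that sets the number of rows in the synthetic data set, so a larger synthetic sample could be requested in the same way.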