Use of synthetic data in model development

kiwiskiNZ · September 27, 2019, 12:10am

I’ve been playing with the synthpop package in R and it occurred to me that I may be able to use synthetic data sets generated from my real data set in several circumstances. However, I also think there may be some issues with this, and I would appreciate the wisdom of others in thinking them through.

Are there circumstances in which synthetic data sets could be better than bootstrapping when developing and internally validating regression models?

One possibility I thought of is where calibration in the very low probability range (e.g. <1%) is particularly important. Unless one has massive data sets, calibration in this range is determined by the very few events that occur despite the low prediction. For example, in risk prediction models for myocardial infarction (MI), the troponin biomarker dominates the risk prediction. In some rare circumstances a very low troponin concentration may be measured, yet the person is (later) diagnosed with an MI. Bootstrapping may help, but it still reflects the specific concentrations of these exceptions. A synthetic data set, however, may still have some MIs with an initial low troponin, but the actual troponin values are likely to be spread a little. Could this improve calibration? How do you think I could test this?
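To make concrete how few events drive calibration in this range, here is a minimal Python sketch that compares observed and expected events among predictions below a threshold. The function name `low_risk_calibration`, the 1% default threshold, and the example numbers are all made up for illustration; this is not a full calibration assessment.

```python
def low_risk_calibration(preds, outcomes, threshold=0.01):
    """Observed vs expected events among subjects with predicted risk < threshold."""
    low = [(p, y) for p, y in zip(preds, outcomes) if p < threshold]
    n = len(low)
    if n == 0:
        return None
    events = sum(y for _, y in low)      # observed events in the low-risk group
    expected = sum(p for p, _ in low)    # events the model expects in that group
    return {
        "n": n,
        "events": events,
        "expected": expected,
        "observed_rate": events / n,
        "mean_predicted": expected / n,
    }


# Hypothetical predictions: 1000 low-risk subjects and 100 higher-risk subjects.
preds = [0.002] * 500 + [0.008] * 500 + [0.2] * 100
# Hypothetical outcomes: only 2 events among the 1000 low-risk subjects.
outcomes = [1] + [0] * 499 + [1] + [0] * 499 + [1] * 20 + [0] * 80

summary = low_risk_calibration(preds, outcomes)
print(summary)
```

With so few events in the low-risk group, one event more or less changes the observed rate substantially, which is the instability described above.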

Could synthetic data sets be used to develop optimal models?

For example, I may have a data set with a couple of thousand subjects and, say, a 10% event rate. If I only have 5 or 6 variables to consider, but want to test a few interactions, would it be viable to develop a number of synthetic data sets of the same size as the real data set and use them to try to determine the optimal model? The form of this model could then be fitted to the original data set to obtain the beta coefficients for validation elsewhere. Viable?
A second possibility would be to create a synthetic data set much larger than the real data set and then develop a model with, say, 30 degrees of freedom, i.e. a number that would likely result in overfitting if the model were developed in the real data set. In this scenario the beta coefficients would have to be taken from the model developed in the synthetic data set. I expect this is sub-optimal, but it may have its uses. Thoughts?

timdisher · September 27, 2019, 2:24pm

Not entirely what you’re asking, but we sometimes do something similar to this using plasmode simulation. The idea being that you bootstrap your sample and then re-assign exposure/outcome using known treatment/outcome models. Then you test your proposed methods to see how they perform in your dataset when you know the truth.

ncbi.nlm.nih.gov

Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases.

JM Franklin, S Schneeweiss, JM Polinski and JA Rassen, Computational statistics & data analysis, Apr 2014

Longitudinal healthcare claims databases are frequently used for studying the comparative safety and effectiveness of medications, but results from these studies may be biased due to residual confounding. It is unclear whether methods for confounding adjustment that have been shown to perform well in small, simple nonrandomized studies are applicable to the large, complex pharmacoepidemiologic studies created from secondary healthcare data. Ordinary simulation approaches for evaluating the performance of statistical methods do not capture important features of healthcare claims. A statistical framework for creating replicated simulation datasets from an empirical cohort study in electronic healthcare claims data is developed and validated. The approach relies on resampling from the observed covariate and exposure data without modification in all simulated datasets to preserve the associations among these variables. Repeated outcomes are simulated using a true treatment effect of the investigator's choice and the baseline hazard function estimated from the empirical data. As an example, this framework is applied to a study of high versus low-intensity statin use and cardiovascular outcomes. Simulated data is based on real data drawn from Medicare Parts A and B linked with a prescription drug insurance claims database maintained by Caremark. Properties of the data simulated using this framework are compared with the empirical data on which the simulations were based. In addition, the simulated datasets are used to compare variable selection strategies for confounder adjustment via the propensity score, including high-dimensional approaches that could not be evaluated with ordinary simulation methods. The simulated datasets are found to closely resemble the observed complex data structure but have the advantage of an investigator-specified exposure effect.
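A rough Python sketch of the plasmode idea: covariates and exposure are resampled together, unmodified, and only the outcome is simulated from a model whose exposure effect is known. Note the simplification: this uses a logistic outcome model rather than the hazard-based outcome generation described in the abstract. The cohort, the coefficients, and the function names are all hypothetical.

```python
import math
import random

# Hypothetical cohort rows: (age, exposed). Values are made up for illustration.
cohort = [(52, 0), (61, 1), (47, 0), (70, 1), (66, 1),
          (58, 0), (73, 1), (49, 0), (64, 0), (68, 1)]

# Investigator-chosen "truth" (assumed values, not estimated from any real data).
TRUE_LOG_OR = math.log(2.0)   # exposure effect on the log-odds scale
INTERCEPT = -6.0
BETA_AGE = 0.08


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def plasmode_replicate(rows, n, rng):
    """Resample (covariate, exposure) pairs as observed; simulate only the outcome."""
    sim = []
    for _ in range(n):
        age, exposed = rng.choice(rows)   # keeps the covariate-exposure association
        p = expit(INTERCEPT + BETA_AGE * age + TRUE_LOG_OR * exposed)
        y = 1 if rng.random() < p else 0
        sim.append((age, exposed, y))
    return sim


def crude_log_or(sim):
    """Unadjusted log odds ratio from the 2x2 table (0.5 added to each cell)."""
    a = sum(1 for _, e, y in sim if e == 1 and y == 1) + 0.5
    b = sum(1 for _, e, y in sim if e == 1 and y == 0) + 0.5
    c = sum(1 for _, e, y in sim if e == 0 and y == 1) + 0.5
    d = sum(1 for _, e, y in sim if e == 0 and y == 0) + 0.5
    return math.log((a * d) / (b * c))


rng = random.Random(2019)
sim = plasmode_replicate(cohort, 20000, rng)
estimate = crude_log_or(sim)
print("true log OR:", TRUE_LOG_OR, "crude estimate:", estimate)
```

Because the exposed subjects in this toy cohort are older, the crude estimate is confounded by age. Comparing any candidate method's estimate against `TRUE_LOG_OR` across many replicates is the kind of check this approach allows.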

I might be totally misunderstanding what you’re asking though!

kiwiskiNZ · September 30, 2019, 1:34am

Thank you Tim for the paper. It’s interesting. The bootstrapping (with replacement) is something I have utilised in the past. Maybe a very simple illustration will give you an idea why I am thinking about alternatives to a bootstrapped data set.

Consider a data set with one variable X and outcome Y.
When Y = 1, the X values are: 5, 50, 75, 77, 80, 120, 200, 230, 400
When Y = 0, the X values are: 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 9, 10, 12, 12, 25, 30, 34, 45, 55, 60

I’m interested in the extremes of X, particularly the “5” when Y = 1. If I were to bootstrap data sets of the same dimension, then roughly two-thirds of them would contain the “5” with Y = 1 (at least once). However, there would be none with, say, X = 20 and Y = 1. Yet, intuitively, in the population I’m trying to understand there may be a measurement of X = 20 when Y = 1. What I am interested in is simulating data sets that could contain such a combination.
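The inclusion probability can be checked directly, and the contrast with a resample that "spreads" X can be sketched in Python. The smoothed bootstrap below (resample, then jitter X on the log scale) is only one possible way to get new X values and is an assumption here, as are the `rel_sd` noise level and the function names; it is not what synthpop does.

```python
import math
import random

# Toy data from the example above.
x_events = [5, 50, 75, 77, 80, 120, 200, 230, 400]                # Y = 1
x_nonevents = [1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 9, 10, 12, 12,
               25, 30, 34, 45, 55, 60]                            # Y = 0
data = [(x, 1) for x in x_events] + [(x, 0) for x in x_nonevents]
n = len(data)  # 30 rows

# Probability that one specific row appears at least once in a resample of size n.
p_included = 1 - (1 - 1 / n) ** n


def bootstrap(rows, rng):
    """Ordinary bootstrap: sample rows with replacement, same size as the original."""
    return [rng.choice(rows) for _ in rows]


def smoothed_bootstrap(rows, rng, rel_sd=0.5):
    """Resample rows, then multiply X by log-normal noise (hypothetical noise model)."""
    return [(x * math.exp(rng.gauss(0.0, rel_sd)), y) for x, y in bootstrap(rows, rng)]


rng = random.Random(1)
plain = bootstrap(data, rng)
smooth = smoothed_bootstrap(data, rng)

print("P(row included at least once):", round(p_included, 3))
print("event X values, plain bootstrap:", sorted(x for x, y in plain if y == 1))
print("event X values, smoothed:", sorted(round(x, 1) for x, y in smooth if y == 1))
```

The plain bootstrap can only ever reuse the nine observed X values for Y = 1, whereas the smoothed version can produce values in between, such as something near 20. Whether the added noise reflects the population is a separate question that the data alone cannot settle.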

Make sense?

timdisher · September 30, 2019, 11:31am

Ah ok, this makes more sense. There is an indirect comparison method sometimes used to support HTA submissions, called simulated treatment comparison, that might help. The purpose of the approach is to use your own IPD to make comparisons between your technology and another in the population that is more relevant for decision making. One approach that is used is to just take the varcov matrix of your dataset and then simulate a profile of patients from a multivariate normal (mvnorm) with shifted means/SDs to get the population profile you’re looking for.
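A minimal Python sketch of that last step, using only the standard library: estimate the variance-covariance matrix from the IPD, then draw simulated patients from a multivariate normal with the same covariance but means shifted to a target population. The IPD values, the target means, and the function names are hypothetical. Only the means are shifted here; changing the SDs would additionally require rescaling the covariance matrix.

```python
import math
import random

# Hypothetical IPD rows: [age, log biomarker]. Values are made up for illustration.
ipd = [[62, 1.2], [55, 0.8], [70, 2.1], [48, 0.5], [66, 1.7], [59, 1.0]]


def covariance_matrix(rows):
    """Sample means and sample variance-covariance matrix (n - 1 denominator)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(k)] for i in range(k)]
    return means, cov


def cholesky(a):
    """Lower-triangular L with L * L^T = a (a must be positive definite)."""
    k = len(a)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_mvn(target_means, cov, n, rng):
    """Draw n rows from a multivariate normal with the given means and covariance."""
    low = cholesky(cov)
    k = len(target_means)
    draws = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]
        draws.append([target_means[i] + sum(low[i][m] * z[m] for m in range(i + 1))
                      for i in range(k)])
    return draws


observed_means, cov = covariance_matrix(ipd)
target_means = [65.0, 1.5]          # hypothetical target-population profile
rng = random.Random(42)
simulated = simulate_mvn(target_means, cov, 5000, rng)

sim_means = [sum(r[j] for r in simulated) / len(simulated) for j in range(2)]
print("observed means:", observed_means)
print("simulated means:", sim_means)
```

The multivariate normal is a strong assumption for skewed variables such as biomarker concentrations, which is why the sketch works with a log-transformed value.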