Simulating data - reading suggestions

sadneurons · December 22, 2022, 4:55pm

Dear DataMethods community,

I have become a great believer in simulating data in order to pre-specify models and to ensure your model can actually model what you’re setting out to model. McElreath takes this to its natural conclusion using Stan and R in his book. Harrell talks about its importance in his course. However, apart from the Bayesian approach in McElreath, I can’t find a good guide on how best to actually do it. My question is: is there a good book, site, or set of references on the methodology, with its pitfalls and pearls? Perhaps McElreath is it, but what if Stan is not an option? Open to any suggestions at all. Kind regards.

eliast · December 22, 2022, 5:14pm

This book by Gelman et al. has a section on this with some examples.

f2harrell · December 22, 2022, 8:05pm

For pure programming stuff see R Workflow - 18 Simulation

LRSamuels · January 6, 2023, 4:57pm

“Using simulation studies to evaluate statistical methods” by Morris et al. might be the kind of thing you’re looking for (https://onlinelibrary.wiley.com/doi/10.1002/sim.8086).

nfultz · January 7, 2023, 4:54pm

The DeclareDesign book has some worked examples with code in the latter half.

kiwiskiNZ · January 16, 2023, 8:11pm

If you are working in R and want to create simulated data sets based on actual data, then I’ve found the package synthpop quite useful.

sadneurons · January 27, 2023, 10:19pm

Replying to my own message may be a faux pas, but I’ve discovered a couple of resources that may be of use to others on this journey:

The Simulation Summer School with videos here: https://www.youtube.com/playlist?list=PLvv6KTS5tb3pHALXIq0mSwHnZY1SPIBLR

AND

Andrew Heiss’s web course, Program Evaluation, which has a causal focus; see “The ultimate guide to generating synthetic data for causal inference”.

Hope these are helpful for others.

Ross

Sanne · November 16, 2023, 4:16pm

Late to the thread. I am all for using simulation to study statistical and machine learning algorithms.

Teaching statistics and machine learning could benefit enormously from using more simulation. Just turn it all upside down. Start simulating realistic datasets, and figure out how the different methods and algorithms perform.

What happens if the variances are not equal across groups? Find it out! What happens if the data is skewed? What if a time series has a drop in level? And so on.
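A minimal sketch of the unequal-variances question, using only the Python standard library. Everything in it is an assumption chosen for illustration: two normal groups with no true difference, group sizes of 20 and 80, and the classic pooled-variance t statistic compared against an approximate 5% critical value for 98 degrees of freedom. It estimates how often the test falsely rejects when the standard deviations are equal and when the smaller group is three times as variable.

```python
import random
import statistics

# Approximate two-sided 5% critical value of Student's t with 98 df.
# It matches the default group sizes below (20 + 80 - 2 = 98) and must
# be changed if the group sizes change.
T_CRIT = 1.984


def pooled_t(x, y):
    """Classic two-sample t statistic that assumes equal variances."""
    n1, n2 = len(x), len(y)
    pooled_var = (
        (n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)
    ) / (n1 + n2 - 2)
    se = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (statistics.fmean(x) - statistics.fmean(y)) / se


def false_positive_rate(sd1, sd2, n1=20, n2=80, n_sims=2000, seed=1):
    """Share of simulated null datasets in which the pooled test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Both groups have mean 0, so every rejection is a false positive.
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if abs(pooled_t(x, y)) > T_CRIT:
            rejections += 1
    return rejections / n_sims


rate_equal = false_positive_rate(sd1=1.0, sd2=1.0)
rate_unequal = false_positive_rate(sd1=3.0, sd2=1.0)
print(f"equal SDs:   false positive rate = {rate_equal:.3f}")
print(f"unequal SDs: false positive rate = {rate_unequal:.3f}")
```

Under these assumptions the first rate should sit near the nominal 5%, while the second should be clearly inflated, because the pooled variance is dominated by the larger, less variable group. Swapping in skewed draws or other group sizes (with a matching critical value) answers the other questions the same way.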

I think that studying data is, in a way, far more useful than studying algorithms. Start by learning to recognize the data, and then learn how the algorithms deal with it, or fail to.

f2harrell · November 17, 2023, 12:38pm

Another good feature of that approach is that you can simulate datasets of any size, and start with a large N. Progressively decrease N until your new great algorithm breaks. Gold standard performance of an algorithm is assessed by simulating a separate huge dataset for validation.
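The shrinking-N idea can be sketched roughly as follows, again with the standard library only. The data-generating model (a straight line with normal noise), the training sizes, the number of replications, and the helper names `simulate`, `fit_ols`, and `validation_mse` are all hypothetical choices made for this example. The "algorithm" here is plain least squares, which stands in for whatever method is being stress-tested.

```python
import random

# Assumed "true" data-generating values for this illustration.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 1.0, 2.0, 1.0


def simulate(n, rng):
    """Draw n (x, y) pairs from the assumed linear model."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ys


def fit_ols(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def validation_mse(intercept, slope, xs, ys):
    """Mean squared prediction error on the held-out validation data."""
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(2023)

# One separate, large dataset used only to score the fitted models.
val_x, val_y = simulate(20_000, rng)

results = {}
for n in [1000, 100, 20, 5]:  # start large, then shrink the training set
    mses = []
    for _ in range(20):  # average over replications to smooth out luck
        intercept, slope = fit_ols(*simulate(n, rng))
        mses.append(validation_mse(intercept, slope, val_x, val_y))
    results[n] = sum(mses) / len(mses)
    print(f"N = {n:4d}: mean validation MSE = {results[n]:.3f}")
```

With a large training set the validation error should sit close to the noise variance (1.0 under these assumptions), and it should drift upward as N shrinks. The N at which the error becomes unacceptable is a rough answer to when the method "breaks" for data like these.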

Sanne · November 17, 2023, 2:10pm

Yes, a lot of algorithms work with unlimited data.