I have become a great believer in simulating data in order to pre-specify models and ensure that your model can actually capture what you set out to model. McElreath takes this to its natural conclusion using Stan and R in his book. Harrell talks about its importance in his course. However, apart from the Bayesian approach in McElreath, I can't find a good guide on how best to actually do it. My question is: is there a good book, site, or reference on the methodology, and on the pitfalls and pearls? Perhaps McElreath is it, but what if Stan is not an option? Open to any suggestions at all. Kind regards.
Late to the thread. I am all for using simulation to study statistical and machine learning algorithms.
Teaching statistics and machine learning could benefit enormously from using more simulation. Just turn the usual approach upside down: start by simulating realistic datasets, then figure out how the different methods and algorithms perform on them.
What happens if the variances are not equal across groups? Find out! What happens if the data is skewed? What if a time series has a drop in level? And so on (see the sketch below for one example).
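As a minimal sketch of that "simulate and see what breaks" idea, here is one way it might look in R for the unequal-variances case: both groups share the same mean, so any significant result is a false positive, and we compare the classical pooled t-test against Welch's version. The sample sizes, variances, and number of replications are arbitrary choices of mine, not anything from a reference.

```r
set.seed(42)

n_sims <- 5000
n1 <- 10    # small group with the LARGE variance
n2 <- 40    # large group with the small variance
sd1 <- 4
sd2 <- 1

p_student <- numeric(n_sims)
p_welch   <- numeric(n_sims)

for (i in seq_len(n_sims)) {
  x <- rnorm(n1, mean = 0, sd = sd1)   # both groups have mean 0,
  y <- rnorm(n2, mean = 0, sd = sd2)   # so rejections are false positives
  p_student[i] <- t.test(x, y, var.equal = TRUE)$p.value   # classical pooled t-test
  p_welch[i]   <- t.test(x, y, var.equal = FALSE)$p.value  # Welch's t-test
}

# Nominal level is 0.05; compare the realized false positive rates.
mean(p_student < 0.05)  # pooled test: inflated when the small group has the big variance
mean(p_welch < 0.05)    # Welch: stays close to 0.05
```

You can rerun the same loop with skewed data (e.g. `rexp()` or `rlnorm()`) or swap in a time series with a level shift and see which methods still behave.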
I think that studying data is, in a way, far more useful than studying algorithms. Start by recognizing data, and then learn how the algorithms handle it, or fail to.
Another good feature of that approach is that you can simulate datasets of any size, so start with a large N. Progressively decrease N until your great new algorithm breaks. Gold-standard performance of an algorithm is assessed by simulating a separate, huge dataset for validation (a sketch of this follows below).
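Here is a small R sketch of that shrinking-N idea under assumptions I picked for illustration (a linear data-generating process with only two real predictors, ordinary least squares as the "algorithm", and arbitrary sample sizes): fit on progressively smaller training sets and judge each fit against one huge simulated validation set.

```r
set.seed(123)

p <- 20  # number of predictors; only two of them truly matter

simulate_data <- function(n, p) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  beta <- c(2, -1, rep(0, p - 2))                     # true coefficients
  y <- as.vector(X %*% beta + rnorm(n, sd = 1))       # linear signal plus noise
  data.frame(y = y, X)
}

# Huge validation set: the "gold standard" benchmark for out-of-sample error.
validation <- simulate_data(100000, p)

# Shrink the training set until the method visibly breaks down.
for (n_train in c(5000, 1000, 200, 50, 25)) {
  train <- simulate_data(n_train, p)
  fit   <- lm(y ~ ., data = train)
  pred  <- predict(fit, newdata = validation)
  rmse  <- sqrt(mean((validation$y - pred)^2))
  cat(sprintf("n = %5d  validation RMSE = %.3f\n", n_train, rmse))
}
```

With plenty of data the validation RMSE sits near the noise level; as N approaches the number of parameters it deteriorates badly, which is exactly the kind of breakdown the exercise is meant to expose.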