I am planning a time-to-event analysis. The event is mortality. The exposure is obesity with or without metabolic risk factors. N=16,000. My first idea was to use a traditional survival model such as Cox PH. However, while digging into the literature on both the health topic and time-to-event analysis, I came across decision trees, and specifically tree-structured survival analysis (TSSA). As I understand it, TSSA generates data-driven subgroups with differing mortality rates, i.e. a decision tree. Hence it should be possible to examine which combination of risk factors carries the highest risk of mortality, while allowing for more complex, data-driven interactions, non-linearity etc. On the other hand, I cannot find many studies using this method, and I also fear overfitting.
- Does anyone have any experience with decision tree vs regression survival analysis?
- Which package in R could be recommended?
Those are not decision trees but are classification and regression trees (CART), also called recursive partitioning. The reason that bagging, boosting, and random forests exist is the unsatisfactory, unstable performance of such trees. Among other things, trees assume that continuous variables have discontinuous effects with easy-to-find points of discontinuity. They require enormous sample sizes (much larger than 16,000). The easiest way to see this is to note that, in the best possible situation, the sample size needed to estimate an interaction is $4\times$ the sample size needed to estimate a main effect, and trees are searching for lots of interactions. Ironically, the interactions found this way are seldom validated.
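The $4\times$ figure can be checked with a short variance calculation. This is only a sketch under simplifying assumptions that are not in the post itself: a balanced 2-by-2 layout of two binary factors, $n$ subjects in total ($n/4$ per cell), and a continuous response with common variance $\sigma^2$. The symbols $\bar{y}_{ab}$ (cell means) and $\bar{y}_{a\cdot}$ (marginal means) are introduced here for illustration.

$$
\operatorname{Var}(\text{main effect}) = \operatorname{Var}\left(\bar{y}_{1\cdot} - \bar{y}_{0\cdot}\right) = \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} = \frac{4\sigma^2}{n}
$$

$$
\operatorname{Var}(\text{interaction}) = \operatorname{Var}\left[(\bar{y}_{11} - \bar{y}_{10}) - (\bar{y}_{01} - \bar{y}_{00})\right] = 4 \cdot \frac{\sigma^2}{n/4} = \frac{16\sigma^2}{n}
$$

The interaction estimate has four times the variance of the main effect estimate, so matching its precision takes $4n$ subjects. Unbalanced cells or a time-to-event outcome would typically make the ratio worse, not better.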
Single trees are appealing and interpretable only because they are misleading. I go into this a bit more in my RMS course notes.
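The instability point can be illustrated with a small simulation. This is a rough sketch, not a survival tree: it uses a continuous response and a single split (a one-node regression tree), and everything in it is hypothetical, including the function name `best_split`, the BMI-like range of 18 to 45, the slope, and the noise level. The true effect is smooth and linear, so there is no real threshold to find, yet each bootstrap resample still reports one.

```python
import random


def best_split(xs, ys):
    """Return the cut point on x that minimizes the within-group sum of squares of y."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)
    best_cut, best_sse = None, float("inf")
    left_sum = left_sq = 0.0
    for i in range(n - 1):
        x, y = pairs[i]
        left_sum += y
        left_sq += y * y
        if pairs[i + 1][0] == x:
            continue  # cannot split between tied x values (common in bootstrap samples)
        n_left, n_right = i + 1, n - i - 1
        right_sum, right_sq = total - left_sum, total_sq - left_sq
        sse = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
        if sse < best_sse:
            best_sse = sse
            best_cut = (x + pairs[i + 1][0]) / 2  # midpoint between adjacent x values
    return best_cut


rng = random.Random(2024)
n = 500
x = [rng.uniform(18, 45) for _ in range(n)]   # hypothetical BMI-like predictor
y = [0.1 * xi + rng.gauss(0, 1) for xi in x]  # smooth linear effect, no true threshold

# Refit the single split on 200 bootstrap resamples of the same data.
cuts = []
for _ in range(200):
    idx = [rng.randrange(n) for _ in range(n)]
    cuts.append(best_split([x[i] for i in idx], [y[i] for i in idx]))

print(f"bootstrap cut points range from {min(cuts):.1f} to {max(cuts):.1f}")
```

If the "optimal" cut point moves noticeably from one resample to the next, a subgroup boundary read off a single tree deserves little trust.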
Thanks a lot for the clarification and fast, informative reply! It makes sense that single trees, which are easily interpretable, are misleading, especially with a relatively small sample and mostly continuous variables. This was my main worry - modelling of noise and not actual relationships. What about random forest survival analysis? I get the impression from your notes that these methods are more stable and better in terms of prediction. But interpreting them is difficult, or impossible, which then again makes them undesirable from a classical epidemiological point of view where the intention is to explore, explain and interpret relations between variables - not mainly to predict.
Fixing the problems of single trees with methods such as random forests makes the results completely uninterpretable, so trees then lose all their advantages. I would therefore stick to well-specified, flexible regression models that do not assume linearity.
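One common way to relax linearity in a regression model is to expand a continuous predictor into a restricted cubic spline basis and enter the resulting columns as covariates, for example in a Cox model. The sketch below shows only the basis construction in plain Python, using the truncated power form with linear tails. The function name `rcs_basis`, the knot locations, and the BMI-like values are hypothetical, and the scaling by the squared knot range is one common normalization, assumed here rather than taken from the thread.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns [x, s_1, ..., s_{k-2}] for k knots. The nonlinear terms are zero
    below the first knot and linear in x above the last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def pos3(u):
        # Truncated cubic: (u)_+^3
        return max(u, 0.0) ** 3

    scale = (t[-1] - t[0]) ** 2  # assumed normalization to keep terms on the scale of x
    d = t[-1] - t[-2]
    basis = [x]
    for j in range(k - 2):
        term = (
            pos3(x - t[j])
            - pos3(x - t[-2]) * (t[-1] - t[j]) / d
            + pos3(x - t[-1]) * (t[-2] - t[j]) / d
        )
        basis.append(term / scale)
    return basis


# Hypothetical knots for a BMI-like predictor.
knots = [20, 25, 30, 40]
for bmi in (18, 27, 35, 44):
    print(bmi, [round(b, 4) for b in rcs_basis(bmi, knots)])
```

With four knots the predictor costs three regression coefficients, and the fitted effect is a smooth curve rather than a step at a data-chosen cut point. Interactions can be added by multiplying basis columns with other covariates, at the sample-size cost discussed above.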