Fair enough, and I’ll take your word for it about PH models; my understanding was that only parametric survival models allow one to make predictions about expected time-to-event for an individual.
I guess one reason I, as a clinician-investigator, have been drawn to ML methods is that, whether it can be done easily or not, the vast majority of observational studies using conventional statistics neglect to perform external validation. As a result, the literature is overrun with studies showing ‘significant’ associations between risk factors and disease, most of which are only later invalidated by RCTs (the nutrition-science and vitamin-supplement fields are among the worst offenders). My understanding of the math of many ML methods, including deep learning, is that they are generally just regression models with nonlinear basis expansion, so it really isn’t the technique that differs from statistics, but rather the absolute devotion to external validation (i.e., prediction) over all other considerations. I find this a refreshing change from the weekly research article cited by Cardiosource claiming that mixed nuts, but not almonds, are associated with incident coronary artery disease in Swedish men (’…future studies are needed to examine this risk factor prospectively…’). That some deep learning methods provide seemingly ‘magical’ capabilities, for example in voice or image recognition, is all the more exciting, although I think skepticism is appropriate about whether this ‘magic’ can be applied to clinical data.
In fairness, there are approaches to exploring the individual effect of a single risk factor in an ML model using feature importance functions (I believe there are several methods for this, depending on the ML model; one approach examines the change in accuracy with that feature left out), and even for neural networks investigators are working on ways to learn more about what a model is ‘learning’ (Local Interpretable Model-Agnostic Explanations (LIME) is one application). I’d be interested to hear what statisticians think about the validity of these methods; no doubt this knowledge would be necessary before most clinicians would be comfortable with their application.
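For concreteness, here is a rough sketch of the kind of feature-importance check I have in mind, using permutation importance (a close cousin of the leave-one-feature-out idea) in scikit-learn; the dataset and model below are purely illustrative, not from any study mentioned here.

```python
# Permutation importance: how much does the test-set accuracy drop when one
# feature's values are scrambled? Dataset and model choices are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=0)

for name, mean, sd in zip(X.columns, result.importances_mean, result.importances_std):
    print(f"{name}: mean drop in accuracy {mean:.3f} (+/- {sd:.3f})")
```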
Finally, I wouldn’t overestimate the need to fully understand the mechanism of a drug, procedure, or predictive model in order for clinicians to apply it in practice. At least within my field of cardiology, the history is chock-full of medications that we only learned were effective from their impact on outcomes, and only later went back to the lab to try to understand why they work on a mechanistic level. The most effective treatment for the heart rhythm disorder atrial fibrillation is a procedure called pulmonary vein isolation, and frankly we only partially understand why it is so effective. If an ML model were highly accurate in prediction (and had been validated prospectively), I have no doubt that clinicians would be more than comfortable adopting it before fully understanding the relative contribution of the individual predictors.
If censoring isn’t too high, Cox can predict median and mean life length. But it can always predict incidence/survival probabilities up until follow-up runs out.
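For concreteness, a minimal sketch of those predictions in code; the lifelines package and its bundled example data are my choices here, purely for illustration.

```python
# Fit a Cox PH model, then ask for survival probabilities up to the end of
# follow-up and the median survival time where the curve crosses 0.5.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi   # example recidivism data shipped with lifelines

df = load_rossi()                            # 'week' = follow-up time, 'arrest' = event indicator
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

new_subjects = df.drop(columns=["week", "arrest"]).head(3)
surv = cph.predict_survival_function(new_subjects)   # S(t | x), only up to the longest follow-up
print(surv.tail())
print(cph.predict_median(new_subjects))               # inf if S(t) never falls below 0.5
```

Note that `predict_median` returning infinity for some subjects is exactly the "censoring too high" situation: the estimated curve never reaches 0.5 within the observed follow-up.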
I don’t know about publication statistics, but ML and statistical models are not really different with respect to either the need for validation or the methods used for it.
I think you’d be surprised how wide the confidence intervals are for these importance metrics. The data are often not only not rich enough to derive reliable predictions, but are not rich enough to tell you how predictions were derived.
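A rough way to see this for yourself, under purely illustrative choices of data and model: bootstrap the rows, refit, recompute a permutation importance, and look at the spread.

```python
# Bootstrap the sampling variability of a permutation-importance estimate.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
importances = []
for _ in range(100):                                    # bootstrap replicates
    idx = rng.integers(0, len(y), len(y))               # resample rows with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)          # held-out (out-of-bootstrap) rows
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[idx], y[idx])
    res = permutation_importance(model, X[oob], y[oob], n_repeats=5, random_state=0)
    importances.append(res.importances_mean[0])         # track a single feature's importance

lo, hi = np.percentile(importances, [2.5, 97.5])
print(f"Bootstrap 95% interval for feature 0's importance: [{lo:.3f}, {hi:.3f}]")
```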
I have doubts that ML would do better than a statistical model in that setting, unless you are doing image analysis.
The sample size needed for these is roughly the same. Poor predictive accuracy is related to inability to estimate the model’s parameters (including high standard errors for them). Overfitting is related to model parameter overinterpretation.
Hi Frank! I would like to follow up on your concerns with classification. I am not committed to a conclusion myself; instead I’m interested to confirm or deny a few of my thoughts to advance my understanding (and hopefully that of others in the process) about the differences between classification and regression. Not from a definitional standpoint (we all know what classifiers and regressors do), but going beyond that: exactly how do they differ, if at all, and what do we think about that? So I offer the following scenario and thoughts, which in some cases may suggest less of a difference, and in other cases more of a difference…
Scenario: if the only outcome (especially ground-truth outcome) I have is a binary event (stroke, amputated foot, remission, readmission), and it is retrospective data, so I cannot prospectively gather something more useful, and there aren’t other useful indicators of state, and I believe that the question is still worth pursuing after all of that, then…
From a classifier I can (usually) obtain the probability of error for a specific prediction (which may or may not be a true probability). This suggests less of a difference. By “not a true probability”, I mean that I suspect some of these outputs are merely squashed to the range [0, 1] and are not based on any justified/meaningful probability theory; e.g., I suspect this is the case with Platt’s method for support vector machines, but I will have to read it to check.
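To make the Platt point concrete, here is a sketch of how scikit-learn implements it (simulated data, my choice of estimator): the SVM’s unbounded decision scores are squashed to [0, 1] by fitting a logistic (sigmoid) curve to them on held-out folds.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = LinearSVC(max_iter=5000)                                  # outputs unbounded scores only
platt = CalibratedClassifierCV(svm, method="sigmoid", cv=5)     # Platt scaling on held-out folds
platt.fit(X_train, y_train)
print(platt.predict_proba(X_test)[:5, 1])                       # scores mapped into [0, 1]
```

Whether those mapped scores behave as honest probabilities is an empirical question; a calibration curve (e.g., `sklearn.calibration.calibration_curve`) is one way to check.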
Most classifiers are regressors with a threshold (with a few exceptions, but let us avoid that distraction). The probability of error for an instance is based on the underlying classification score, which is then dichotomized, so that probability of error is really a regression output. Again this seems like less of a difference.
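A tiny illustration of that point, with simulated data and logistic regression standing in as the underlying regressor:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

p = clf.predict_proba(X)[:, 1]                        # the regression output: P(Y = 1 | X)
hard = clf.predict(X)                                 # the "classification"
print(np.array_equal(hard, (p > 0.5).astype(int)))    # True: classifying = thresholding the regression
```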
However, a classifier may seek to separate (or shatter) the data according to the binary outcome, which may or may not be suited to a problem that is trimodal or otherwise more complex. Perhaps there are competing-risk outcomes (usually there are). Perhaps there is censoring of all types: left, right, interval. So this suggests more of a difference.
And when a classifier seeks to separate (or shatter) the data, even as it does so with an underlying regressor, that objective may be, or is, different from that of other regressors, which are trying to estimate, not separate. So this also suggests more of a difference.
But given the scenario, unless I have the ability to redesign and execute the study, find better outcomes, or have the option to say it is not worth doing, then it seems like classification is the only option, with (pseudo) probability of error as the tendency you recommend. Is that correct? Or have I missed an option for the scenario that is glaringly obvious?! Or have I missed one or more subtle options? Can you help me understand more about the available options and concepts (not about the risk; I get that)? I understand the loss of information in dichotomization and how the real question is about tendency or risk, but do I always, or more often than I think, have that option?
Please let me know.
Cheers, André
As an addendum, I conflated shattering with binary separation. Shattering, as in discrimination, is clearly an objective of regressors in estimation as well (along with calibration, measurable interpretability/parsimony, etc.).
Thanks for writing. I think the main issue is confusion in the literature over the meaning of the word classification. The correct terminology in the analytical context is to consider classification as an action. For example you might engage in classification on an image to classify it as the letter R or the letter P and not wish to admit your uncertainty in doing so. This is in distinction to something that has already occurred, i.e., that has been classified, where we are not given the option of knowing more. For example we may be given only the presence or absence of a disease and not be granted the severity of the disease.
So I take classification to mean a forced choice into a discrete bin. In the pre-modeling phase, a classification is a discrete categorization.
In the vast majority of situations, even when handed a classified outcome, the best approach is to estimate the tendency for the outcome to be in a given category. This means probability estimation. Whether one wants to turn that into a classification, should a utility function become available, is for later consideration.
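As a sketch of that later step (the costs below are invented purely for illustration): once a utility or cost function becomes available, the estimated probability is turned into a classification by minimizing expected loss, not by a fixed 0.5 cutoff.

```python
def decide(p_event, cost_false_negative=10.0, cost_false_positive=1.0):
    """Act (e.g., treat) when the expected cost of acting is lower than that of waiting."""
    expected_cost_act = (1 - p_event) * cost_false_positive
    expected_cost_wait = p_event * cost_false_negative
    return "act" if expected_cost_act < expected_cost_wait else "wait"

# Equivalent threshold: act when p_event > C_FP / (C_FP + C_FN), about 0.091 here.
for p in (0.05, 0.09, 0.20, 0.60):
    print(p, decide(p))
```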
Ok, this helps. I think I am starting to understand. In my scenario, I only have a discrete outcome for supervised learning/optimization. So my (flawed) tendency is to apply a classifier, or classification as a tool rather than as a post-facto action, and look for the resultant class uncertainty (or class noise, or Bayes error) after the fact, rather than attempting to model uncertainty. So, to model uncertainty, your ML vs. Stats article says that is done by modeling data probabilistically, which I took to mean a joint density (or marginals plus a covariance matrix), but you may also mean other things. So I need to look into a catalog of ways to model class uncertainty. It is unclear to me whether adding error terms in models for measurement noise (based on assumptions) counts toward this particular uncertainty objective or not. Thanks.
I think of error terms only for continuous Y. And there are many machine learning approaches that use “probability machines” instead of forced-choice classification.
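A rough sketch of one such probability machine, under illustrative choices: a regression forest fit to the 0/1 outcome, so its output estimates P(Y = 1 | X) directly instead of issuing a forced class label.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

X, y = make_classification(n_samples=2000, random_state=2)
prob_machine = RandomForestRegressor(n_estimators=500, random_state=2).fit(X, y.astype(float))

p_hat = prob_machine.predict(X[:5])   # estimated event probabilities, not class labels
print(p_hat)                          # stays in [0, 1] because leaves average 0/1 outcomes
```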