This is a place to discuss Road Map for Choosing Between Statistical Modeling and Machine Learning.
Archive
These comments and replies were made in 2018 using Disqus on a previous platform used by fharrell.com.
Danilo Orlando: “to sharpen the discussion by having a somewhat concrete definition of ML as a method without ‘specialness’ of the parameters, that does not make many assumptions about the structure of predictors in relation to the outcome being predicted, and that does not explicitly incorporate uncertainty (e.g., probability distributions) into the analysis”
This leaves me really perplexed… Bayesian models do explicitly incorporate uncertainty.
Frank Harrell: I’m not clear on the dilemma. Bayesian models are not ML; they are SMs. And I was referring to basic uncertainties about the outcome variable, e.g., modeling tendencies (probabilities) instead of making choices, plus estimating uncertainties in parameter estimates (in either the frequentist or Bayesian sense).
Danilo Orlando: OK, so you do not consider LDA, Gaussian processes, or even a simple Bayes classifier as “part of ML.” I just think you draw too hard a line in this regard.
FH: Guilty as charged. I’m trying to draw a somewhat hard line to facilitate thinking and discussion. Gaussian processes are good examples of how this can get unclear. As an aside, I’d like to see an example where a simple Bayes classifier is compatible with optimal decision making. I’ve never heard anyone claim that linear discriminant analysis (LDA) is part of ML. LDA is as statistical as you can get, and was obsoleted by logistic regression for most purposes. Note that you can get logistic regression using Bayes’ rule to invert LDA. LDA assumes multivariate normality of features, and the posterior probability of class membership given multivariate normality is exactly the logistic model.
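For concreteness, the LDA-to-logistic inversion can be written out (notation added here for illustration). With two classes, $x \mid Y = k \sim N(\mu_k, \Sigma)$ with a common covariance matrix $\Sigma$ as LDA assumes, and prior class probabilities $\pi_0, \pi_1$, Bayes’ rule gives exactly a logistic model:

$$
\Pr(Y = 1 \mid x) \;=\; \frac{\pi_1 f_1(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)} \;=\; \frac{1}{1 + \exp\{-(\beta_0 + \beta^{\top} x)\}},
$$

where $\beta = \Sigma^{-1}(\mu_1 - \mu_0)$ and $\beta_0 = \log(\pi_1/\pi_0) - \tfrac{1}{2}(\mu_1 + \mu_0)^{\top}\Sigma^{-1}(\mu_1 - \mu_0)$; the quadratic terms in $x$ cancel because the two covariance matrices are equal.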
Danilo Orlando: Sorry, by LDA I meant Latent Dirichlet Allocation. Linear discriminant analysis is, however, extensively used and taught in any standard machine learning course, the same way logistic regression is considered part of ML in almost any context. And this is why this hard distinction, again, is in my view excessive (though I understand its pedagogic use).
I am not sure I quite understand your question about the Bayes classifier, which is optimal for minimizing classification error (i.e., a 0-1 loss function).
FH: Thanks for the clarification. The Bayes classifier as you described it is not Bayes, because full Bayes would have a utility function to optimize and would never use simple classification error as the utility function. Regarding logistic regression and linear discriminant analysis being called ML methods, let me just go on record as saying this is patently ridiculous and should never have happened. My blog article was in part a correction to the record.
Danilo Orlando:
- 0-1 loss is a utility function…
- even there, it would counter your argument that there is no probability distribution over the outcome.
- I would personally refrain from using terms like “patently ridiculous,” considering these distinctions are subjective in many ways. But that is not my problem. I appreciate other points in the article.
FH: 0-1 loss is a utility function - one of the worst ones ever invented, staunchly at odds with optimum decision making and especially with deferring decisions when information is insufficient. In effect it assumes there is no probability associated with the classification, and in most cases it effectively assumes that the outcome does not have a random component. Details are in the post Classification vs. Prediction on Statistical Thinking (fharrell.com).
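To see how far 0-1 loss is from a real utility function, suppose (illustrative notation) a false positive costs $c_{FP}$ and a false negative costs $c_{FN}$. Acting on a predicted probability $p$ then minimizes expected loss by classifying as positive only when

$$
p \;>\; \frac{c_{FP}}{c_{FP} + c_{FN}},
$$

whereas 0-1 loss hard-codes $c_{FP} = c_{FN}$, i.e., a threshold of $\tfrac{1}{2}$ regardless of the actual consequences, and offers no option to defer the decision when $p$ is too close to the threshold to act on.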
Drew Levy: “… the art of data analysis is about choosing and using multiple tools.” [Regression Modeling Strategies, p. vii]
Frank’s post does us the favor of providing a contrast of SM and ML in terms of fundamental attributes (signal:noise and data requirements, dependence on assumptions and structure, interest in “special” parameters, accounting of uncertainties, and predictive accuracy). This is a clarifying perspective. Despite the prevalent conflation of SM and ML within the rubric of ‘data science’, Frank’s post underscores that SM and ML are different in important ways, and the individual considerations in this contrast should assist us in making deliberate decisions about when and how to apply one approach or the other. This cogent set of criteria helps us better select tools that are fit-for-purpose and serve our particular ends with the best means. Getting clarity about what our real ends are might be the harder part.
To extend the analogy, the guideposts identified by Frank could be illustrated as a route map if put into the format of a series of junctures (and termini); here is an example:
- Do you want to isolate the effect of special variables or have an interpretable model? If yes: turn left toward SM; if no: keep driving …
- Is your sample size less than huge? If yes: park in the space designated “SM”; if no: …
- Is your signal:noise low? If yes: take the ramp toward “SM”; if no: …
- Is there interest in estimating the uncertainty in forecasts? If yes: merge into SM lane; if no: …
- Is non-additivity/complexity expected to be strong? If yes: gun the pedal toward ML; if no: …
Of course this is a farcically simplistic cartoon, and the situation is certainly much more nuanced than this. And there are surely other maps that people could draw and elaborations to make. There are also surprises lurking in the landscape.
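For readers who think in code, the junctures above can be caricatured as a short Python function; the function name, argument names, and the fixed order of the checks are illustrative only:

```python
def route_map(isolate_special_variables: bool,
              huge_sample_size: bool,
              low_signal_to_noise: bool,
              want_forecast_uncertainty: bool,
              strong_nonadditivity: bool) -> str:
    """Toy encoding of the route map: each check mirrors one juncture above."""
    if isolate_special_variables:    # interpretability / special variables
        return "SM"
    if not huge_sample_size:         # sample size less than huge
        return "SM"
    if low_signal_to_noise:          # low signal:noise ratio
        return "SM"
    if want_forecast_uncertainty:    # uncertainty in forecasts is of interest
        return "SM"
    if strong_nonadditivity:         # strong non-additivity/complexity expected
        return "ML"
    return "either"                  # no juncture forced a turn
```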
Notwithstanding their occupying diametric points on various proposed spectra, the issue of a false dichotomy is moot: ML and SM are different. A better question is: are there conditions and ways in which they can be complementary for specific purposes? Are there ways they can be combined? Are they compatible within the domain of modern applied practice? In the general domain of practice, SM and ML displace one another only in a perspective of chauvinistic zero-sum domination. They compete only when their respective advantages under specific conditions and for specific purposes are not understood, that is, under conditions of prejudice or incomplete understanding.
Frank’s roadmap helps resolve this.
Lauren Saag Peetluk: First of all, thanks a bunch for coming to my proposal this morning - always great to hear your questions and feedback. In mulling over your suggestion to consider bigger models and more unsupervised learning techniques, I found myself here.
It seems to me that the data and question I have lend themselves almost exclusively to a statistical modeling framework (non-large signal:noise ratio, imperfect training data, relatively small sample size, isolation of a small number of variables, including the “special” variables of HIV status and severity, and an interpretable model). And I’d like to think that the predictors have largely additive effects, though that might be a stretch.
Which brings me to my question (and apologies if I am mixing two unrelated concepts): are there unsupervised techniques within statistical modeling that I should look into? From what I’ve understood, unsupervised learning exists within machine learning, and in that case I am skeptical of how to justify its use in my project. Though I also recognize the limitations of model selection via lasso or backwards elimination, as they induce uncertainty and effectively limit the sample size, I am unsure what alternative would make more sense.
FH: Hi Lauren - great question. My statistician-biased opinion is that unsupervised learning had its birth in statistics and psychometrics, starting perhaps with principal component analysis. PCA, variable clustering, sparse PCA, factor analysis, correspondence analysis, sliced inverse regression, etc. can all play major roles when the number of candidate features is large and there are correlations among features. The spirit of unsupervised learning is that instead of making close calls in selecting individual features, trying to separate features that are hard to separate, you give up and don’t try to separate them. Combine features according to observed relationships, and estimate more general effects when playing cluster summary scores against Y. For example, we might say that a risk-factor history score relates to incidence of cancer, and such a score might have both age and cigarette smoking in it, because smoking increases with age and so is hard to disentangle from age.
So in a setting where one might be tempted to use the lasso on 1000 candidate features, I’d argue that the result is highly unstable and unlikely to validate. The interpretation of the selected features may be simpler, but it’s simpler only because it is wrong. Instead, reduce the 1000 features down to a few dozen clusters to relate to Y.
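A minimal sketch of that data-reduction workflow in Python (scikit-learn; the data are simulated stand-ins, and in practice one would consider variable clustering or sparse PCA and a proper validation strategy rather than plain PCA alone):

```python
# Sketch: instead of the lasso on 1000 features, reduce the features to a
# few dozen summary scores (unsupervised), then model Y from the scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))   # stand-in for 1000 candidate features
y = rng.integers(0, 2, size=500)   # stand-in binary outcome

# Unsupervised reduction to 30 component scores, then ordinary regression
# on the scores; no per-feature selection decisions are made.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=30),
                         LogisticRegression(max_iter=1000))
pipeline.fit(X, y)
print(pipeline.predict_proba(X[:5]))   # predicted probabilities, not classes
```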
Deepshikha: Can anyone help me understand what “additivity” means in SM?
FH: It’s simplest to talk about with ordinary multiple regression models, a.k.a. linear models. The additivity assumption is that effects add and do not interact, e.g., two predictors do not multiply each other in the model equation. Suppose a model had two predictors: age and height. An additive model would assume that the predicted mean value is an additive combination of age and height, i.e., that these two effects can be separated. Such a model might resemble Y = b0 + b1 * age + b2 * height if age and height both had linear effects on Y.
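Written out, the additive model and a non-additive (interaction) counterpart are:

$$
\text{additive:}\qquad E(Y) = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{height}
$$

$$
\text{non-additive:}\qquad E(Y) = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{height} + \beta_3\,\mathrm{age}\times\mathrm{height}
$$

In the second model, the effect of a one-unit increase in height depends on age, so the two effects can no longer be separated.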
Randy Bartlett: Any Machine Learning for data analysis is necessarily Statistical Modeling. We need another term for non-ML Statistical Modeling, and what would that mean anyway?
FH: I can’t disagree. But random forests and other related tree methods are not statistical models in my view.
Randy Bartlett: Thank you for your reply. We use random forests, et al. on statistics problems. Statistics problems have statistics assumptions. I think a narrow definition of ‘statistical models’ is fine for tool builders and detrimental in the field. It suggests that a ‘modeler’ does not need statistical thinking or training. We are seeing a rise in statistical malfeasance and a problem-based labeling would help a great deal.
FH: As described in my article, it’s clearer to define methods by their characteristics and not by characteristics of the problems to which you apply the methods. I can go either way on calling random forests a statistical method. But it’s not a statistical model. It is an algorithm without data distributional assumptions. So it is much more machine learning than a statistical model.
I’m all for a broad definition of statistical models. I just think that tree methods are much more model-free than most methods.
Randy Bartlett: I get it, and I respect that you want the terminology to match the morphology and taxonomy of the tools … science, clarity, order. I appreciate the idea of calling random forests statistical methods, as ‘methods’ relates better to practice. It would behoove us to insist that ‘stat’ be in the names of all tool sets used for data analysis. Any term void of the ‘stat’ root word is read by nonstatisticians/antistatisticians as ‘we can perform data analysis without knowing statistics’—see and try to believe ‘Data Mining for Business Intelligence’ (book title).
On the other side of this is a fiasco: a society with low statistical literacy, misinformed by rising statistical malfeasance, sold by charlatans and those with good intentions, who claim that they do not need statistics to perform data analysis because they have ML (tool sets with names void of the ‘stat’ root word).
As you might be alluding to above, social media has started listing regression as an ML tool.
FH: I wish I could have stated those issues as well as you did, Randy. Fantastic. I need to remember how often I see people screw up the most basic choices about descriptive statistical summaries. If they can’t get that right, there’s little hope of them getting complex things such as ML right.
Randy Bartlett: Mock News: Smith & Wesson is discreetly buying up vending-machine space in high schools across America in preparation for a new product launch. It is called a ‘plastic toy.’ It has the properties of a handgun without the accuracy and reliability. Do not confuse these plastic toys with guns though; the latter are made of metal and require safety training.
Joking aside, statistical malpractice is a huge problem. In addition to the obvious reduction in career opportunities for graduating statisticians, bystanders do get killed by other people’s plastic toys.
Thank you, Dr. Harrell. Your book is the greatest.
Bill Harris: Is the additivity constraint a bit less important than described here? Can’t non-additive models often be made additive by a log transformation?
FH: An excellent question. I’m presupposing that Y is optimally transformed [actually I use semiparametric models so this doesn’t matter] when using parametric regression so that the right hand side of the model has the highest chance of being additive. Models after optimal Y-transformation can still be non-additive (i.e., have interactions), and it is those interactions that are interesting with respect to the subject matter. Short answer: the additivity assumption is important, but luckily after proper Y-transformation things are more additive than not.
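Bill’s log-transformation point in symbols: a purely multiplicative model is non-additive on the original scale but exactly additive after taking logs,

$$
Y = \alpha\, X_1^{\beta_1} X_2^{\beta_2}\,\varepsilon \quad\Longrightarrow\quad \log Y = \log\alpha + \beta_1 \log X_1 + \beta_2 \log X_2 + \log\varepsilon .
$$

Interactions that are not of this multiplicative form generally survive any single transformation of $Y$, and those are the interactions of subject-matter interest.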
JoAnn A: This article makes important points that are poorly understood in both the ML and statistics communities.
I will add that one thing that cannot be done with ML, though people try, is to “draw insight” from an ML algorithm that was not carefully planned for that purpose, and without a trained statistician who understands the need to examine the variance of estimates, confounding, and a host of other issues.
However, I disagree that “regression models are not ML.” Many ML models are generalizations of regression; for example, linear regression is a special case of neural networks. GLMs can be constructed to make inferences, and they can also be “trained” to make predictions. Machine learning gurus consider logistic regression an important ML algorithm.
FH: Thanks for your good points. On “regression models are not ML” I still hesitate to fully agree. However, one thing that backs up your statement is that you can write neural networks as regression models with high-order interactions. But the training and when to stop training are mainly based on ideas from computer science and not statistics. So I’d still like to separate ML from regression, at least most of the time.
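In symbols (an added illustration of the “special case” point): logistic regression is the degenerate neural network with no hidden layer,

$$
\Pr(Y = 1 \mid x) = \sigma(w^{\top}x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},
$$

while adding a hidden layer, $\sigma\!\big(w_2^{\top}\, g(W_1 x + b_1) + b_2\big)$, is what introduces the high-order interaction structure mentioned above.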
Greg Timpany: I have come to realize there is a time and place for everything. As Professor Harrell highlights, statistical models outperform machine learning in certain situations. At other times it is preferable to let the ML algorithm crunch its way through the data. What I appreciate most are articles like this one that add to the body of understanding, so that we practitioners can more wisely choose which tool to use.
James Beckman: Wonderfully complete! Some relatives of mine would have revelled in the fit for much medical & engineering data. I, however, work on voter and buyer reactions to various kinds of offers. These are very sensitive to the immediate physical setting and to current conceptual issues. I have found that the easiest way to approach them is through comparisons in time/place, as well as some standard methods from public speaking. These, then, are time series with perhaps six variables placed on top.