Rebranding of "Statistics" as "Machine Learning" to Sound More Impactful & Negative Fallout

ADAlthousePhD · January 2, 2019, 3:22pm

This article recently came to my attention:

It is my opinion that calling this a “Machine Learning” approach is yet another example of stretching the term in a naked effort to make a paper seem more sophisticated. The “machine learning” approach here, chi-squared automated interaction detection, has been around for almost 40 years and seems nearly identical to classification and regression tree analysis / recursive partitioning (I realize that some of these terms are used interchangeably depending on where you trained or what specialty you work in). Applying this technique to a dataset with 140 participants and 38 events has virtually no chance of providing meaningful or reproducible results, but by slapping the phrase “Machine Learning” on the paper, the authors managed to turn this into a fairly high-impact publication (this looks to be one of the top 5 or so journals in the Sports Science / Exercise Physiology world).

I wonder if next year’s BMJ Christmas Issue could feature a collaborative effort between some folks that are similarly flummoxed or peeved by the exploding use of the term “Machine Learning” to describe quantitative approaches that may represent a misleading use of the term. One idea would be some sort of “systematic review” of articles published in selected journals in 2018 that used the phrase Machine Learning, to break down precisely what technique(s) were used and called Machine Learning, with some accompanying commentary from an expert on what we really should be calling “Machine Learning” versus what is simply re-packaged or re-branded statistics (I would love to see how many papers call “logistic regression” a “machine learning” technique). Anyone interested in working together on this? If so, I’m happy to get a group of us together and start to hammer out parameters for what we should write about and how.

f2harrell · January 2, 2019, 7:08pm

Extremely well said Andrew. On the technical aspect, automatic interaction detection (AID) was soundly discredited decades ago in the classic book on categorical data analysis by Bishop, Fienberg, and Holland. They provided an example where a beautiful tree resulted from feeding the AID algorithm nothing more than random numbers.
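
To illustrate how a split search can find "structure" in noise, here is a minimal sketch. It is not the AID or CHAID algorithm and not the example from the book: it simulates 140 participants with 38 events (the numbers from the paper discussed above), generates purely random binary predictors, and reports the largest 2x2 chi-squared statistic found across them. The number of predictors (50), the seed, and the function names are assumptions made for illustration.

```python
import random


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty margin means the table carries no information
    return n * (a * d - b * c) ** 2 / denom


def best_noise_split(n=140, events=38, n_predictors=50, seed=1):
    """Return the largest chi-squared statistic among random binary predictors."""
    rng = random.Random(seed)
    y = [1] * events + [0] * (n - events)
    rng.shuffle(y)
    best = 0.0
    for _ in range(n_predictors):
        x = [rng.randint(0, 1) for _ in range(n)]  # pure noise, unrelated to y
        a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
        b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
        c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
        d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
        best = max(best, chi2_2x2(a, b, c, d))
    return best


if __name__ == "__main__":
    # 3.84 is the usual 5% critical value for chi-squared with 1 degree of freedom
    print(f"largest chi-squared among noise predictors: {best_noise_split():.2f}")
```

With many candidate predictors, the largest statistic will often exceed 3.84 even though nothing is related to the outcome, and a recursive partitioning procedure repeats this kind of search at every node.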

All this raises the issue of why journal editors and reviewers are so methodologically naive.

numbersman77 · January 3, 2019, 4:47pm

I could not agree more, very well said!

I have had the same experience in a corporate environment, where everything that carries the label of “statistics” appears “static, boring, old-fashioned” irrespective of how innovative it actually is, whereas using the same (and often much simpler!) methods but labeling them ML, AI, or whatever gets everyone excited.

IMHO, part of the confusion comes from the lack of a clear definition of what ML (and AI) actually should be (or maybe not a lack of definition, but a strongly “scientific-field dependent” definition). To me, a trained statistician, lasso is just a regression with a criterion function different from the usual least squares, but to many it appears that lasso is “ML”.

Your proposal is very good. Does BMJ also have an Easter issue? The earlier such an article is written, the more helpful…

ADAlthousePhD · January 3, 2019, 5:05pm

Agree - this is certainly what we are seeing, the phrase Machine Learning is now a ticket to higher impact publication because, as Frank said, reviewers and/or editors may not have sufficient expertise to critically evaluate these methods. I’ve seen the following joke applied to Big Data, but it can apply to ML as well:

“it’s kind of like teenage sex, everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”

I’m quite certain that some folks, enamored with the idea of Machine Learning or Big Data, are now eagerly sticking that phrase on whatever they’re doing because they know it inspires awe in readers these days.

Just to be clear - I referenced the BMJ Christmas issue because they publish humorous and/or spoof articles (this year’s PARACHUTE trial being an excellent example):

https://www.bmj.com/content/363/bmj.k5094

They only publish “spoof” articles like that once a year. I was considering some sort of cheeky article like “Machine Learning in the Medical Literature: A Systematic Review” in which we could tabulate all of the articles from selected journals in a defined period (calendar year 2018 would suffice) that claimed to use “Machine Learning” and (somewhat sardonically) categorize the analytic approaches for which the authors invoked the term “Machine Learning.”

I am not a “Machine Learning” expert and certainly not the proper person to write a paper which attempts to define the term ML (and its cousin AI) in the context of more classical statistical approaches. Papers like that must already exist - it’s just that folks who want to call their paper “Machine Learning” are going to do that no matter what the authorities say (something I have bemoaned here, on Twitter, and everywhere else is that just because someone has written a really good Methods paper or Perspective piece doesn’t mean everyone is going to pay attention).

If the task is writing a serious paper that attempts to define ML, I would recuse myself from that task because of a lack of appropriate expertise. I would, however, enjoy participating in the “Systematic Review” described above, where we select a group of high-level journals and catalogue all references to Machine Learning in a relatively brief time period (if we are submitting to the BMJ Christmas issue, the purpose is to make a point using some humor, not necessarily achieve top methodological rigor, so we don’t need to go through 10,000 articles here).

A search of the term “machine learning” in PubMed for 2018 (with no other restrictions) turns up 5756 hits. Restricting to the phrase “machine learning” appearing in the title reduces that number to a somewhat more manageable 1588 hits. That seems a manageable enough number to review (especially since, once whatever group agrees to participate has a few more defined criteria, we may discard some of those). If we start with that number and focus on papers which claim to have used Machine Learning to “predict” a specific outcome (such as the paper referenced in the first post), that may be a good direction to take this project.

f2harrell · January 3, 2019, 5:45pm

I tried to address that here.

venkmurthy · January 6, 2019, 9:06am

Really important idea!

Happy to help.

Here are some ideas that spring to mind:

(1) Tabulate use of machine learning as proportion of all papers in journals publishing at least 1 ML/AI paper
(2) Average age of methods used
(3) Whether these are just standard statistics rebranded
(4) Sample size (overall and events)
(5) Journal impact factor
(6) Sub-discipline
(7) Positive/negative endpoints
(8) Correlations among the above

Cherry on top: analyze 8 using standard stats and also with ML!

f2harrell · January 6, 2019, 2:29pm

I have a general question about the value of review articles. One of the most impactful was a review of published RCTs that showed definitively that effect sizes used in power calculations are too large (or power calculations weren’t even done) so that RCTs are on the average underpowered for detecting the right things. Yet this problem remains after that article was published decades ago. Is it worth the effort?
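
As a rough illustration of why an overly optimistic effect size leaves a trial underpowered, here is a minimal sketch using the textbook normal approximation for a two-sided, two-sample comparison of means. The effect sizes (0.5 and 0.25 standard deviations), alpha, and target power are hypothetical numbers chosen for the example, not values taken from the review mentioned above.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def n_per_group(delta, sd=1.0, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a mean difference of delta."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_beta = _Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def achieved_power(true_delta, n, sd=1.0, alpha=0.05):
    """Approximate power with n per group (the negligible opposite tail is ignored)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    return _Z.cdf(true_delta / (sd * sqrt(2 / n)) - z_alpha)


if __name__ == "__main__":
    n = n_per_group(0.5)  # trial planned around an assumed effect of 0.5 SD
    print("planned n per group:", n)
    print("power if the true effect is only 0.25 SD:", round(achieved_power(0.25, n), 2))
    print("n per group actually needed for 0.25 SD:", n_per_group(0.25))
```

Under these assumptions, a trial sized for 0.5 SD has power of roughly 0.29 if the true effect is 0.25 SD, and about four times the sample size would have been needed.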

My personal belief is that we should volunteer for doing more peer review, write blogs, tweet, and embarrass journals over some of the articles they are publishing.

venkmurthy · January 6, 2019, 6:24pm

Would love to read that review sometime!

I do agree with you that one paper like this will not change the tide. That said, I would hope many small steps would have two benefits:

(1) Reduce the hype
(2) Educate people that statistics goes beyond “standard” commonly used methods

ADAlthousePhD · January 7, 2019, 2:56pm

Temporarily leaving the initial topic to riff on this for a moment…

That is a very fair point. Another illustration of this is the continued prevalence of p-values in Table 1 of RCT papers despite articles by Altman, Senn, yourself and others, published beginning in the 1980s and periodically since, explaining why this is nonsensical (one of the frustrating things about people tsk-tsking those of us who chat on Twitter with “This doesn’t seem professional. Why don’t you just write a letter to the editor expressing your concerns?” - clearly those folks haven’t seen what happens after most LTEs pointing out methodological flaws, which is to say, nothing).

I wrote a few letter-to-the-editor type articles early in my career (when I was still figuring some things out…) in reaction to specific problems that I encountered, either in publication, the review process, or just in discussion with folks - the most humorous being an encounter with a group that was using time-to-event analysis for a composite endpoint (MACE, some combination of death, MI, stroke, etc) but wanted to change the “time” to match the “worst” event that the patient had experienced.

Just to be clear, because you might think they were referring to some legitimate use of hierarchical or ordered endpoints: these people truly wanted to classify the composite event as yes/no but use the “time” of the worst event in the list - if a patient had a stroke at 6 months and died at 12 months, they wanted to call the patient an “event” at 12 months (which was the worst event, but not the first occurrence of something included in their composite). I finally got the point across by showing that this approach would make a patient who had a stroke at 6 months and died at 12 months look better than a patient who had a stroke at 9 months and lived to the end of the study - under their desired approach, the former patient would be “event” at 12 months while the latter patient would be “event” at 9 months.
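
The two patients described above can be written out as a small sketch. The severity ranking, the 24-month follow-up, and the function names are assumptions made for illustration; only the event times come from the example.

```python
# Hypothetical severity ranking for the components of the composite endpoint
SEVERITY = {"stroke": 1, "MI": 1, "death": 2}


def time_to_first_event(events, follow_up):
    """Standard composite: time of the earliest component event, else censored."""
    if not events:
        return follow_up, False
    return min(t for t, _ in events), True


def time_to_worst_event(events, follow_up):
    """The flawed proposal: time of the most severe event, else censored."""
    if not events:
        return follow_up, False
    # Most severe event wins; ties are broken by the earlier time
    worst_time, _ = max(events, key=lambda e: (SEVERITY[e[1]], -e[0]))
    return worst_time, True


patient_a = [(6, "stroke"), (12, "death")]  # stroke at 6 months, died at 12
patient_b = [(9, "stroke")]                 # stroke at 9 months, alive at study end

if __name__ == "__main__":
    for name, events in [("A", patient_a), ("B", patient_b)]:
        print(name, time_to_first_event(events, 24), time_to_worst_event(events, 24))
```

Under the "worst event" rule, patient A's event time is 12 months and patient B's is 9 months, so the patient who died appears to have done better than the patient who survived.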

I wrote a brief letter about this which was accepted for publication:

https://www.sciencedirect.com/science/article/pii/S0167527315303004?via%3Dihub

This was before I joined Twitter, but I would guess that this article has barely been noticed by anyone unless they happened to be doing a search for something about survival analysis. Which leads to…

Twitter comes with the advantage of more direct and immediate feedback (versus a published article), but it definitely feels like I’ve done more in getting people to read, think, and ask questions through that medium than any letter to the editor or methods article I could write (in many cases directing them to Frank’s blog or, now, this website). Clearly there is a role for both formal publication and blogging / tweeting / discussion threads such as this forum. You make a good point that such an article may fall on deaf ears - in fact, the reason I considered the “BMJ Christmas Issue” approach rather than a more formal article is that I’m quite sure a formal review of the topic would be ignored by those we are trying to reach - circling back to the original topic of this thread, the people I want to reach are the growing faction that publishes long-since-established-or-even-discredited methods dressed up as “Machine Learning” to trumpet that their findings are novel and/or impactful when in reality the paper inspires very little confidence that the results are reliable or actionable.

I don’t know that it would be the best use of time to write the gag review article that I’ve described…

Agree on (1) for sure. As for (2), yes but - not only that statistics goes beyond what many people realize, but that “Machine Learning” has become such an overused buzzword that it’s an express ticket to publication and in some cases news-reporting, regardless of the quality of the work.

This is in no way meant to put down the concept or legitimate applications of “Machine Learning” but rather to shame people that have figured out they just need an excuse to put the words Machine Learning into a title to get their paper published, cited, etc when that is an absurdly generous description of what they have done.

jkb · January 7, 2019, 9:28pm

Thanks for the link to this article. Suggestions on where/when to use ML/SM will be used as back-up in my discussions with collaborators. I’m a Maths/Comp Sci hybrid - my experience, sweeping generalisation - is that those who are stats-focused consider a question or hypothesis, biological or clinical, whereas others who are more comp sci focus on the methods. I’d be happy to contribute to a review exercise. Cheers, J

Tom_Oates · January 8, 2019, 8:25am

This is a great thread. As a clinician, and definitely not a highly trained statistician, who knows enough to be able to call ‘foul’ on most ML/AI claims in recent medical literature, I think this would be so useful.

I think a lot of medics have a folk view of ‘statistics’ that only encompasses a handful of things such as null hypothesis significance testing and Kaplan-Meier plots, so I am sure many of us would appreciate the chance to be educated by the kind of Christmas BMJ article @ADAlthousePhD suggests.

Also there’s another good quote:

Write grants with artificial intelligence, employing people with machine learning, publish with linear regression

ljubomir · January 8, 2019, 8:40am

Dear Dr. Althouse,

One of your points is that “rebranding” logistic regression as machine
learning is wrong, and is done in order to make one’s research sound
better. I’m sure there is some of that, but I wonder if it is that
simple in general… let me explain why by way of a real-life example
from my work.

My colleagues and I are developing a predictive model for sepsis using
gene expression data. The goal is to develop a classifier which
predicts whether a patient has sepsis based on the gene expression
measurements in white blood cells. To that end, we analyzed our
training data using a pipeline which includes cross-validation,
ensembles and hyperparameter search as key components. We used four
classifiers within the pipeline: XGBoost, neural network (multi-layer
perceptron, MLP), Support Vector Machine (SVM), and logistic
regression (LR). Our goal is strictly maximizing accuracy on future,
previously unseen patient data. In particular, we are not performing
strictly statistical analyses like hypothesis testing. I think it is
fair to say that the first three methods (XGBoost, MLP and SVM) belong
firmly to machine learning domain. Logistic regression, as you stated,
belongs to statistics. If that is so, is our work 75% machine learning
and 25% statistics? That is hard to believe/accept… because the
pipeline that we use is 99% common for all four models. The only
difference is one line of code, where we invoke either XGBoost, MLP,
SVM, or LR learning API. Also, if we dropped LR, would our work then
become 100% machine learning? It seems strange that changing a single
line of code (out of thousands) would shift the scientific nature of
our analysis

Bottom line, it seems to me that the use of logistic regression model
may legitimately be considered predominantly statistics (where it
originated, before machine learning existed) or predominantly machine
learning, depending on how it is used.

Looking forward to different viewpoints

ADAlthousePhD · January 8, 2019, 1:11pm

ljubomir:

My colleagues and I are developing a predictive model for sepsis using
gene expression data. The goal is to develop a classifier which
predicts whether a patient has sepsis based on the gene expression
measurements in white blood cells. To that end, we analyzed our
training data using a pipeline which includes cross-validation,
ensembles and hyperparameter search as key components. We used four
classifiers within the pipeline: XGBoost, neural network (multi-layer
perceptron, MLP), Support Vector Machine (SVM), and logistic
regression (LR). Our goal is strictly maximizing accuracy on future,
previously unseen patient data. In particular, we are not performing
strictly statistical analyses like hypothesis testing. I think it is
fair to say that the first three methods (XGBoost, MLP and SVM) belong
firmly to machine learning domain. Logistic regression, as you stated,
belongs to statistics. If that is so, is our work 75% machine learning
and 25% statistics? That is hard to believe/accept… because the
pipeline that we use is 99% common for all four models. The only
difference is one line of code, where we invoke either XGBoost, MLP,
SVM, or LR learning API. Also, if we dropped LR, would our work then
become 100% machine learning? It seems strange that changing a single
line of code (out of thousands) would shift the scientific nature of
our analysis

This is a fair point, and you invoke several methods that lie outside of my expertise, so I’ll defer to others more familiar with the respective methods.

I should clarify - I do not necessarily believe that any single method must “belong” to either “statistics” or “machine learning” - clearly there is going to be some overlap (Frank’s blog post linked above, “Road Map for Choosing Between Statistical Modeling and Machine Learning”, attempts to navigate this distinction) - nor do I believe that a project must be called one or the other if it uses certain methods.

Please do not mistake my intention in starting this thread to be another battle in a turf war over what belongs to either “statistics” or “machine learning” - rather, I simply find it ridiculous that a simple method that has existed for nearly four decades, has known issues alluded to above by @f2harrell, and was applied to a far-too-small-to-be-reliable-or-reproducible dataset could generate a high-impact publication (this is one of the “best” journals in that sub-field) because the phrase Machine Learning was invoked in the title. The buzzword is enough to get past many editors’ and reviewers’ BS detectors because they believe that “Machine Learning” is a magic key to unlocking insight from small datasets (which it may be in some settings, but almost certainly was not in this paper).

EDIT: something I should have added - perhaps the reason I am so fired up about this is that I feel it demeans or cheapens real work being done in/with Machine Learning. Does that make sense? When someone uses the term “ML” to describe such shoddy work, nobody benefits except the people who scored a publication.

f2harrell · January 8, 2019, 1:18pm

In my view your project got off on the wrong foot based on a common misunderstanding in the machine learning world. It is not appropriate to develop a classifier in this context, i.e., to predict “whether” a patient has sepsis. With a stochastic outcome in a setting in which there is no ground truth, it is not appropriate to classify patients as sepsis or not, as that amounts to a decision and not a prediction. I’ve gone into detail about that here. The sepsis setting is one for which what is needed is tendencies, i.e., probabilities. So I suggest you go back and turn this into a probability estimation task. If in fact you have really done what I suggest, then change the terminology to match that, using terms like probability estimation or risk prediction model.

Note also that classification accuracy is an improper scoring rule that is easily gamed.
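
A small sketch of why accuracy is a weak yardstick for probability forecasts: two sets of predicted probabilities can have identical accuracy at a 0.5 cutoff while differing greatly in quality, which a proper scoring rule such as the Brier score picks up. The outcomes and probabilities below are made-up numbers for illustration only.

```python
def accuracy(probs, outcomes, threshold=0.5):
    """Proportion classified correctly after forcing each probability to 0 or 1."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


outcomes = [1, 1, 0, 0]
informative = [0.9, 0.8, 0.2, 0.1]      # well separated probabilities
barely = [0.51, 0.51, 0.49, 0.49]       # just on the right side of the cutoff

if __name__ == "__main__":
    for name, probs in [("informative", informative), ("barely", barely)]:
        print(name, accuracy(probs, outcomes), round(brier_score(probs, outcomes), 4))
```

Both forecasts score 100% accuracy here, but their Brier scores are 0.025 and about 0.24, so accuracy cannot tell them apart.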

To your main point, yes I’d say your procedure is 3/4 machine learning. On a big picture note, both machine learning and statistical models will advance when each field learns from each other. Abandoning probabilities when tendencies are of key interest is not a good example of this (if that’s what you did).

As a side issue I would love to know the predictive discrimination from differential white blood cell counts in comparison to the discrimination offered by gene expression.

ljubomir · January 8, 2019, 1:39pm

Clarifications: (1) We do perform probability estimation, not binary classification. I did not want to get into too much detail because my posting was already a bit long. (2) We do not use misclassification rate as a metric. Again, I used “accuracy” as shorthand so as not to distract readers. We use a dozen or more different measures that we think are clinically relevant for diagnosing sepsis. Thank you for pointing these out.

The test performance will be published when we have robust results we are ready to share

And I would definitely like to learn from world-class statisticians… that is why I am on this forum

ljubomir · January 8, 2019, 1:41pm

This makes sense to me, appreciate the clarifications

f2harrell · January 8, 2019, 2:01pm

Thanks for the clarification. Though “classification” is frequently used to describe the task, it is not appropriate. Using appropriate terminology is important. Machine learning practitioners seem to have made the mistake of using an active word “classification” just because the outcome being predicted was passively classified itself. Clearer: an instrument to predict P(Y=1), where Y = 0 or 1.

samw235711 · January 8, 2019, 7:48pm

The terminology is especially important because calling a classifier a predictive model can trick the end-user into thinking the 1 or 0 is a probability. The end-user might be a doctor making a life-or-death decision in this case.

Michael.a.rosenberg · January 9, 2019, 1:18am

I’ll avoid the classification controversy that has been raised in other forums (I’m sensing a trend here…)

From an outsider’s perspective, it seems to me that the controversy about ML and statistics is not so much about the modeling approach as about the overall goal of the model. Broadly, it seems that most models are centered around two goals, inference and prediction. Inference here means understanding how various exposures relate to a given outcome. Prediction means predicting future outcomes (or outcomes in a separate dataset). Classical statistics/epidemiology has largely focused on inference in order to understand what risk factors are more or less important for predicting risk of a given disease. In fact, the widely used Cox PH model is only good for inference, as it only provides information about how risk factors relate to each other (proportionately) over time. In contrast, many ‘black-box’ ML methods, especially deep learning, are focused only on predicting future events. Because of this, the centerpiece of ML involves data splitting, and balancing the bias-variance tradeoff between underfitting and overfitting a model, with the aim of predicting well in a separate dataset.

Which approach is better for healthcare? I’d suggest that both are needed. However, just for fun, suppose I tell you that I’ve developed an algorithm to predict who will win each NFL playoff game, and ask you to place a $100k bet based on one of two models:

One, I demonstrate a model that includes relevant data such as how many yards the quarterbacks passed in prior games, how the defenses defend the run, and how many playoff wins each coach has over his career. I show how much the model depends on each of these factors, and the confidence interval around each based on the past 20 seasons. However, because I’ve used the last 20 years of data to build my model, I’m not exactly sure how well it predicts.

Two, I show you a model that is composed of a collection of factors whose relationship to the outcome I’m not sure of, but that I’ve tested separately on playoff games over the past 5 years and found to be 95% correct.

Which model would you use to wager? (PS, you have to wager the entire amount, no bets proportional to probability…)

f2harrell · January 9, 2019, 1:46pm

I’m afraid that you have a misconception about statistical models. Don’t feel bad; this misconception is shared by many. Statistical models are equally good for prediction as for inference about individual variables. The Cox PH model is a good example. It has been used successfully countless times for predicting survival probabilities at fixed time horizons, or predicting entire survival curves.
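
As a minimal sketch of how a fitted Cox PH model yields a survival probability at a fixed time horizon: under proportional hazards, S(t | x) = S0(t) ^ exp(x·β), where S0(t) is the baseline survival at time t. The baseline value, coefficient, and covariate below are hypothetical numbers for illustration, not estimates from any real model.

```python
from math import exp, log


def cox_survival_prob(baseline_survival, coefs, x):
    """Survival probability at a fixed horizon: S0(t) raised to exp(linear predictor)."""
    linear_predictor = sum(b * xi for b, xi in zip(coefs, x))
    return baseline_survival ** exp(linear_predictor)


if __name__ == "__main__":
    s0_5yr = 0.90          # hypothetical baseline 5-year survival
    beta = [log(2.0)]      # hypothetical coefficient: hazard ratio of 2 for the risk factor
    print("without risk factor:", cox_survival_prob(s0_5yr, beta, [0.0]))
    print("with risk factor:   ", round(cox_survival_prob(s0_5yr, beta, [1.0]), 3))
```

With these made-up inputs, a hazard ratio of 2 turns a 90% baseline survival into 0.9 squared, or 81%.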

Estimation of individual variable effects, inference, and prediction are not as easily separated as you make it sound.

The distinction that is more applicable is that there are ML methods that do not concern themselves with interpretable parameters and don’t attempt to separate the effects of predictors. Random forest is an example. It is useful for prediction but not so much for isolating the effect of a “special” variable such as treatment or a single risk factor. With statistical models you can “have it all” but with one large caveat: the statistical model needs to have complexity that is as deep as the real relationships in the data generating process. For example, a neural network may get predictive strength from second-order interactions among predictors, but if a regression model omits interactions it will underpredict in that case.
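
The caveat about omitted interactions can be seen in a toy case. For a balanced 2x2 layout, the least-squares main-effects-only fit for each cell equals row mean + column mean - grand mean, so it cannot reproduce an outcome that is elevated only when both predictors are present. The cell means below are made-up numbers on an arbitrary outcome scale.

```python
def additive_fit(cell_means):
    """Least-squares main-effects-only fit for a balanced 2x2 layout.

    cell_means[i][j] is the mean outcome when x1 = i and x2 = j.
    """
    grand = sum(sum(row) for row in cell_means) / 4
    row_means = [sum(row) / 2 for row in cell_means]
    col_means = [(cell_means[0][j] + cell_means[1][j]) / 2 for j in range(2)]
    return [[row_means[i] + col_means[j] - grand for j in range(2)] for i in range(2)]


if __name__ == "__main__":
    synergy = [[1.0, 1.0], [1.0, 5.0]]   # outcome elevated only when both are present
    print("truth:       ", synergy)
    print("additive fit:", additive_fit(synergy))
```

Here the additive fit gives 4.0 for the cell whose true mean is 5.0 and is off in the other cells as well; adding the x1*x2 interaction term would make the model saturated and reproduce all four cell means.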

Side issue: you mentioned data splitting. This approach to model validation has many problems unless the original dataset is enormous (say > 20,000 patients). Good model validation procedures apply equally to ML and statistical models, with more iterations needed for more complex modeling/ML procedures, in order to capture things like feature selection uncertainty. For example one may require 100 repetitions of 10-fold cross-validation to obtain sufficient precision in a strong internal validation.
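
A minimal sketch of the index bookkeeping for repeated k-fold cross-validation, using only the standard library. The function name, defaults, and seed are assumptions for illustration; the modelling and scoring step is left to the caller.

```python
import random


def repeated_kfold(n, k=10, repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for every repetition
        folds = [idx[f::k] for f in range(k)]
        for f, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            yield r, f, train_idx, test_idx


if __name__ == "__main__":
    n_fits = 0
    for r, f, train_idx, test_idx in repeated_kfold(140, k=10, repeats=100):
        # Every data-driven step (feature selection, tuning, fitting) would be
        # redone here on train_idx only, then scored on test_idx.
        n_fits += 1
    print("model fits required:", n_fits)
```

The important design point is that all data-driven steps are repeated inside each training fold; otherwise the uncertainty from feature selection is not captured by the validation.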