Cynthia Rudin, who works on interpretable machine learning models, has a new paper about Leo Breiman's famous "two cultures" paper. I thought it was insightful, although I am not sure I understood the part where she argues that the Rashomon effect itself undercuts the accuracy/simplicity tradeoff. I'd love to hear thoughts about this.
Nice article. I'm in agreement with it. I'm in the small minority of statisticians who think that Breiman's article has not aged well. I say that mainly because his idea of a data model is quite narrow, and he did not foresee how flexible data models would become. Think of a semiparametric Bayesian regression model that assumes nothing about the distribution of Y, has standard (or no) priors for the main predictors, horseshoe priors for 2000 candidate protein signatures, and Gaussian shrinkage priors for all 2-way interactions among the main predictors. You can get a highly efficient analysis from models such as this, pinpoint the insertion of prior information, get full statistical inference, and still be able to interpret everything. You can also handle left-, right-, and interval-censored Y and missing data. Also, Breiman's opinion that statisticians did not care about predictive accuracy had not been true since the mid-1980s.
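A toy illustration of the two shrinkage structures mentioned above (prior draws only, in plain numpy, with an arbitrary global scale; this is not the actual model, just a sketch of why the horseshoe suits a long list of candidate signatures): the horseshoe's half-Cauchy local scales put most prior mass very near zero yet keep heavy tails so a few real signals can escape shrinkage, while a single common Gaussian scale shrinks every coefficient uniformly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Horseshoe prior: beta = tau * lambda * z, with half-Cauchy(0, 1) local scales.
# Heavy tails let a handful of candidate signatures escape shrinkage,
# while most coefficients are pulled hard toward zero.
tau = 0.1                                   # global shrinkage scale (fixed here purely for illustration)
lam = np.abs(rng.standard_cauchy(n_draws))  # local scales
beta_hs = tau * lam * rng.standard_normal(n_draws)

# Gaussian shrinkage prior (as described for the 2-way interactions):
# one common scale, so every coefficient is shrunk by the same amount.
beta_g = rng.normal(0.0, tau, n_draws)

# The horseshoe puts MORE mass near zero AND has far heavier tails.
print("P(|beta| < 0.01):", np.mean(np.abs(beta_hs) < 0.01), "vs", np.mean(np.abs(beta_g) < 0.01))
print("P(|beta| > 1):   ", np.mean(np.abs(beta_hs) > 1), "vs", np.mean(np.abs(beta_g) > 1))
```

The point of the comparison: a flat Gaussian scale cannot simultaneously suppress 2000 mostly-null signatures and leave a few large effects alone, whereas the horseshoe's per-coefficient scales can.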
This is a beautiful paper - thank you for sharing! It expands on a very useful distinction between “noisy” and “non-noisy” problems that I first encountered in this 2017 article by @f2harrell and have been using ever since (see also here).
The argument about the incompatibility between the accuracy/simplicity trade-off and the Rashomon effect appears to come from complexity theory. It boils down to the idea that the larger the Rashomon effect (i.e., the greater the diversity of models that explain the same data nearly equally well), the more likely it becomes that simple models are included among them.
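A minimal numpy sketch of that idea (toy data and arbitrary tolerances, not anything taken from the paper): treat polynomial degree as a crude measure of complexity, define the Rashomon set at tolerance eps as all models whose loss is within eps of the best, and watch the simplest model join the set as eps grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a noisy linear truth, so many model classes fit near-equally well.
n = 200
x = rng.uniform(-1, 1, n)
y = 1.5 * x + rng.normal(0, 0.5, n)

# Candidate models: polynomials of increasing degree (degree = "complexity").
def fit_loss(deg):
    X = np.vander(x, deg + 1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ coef - y) ** 2)

losses = np.array([fit_loss(d) for d in range(1, 9)])
best = losses.min()

# Rashomon set at tolerance eps: all models within eps of the best loss.
# A larger eps (a "bigger" Rashomon set) is more likely to admit degree 1.
for eps in (0.001, 0.01, 0.1):
    in_set = np.where(losses <= best + eps)[0] + 1   # degrees in the set
    print(f"eps={eps}: degrees in Rashomon set = {list(in_set)}")
```

The tolerance eps plays the role of "near equally well": once the set is large enough to absorb the tiny training-loss advantage of the flexible fits, the simple linear model is a member too.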
I had several opportunities to talk to Leo Breiman. On one occasion I noted to him that he had been an undergrad physics major, and that a third culture, that of the theoretical physicist, could be added to his two. For example, the work by Planck and Einstein at the very beginning of quantum physics consisted of both of them analyzing other people's data… something that statisticians do, but Planck and Einstein did not do anything recognizably "statistical" or "algorithmic", either of which would be called "phenomenological physics" today. Rather, they were doing theoretical physics. Neither a statistical data modeler nor a machine learner would have come up with energy quantization from the data that were given, but Planck and Einstein did. (Since then I've written at greater length about this elsewhere.) Leo didn't disagree with me, but he didn't seem to think I was making an interesting point, maybe because for him it was too obvious to bother making, or because physics was too different from the empirical fields that statisticians or machine learners were asked to work in back then. My conversation with Leo did not extend to the Rashomon issue, but it can occur in theoretical physics as well, and in well-understood applications (many in optics, for example, where you can model light as a ray, a wave, or a photon) it's straightforward to cope. (In string theory, there is currently no way to cope.)
Incidentally Rudin is scheduled to give a talk on Rashomon sets at JSM Tuesday morning, for those attending.
With regard to Frank's comment: I studied for my master's degree in 2000-2001 and was taught in the manner Breiman criticized. I was taught to assess model "goodness of fit" internally, not with methods such as cross-validation, the bootstrap, or data splitting. There were statisticians who had cared about such things since the mid-1980s (David Freedman, for example), but not the majority, and certainly not those who gave me my degree. In fact, Frank's RMS helped me realize just how flawed my "official" education had been. So, Frank, your last sentence might be correct in a literal sense, but it is still unfair to your own work, because people like you and Julian Faraway were discussing these things in your books at a time when other stats textbooks neglected them.
Good paper. I also enjoyed an earlier paper from years ago, "To Explain or to Predict?". It's good that there are people in the ML community raising these important points.
Sec 2.3 states:
In practice, "non-noisy" problems are completely different from "noisy" problems. Non-noisy problems have deterministic labels: given x, y is not random. Noisy problems have random y given x. Many of Breiman's problems were noisy data problems. But many machine learning researchers worked extensively on non-noisy problems. Computer vision problems, which were being studied extensively by many machine learning researchers, are not noisy – if an image x contains a chair (the label y is "chair"), it always contains a chair. I would venture to say that these two problem classes define two unnamed subfields of machine learning, because everything about how we approach the two classes of problems is different.
ML and AI have had their greatest successes in high signal:noise situations, e.g., visual and sound recognition, language translation, and playing games with concrete rules. What distinguishes these is quick feedback while training, and availability of the answer. Things are different in the low signal:noise world of medical diagnosis and human outcomes. A great use of ML is in pattern recognition to mimic radiologists’ expert image interpretations. For estimating the probability of a positive biopsy given symptoms, signs, risk factors, and demographics, not so much.
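The contrast can be made concrete with a small simulation (illustrative only; the noise rate and thresholds are made up): a crude 1-nearest-neighbour classifier nearly saturates accuracy when y is a deterministic function of x, the "chair in the image" situation, but is capped well below perfection when y is random given x, as in diagnosis-like problems where even the Bayes-optimal rule is often wrong.

```python
import numpy as np

rng = np.random.default_rng(2)

def one_nn_accuracy(x_tr, y_tr, x_te, y_te):
    """Plain 1-nearest-neighbour classifier: the crudest flexible learner."""
    idx = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
    return np.mean(y_tr[idx] == y_te)

n = 2000
x_tr, x_te = rng.uniform(0, 1, n), rng.uniform(0, 1, n)

# Non-noisy problem: the label is a deterministic function of x
# (like "does this image contain a chair").
y_tr_det = (x_tr > 0.5).astype(int)
y_te_det = (x_te > 0.5).astype(int)

# Noisy problem: y is random given x; even the Bayes-optimal rule
# is wrong 20% of the time (like predicting a positive biopsy).
p_tr = np.where(x_tr > 0.5, 0.8, 0.2)
p_te = np.where(x_te > 0.5, 0.8, 0.2)
y_tr_noisy = (rng.uniform(size=n) < p_tr).astype(int)
y_te_noisy = (rng.uniform(size=n) < p_te).astype(int)

print("accuracy, deterministic labels:", one_nn_accuracy(x_tr, y_tr_det, x_te, y_te_det))
print("accuracy, noisy labels:        ", one_nn_accuracy(x_tr, y_tr_noisy, x_te, y_te_noisy))
```

On the noisy problem the flexible learner does not merely hit the Bayes ceiling of 0.8 – by chasing noisy labels it lands below it, which is exactly the regime where validation discipline matters most.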
In 3.1 she writes:
Breiman claims it was a common practice of statisticians to develop models without consideration of either training or test accuracy. A machine learning scientist would never omit these calculations, but a data modeler could.
One aspect that I think is not explicitly recognized in Rudin's paper is that many of the problems statisticians work on do not have large sample sizes, which precluded the test/train split that is ubiquitous in ML: who would want to decimate an already small dataset by splitting it? After all, cross-validation had existed since the early 1970s, so it wasn't a lack of methods that prevented these practices, but rather the fact that a holdout split was impractical at small sample sizes.
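For concreteness, a plain-numpy sketch of the k-fold idea on a deliberately small dataset (toy data, not any specific historical implementation): every observation is used both for fitting and, exactly once, for validation, which is why CV was the natural choice when n was too small to sacrifice to a one-off holdout split.

```python
import numpy as np

rng = np.random.default_rng(3)

# A small dataset of the kind statisticians typically faced.
n, p = 60, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, 0.0])
y = X @ beta + rng.normal(0, 1, n)

def kfold_mse(X, y, k=10):
    """K-fold CV for least squares: each observation serves in a
    validation fold exactly once, so no data are permanently held out."""
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(len(y)), f)
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((X[f] @ coef - y[f]) ** 2))
    return np.mean(errs)

print("10-fold CV estimate of out-of-sample MSE:", round(kfold_mse(X, y), 3))
```

With n = 60, a 50/50 holdout would leave 30 observations for fitting a 3-predictor model; the CV estimate above fits on 54 points in every fold and still scores all 60 out of sample.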
I don’t think Breiman was right even there. Many of us were doing unbiased validations of model performance measures at the time Breiman wrote his famous paper.