I’m new here and still learning how things work. Sorry in advance if I make any mistakes.
Every dataset I have worked with has contained outliers. My goal is usually to examine the association between independent and dependent variables. Because these outliers come from real measurements in the medical field, I chose not to remove them, and for that reason I preferred robust regression analysis over ordinary multiple regression analysis.
My questions are: Are all outliers problematic for multiple regression analysis? If not, what criteria should be used to decide? And if they are problematic, should robust regression always be the preferred approach?
There was an older thread that asked a very similar question. I pointed out there that there are broadly three options, with the important caveat that the plan for dealing with outliers needs to be specified before data collection.
I find nothing wrong with using robust methods. One consideration that has limited their use is Thomas O’Gorman’s [1,2] observation that:
…M-estimators and R-estimators, which tend to downweight observations that are outliers, appear to be rarely used in practice. One problem with these robust estimators is that there are too many robust approaches, functions, and parameters combined with too little guidance on their correct choice [1, p.12].
He developed adaptive testing and estimation procedures that peek at the data in order to adjust the weights of the observations (much as would be done in L-, M-, or R-estimation), so that the model for the assumed data-generation process more closely matches the actual data without causing \alpha to increase.
He accomplishes this via a permutation method on the residuals, which can be proven to maintain the \alpha level. This yields a valid procedure from a frequentist point of view.
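As a rough illustration of why permuting residuals under the null model controls \alpha, here is a bare-bones permutation test of a regression slope. This is a simplified sketch, not O’Gorman’s adaptive weighting procedure; the sample size, effect size, and error distribution are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=n)
y = 1.5 * x + rng.standard_t(df=3, size=n)  # heavy-tailed errors

def slope(x, y):
    xc = x - x.mean()
    return xc @ (y - y.mean()) / (xc @ xc)

obs = slope(x, y)

# Under H0 (slope = 0) the reduced model is intercept-only, so its
# residuals are simply y - mean(y). Permuting those residuals and
# refitting builds the null distribution of the slope; the test keeps
# its nominal alpha level regardless of the error distribution.
resid0 = y - y.mean()
perm = np.array([slope(x, y.mean() + rng.permutation(resid0))
                 for _ in range(2000)])
p_value = (1 + np.sum(np.abs(perm) >= abs(obs))) / (2000 + 1)
print("estimated slope:", obs, "permutation p-value:", p_value)
```

The same residual-permutation idea carries over to multiple regression by permuting the residuals of the reduced model that omits the tested predictor.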
In the scenario where the researcher’s statistical model (a.k.a. the null or reference model) for the data-generation process is a good fit, little adjustment takes place, and the procedure is very close to the standard textbook methods. In a scenario where the data are more likely to come from some mixture of distributions, the data points that appear unlikely under the investigator’s statistical model for the experiment are given less weight, which increases power.
It would be unwise to uncritically use adaptive methods without considering whether the weighting method is hiding something more serious about the conduct of the study and the credibility of the collected data. At the very least, you might discover a factor to control for in future studies that you did not consider in the initial design of the research plan that led to the collection of a particular data set.
Before even collecting any data, I would perform simulation studies comparing how the procedure performs when the model is grossly in error versus when there are only tolerable deviations from the assumptions. You would be able to see the frequency and size of the weighting adjustments under a range of scenarios, in order to quantify your uncertainty in the model, the data, and the analysis.
Addendum: The original poster asks below:
In addition to O’Gorman’s observation I mentioned above, the short, less charitable answer is: much of the literature is littered with bad statistical methods permitted by subject matter experts with meager statistical training.
As a statistician, I have been in the situation where I was a co-author and we submitted to a Q2 journal. The (clinical) reviewer asked us to commit statistical atrocities, even though I explained why each was a bad idea (with references). The reviewer insisted (e.g., literally “please show me p-values for the normality tests”) without any rationale or counterarguments. I had to comply so as not to waste everybody’s time.
My current statistical knowledge mostly comes from hanging out here (since about 2019!) and trying to reverse engineer the reasoning behind Frank’s recommendations.
There was another older thread on multiple linear regression where I made the following observation:
It is hard to beat the simplicity of Dr. Harrell’s recommendation of semi-parametric modelling as a general rule. I’m pretty surprised at how much of applied stats can be compressed into proportional odds models. This strikes me as something only someone with decades of applied stats experience would figure out.
I merely emphasize O’Gorman’s work on adaptive testing and estimation because it explicitly demonstrates sound statistical thinking, especially in contexts where naive parametric modelling dominates. There have been threads where even the proper use of proportional odds models has come under excessive scrutiny from subject matter experts. For example:
Uddin M, Bashir NZ, Kahan BC. Evaluating whether the proportional odds models to analyse ordinal outcomes in COVID-19 clinical trials is providing clinically interpretable treatment effects: A systematic review. Clinical Trials. 2023;21(3):363-370. doi:10.1177/17407745231211272
You can see an analysis by @f2harrell of this misplaced criticism here:
I particularly appreciate O’Gorman’s proof in [1], where he demonstrates that the \alpha level is maintained in an adaptive test, via a reference to the nonparametric Hogg, Fisher, and Randles procedure.
Ultimately, serious consideration needs to be given to whether a parametric or semi-parametric approach should be taken to the data. For many clinical outcome scenarios, the proportional odds model is the preferred approach, despite the common use of parametric methods in the literature.
For scenarios where parametric modelling is reasonable and expected, O’Gorman’s adaptive procedures are worth considering, even if the proportional odds model would give a reasonable answer, simply because:
It will be easier to defend to reviewers, and
There will be a small power advantage for parametric models over semi-parametric models when the parametric assumptions are justified.
References
O’Gorman, T. W. (2004). Applied adaptive statistical methods: Tests of significance and confidence intervals. Society for Industrial and Applied Mathematics.
O’Gorman, T. W. (2012). Adaptive tests of significance using permutations of residuals with R and SAS. John Wiley & Sons.
Thank you for the detailed answer. I used the SMDM estimator, as it has been reported to perform better in cases with a small n/p ratio, via the robustbase package in R (Koller and Stahel). The reason I ask is that I almost never come across robust regression analysis in articles in my field. That made me start to wonder whether I’m missing something about outliers: are outliers acceptable in multiple regression under certain conditions? Perhaps that’s the better question.
Thanks again for your help and guidance.
Koller, M., & Stahel, W. A. (2011). Sharpening Wald-type inference in robust regression for small samples. Computational Statistics & Data Analysis.
In addition to the excellent advice from @R_cubed, consider semiparametric ordinal regression, for which several resources may be found here. Ordinal regression models such as the proportional odds and proportional hazards models are very efficient and allow you to estimate exceedance probabilities (and survival curves), means, quantiles, and pseudomedians. They generalize the Wilcoxon and log-rank tests and easily accommodate detection limits and general censoring.
Thanks for the suggestion @f2harrell. Clearly, the method and thinking you suggest are beyond my level of statistical knowledge. I’ll look into it further.
What is your opinion on the limits of multiple regression analysis in the presence of outliers? Can multiple regression still be used under certain conditions?
Regular parametric regression models can be ruined by a single high-leverage observation, depending on the overall sample size. And note that ordinal models are not only Y-robust; they are also transformation-invariant up to the point of post-fit estimation of the mean. Classic robust regression, like classic linear models, is highly Y-transformation-dependent.
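A tiny numeric illustration of the first point, with an invented dataset: a single point that is extreme in x and discordant in y is enough to flip the sign of the OLS slope.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def ols_slope(x, y):
    xc = x - x.mean()
    return xc @ (y - y.mean()) / (xc @ xc)

slope_clean = ols_slope(x, y)

# Append one high-leverage point: extreme in x, discordant in y.
# Its squared distance from the x-mean dominates the denominator,
# so it dominates the fit.
x_bad = np.append(x, 15.0)
y_bad = np.append(y, -20.0)
slope_bad = ols_slope(x_bad, y_bad)

print("slope on clean data:          ", slope_clean)
print("slope with one leverage point:", slope_bad)
```

Note that because this point is extreme in x rather than in y, downweighting based on residuals alone may not fully fix it either; leverage diagnostics (hat values) are the standard check.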
Thanks again @f2harrell. I asked the same question here and received some comments and an answer. I’m sharing it in case it helps the community or generates further discussion.