I am analyzing a dataset where the goal is to find the nature of association between lead concentration in drinking water (exposure) and child cognition (continuous outcome).
The lead measurements come from three different cities across the US, with uneven numbers of participants: City A has 100 participants, City B has 1,000, and City C has 300.
Experts feel that the lead concentration in City B is unusually high, and box plots confirm that the concentrations from this city are indeed elevated.
I am interested in identifying outliers and performing the analysis with and without them. How do I detect outliers when the sample comes from three distinct groups (residents of three different cities)? Any suggestions are much appreciated. Thanks as always.
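One common starting point for the grouped setting described above is to screen within each city rather than across the pooled sample, so that City B's higher overall level does not make all of its values look extreme. Here is a minimal sketch using per-group Tukey (IQR) fences; the data are entirely synthetic stand-ins for the three cities, and the log-normal shapes and parameters are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lead concentrations per city; City B is shifted higher,
# mirroring the situation described in the question. These are NOT real data.
data = {
    "A": rng.lognormal(mean=1.0, sigma=0.4, size=100),
    "B": rng.lognormal(mean=1.6, sigma=0.4, size=1000),
    "C": rng.lognormal(mean=1.1, sigma=0.4, size=300),
}

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] within one group."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (x < lo) | (x > hi)

# Screen each city separately, so each group is judged against its own spread.
flags = {city: iqr_outliers(x) for city, x in data.items()}
for city, f in flags.items():
    print(city, int(f.sum()), "flagged of", f.size)
```

Note that the IQR rule assumes roughly symmetric data; for right-skewed concentrations it can over-flag large values, which is one reason the answers below favor model-based residual diagnostics instead.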
I think you should distinguish “statistical outliers” from “factual outliers”.
Factual outliers are values that are extremely unrealistic or even impossible (physically, chemically, biologically). It should usually be sufficient to look at the extreme values to spot these (rarely, particular combinations of values are impossible, which requires a multivariate analysis to identify).
Statistical outliers are values that are just far away from their expected value. They are identified most simply in a residual analysis: first fit your model, then inspect some residual diagnostic plots. Such plots also allow you to judge whether your distributional assumptions are reasonable. Under unreasonable distributional assumptions, it is more likely that you will find numerous outliers (if you assume a normal distribution but the actual distribution is closer to log-normal or gamma, you are likely to see "too large" values). But if you see that the assumptions are reasonable and there are still outliers, you can also assess their impact on the model (leverage). If the leverage is small, there is actually no need to take any further action. If the leverage is large, it may be worth investigating whether there is some yet unrecognized reason why this particular value is outlying. If such a reason can be identified, the value should be excluded. Otherwise it might simply be a value trying to tell you that the rest of the data would underestimate the actual variance.
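The residual-plus-leverage workflow above can be sketched with plain numpy. This is a toy illustration, not the poster's actual analysis: the exposure/outcome data are simulated, one artificial outlier is planted, and internally studentized residuals and hat-matrix leverages are computed by hand for a simple OLS fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a lead-like exposure x and a cognition-like outcome y.
n = 60
x = rng.normal(10, 3, size=n)
y = 100 - 1.5 * x + rng.normal(0, 5, size=n)
y[0] += 40  # plant one artificial outlier to illustrate the diagnostics

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit
resid = y - X @ beta

# Diagonal of the hat matrix H = X (X'X)^{-1} X' gives each point's leverage.
H = X @ np.linalg.inv(X.T @ X) @ X.T
lev = np.diag(H)

# Internally studentized residuals: residual scaled by its estimated sd.
dof = n - X.shape[1]
s2 = resid @ resid / dof
t = resid / np.sqrt(s2 * (1 - lev))

print("max |studentized residual| at index", int(np.abs(t).argmax()))
print("leverage of that observation:", lev[np.abs(t).argmax()])
```

In a real analysis you would plot `t` against fitted values and against `lev` (and a QQ-plot of `t`) rather than just look at the maximum; with city in the model you would include city indicator columns in `X`.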
I am reserved about excluding statistical outliers. Outliers should be rare. If they are rare, then they won't meaningfully impact your model or conclusions. If they are not rare, then either your data collection process has some severe problems or your model/assumptions are severely flawed. In either case I wouldn't trust the conclusions, no matter whether or not the outliers are used.
I concur with Jochen and would almost never exclude an apparent "outlier" unless it can be demonstrated to be an erroneous value/data entry. I like to use methods that are robust to extremes. If a value is erroneous, it's often best to set it to missing and use multiple imputation; it is seldom good to remove the whole observation.
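As one concrete example of a method that is robust to extremes, here is a sketch of a Huber M-estimator fit by iteratively reweighted least squares, implemented by hand in numpy on simulated data with a few planted gross errors. The data, tuning constant, and iteration count are all assumptions for illustration; in practice one would typically use an established robust-regression routine rather than roll one's own:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data with true intercept 2 and slope 3, plus a few gross errors.
n = 80
x = rng.normal(0, 1, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, n)
y[:4] += 25  # four planted gross errors

X = np.column_stack([np.ones(n), x])

def huber_fit(X, y, c=1.345, iters=50):
    """Huber M-estimate via iteratively reweighted least squares (IRLS)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # start from the OLS fit
    for _ in range(iters):
        r = y - X @ beta
        # Robust scale estimate from the median absolute deviation (MAD).
        s = np.median(np.abs(r - np.median(r))) / 0.6745
        u = r / (c * s)
        # Huber weights: 1 inside the threshold, downweighted outside it.
        w = np.where(np.abs(u) <= 1, 1.0, 1.0 / np.abs(u))
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
b_rob = huber_fit(X, y)
print("OLS intercept/slope:", b_ols, "robust intercept/slope:", b_rob)
```

The point of the sketch is that the robust fit downweights the planted errors instead of deleting them, which is in the spirit of the advice above: keep the observations, but limit their pull on the estimates.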
@Jochen, @f2harrell, thanks Jochen, Frank. I will incorporate your advice. Thanks again.