Adding random observations to a sample to assess sample quality via outlier detection

Hello,

I am fairly new to the forum, so I hope the title is acceptable. I wanted to know whether there is an existing technique or literature where you add randomly generated observations (generated under constraints) to your sample (say, a survey on car perceptions) and then run outlier techniques to see whether they detect those added cases rather than your original sample observations.

The reason I want to do this is that I know very little about the population; if people answered the survey poorly, outlier techniques should not be able to differentiate between real observations and "generated" observations.

I hope this is clear enough; can someone point me to relevant sources or literature?

Thanks!

You might want to look up the rather extensive literature on robust statistics. To do this in a mathematically correct way, the experts in this area have typically modeled outliers with mixture distributions, e.g. $p\,\mathcal{N}(\mu, \sigma^2) + (1 - p)\,\mathcal{N}(\mu, 10\,\sigma^2)$. You could then run your outlier tests on a synthetic data set where you are certain of the parameter values. John Tukey, Frank Hampel, and Peter Huber did the foundational work in this area.
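
To make that concrete, here is a minimal sketch (my illustration, not from the robust-statistics literature itself; the contamination proportion, the variance inflation factor, and the 3-MAD cutoff are all assumed values):

```python
# Sketch: simulate from the contamination mixture p*N(0,1) + (1-p)*N(0,10)
# and check whether a simple MAD-based outlier rule recovers the contamination.
# All parameter values (p, variance inflation, cutoff) are illustrative.
import numpy as np
from scipy.stats import median_abs_deviation

rng = np.random.default_rng(42)
n, p = 1000, 0.9                         # p = proportion of clean observations
clean = rng.random(n) < p                # membership is known by construction
x = np.where(clean,
             rng.normal(0, 1, n),                # N(mu, Var)
             rng.normal(0, np.sqrt(10), n))      # N(mu, 10 * Var)

# Robust z-scores: the median and MAD resist the contamination itself.
z = np.abs(x - np.median(x)) / median_abs_deviation(x, scale='normal')
flagged = z > 3                          # common, but arbitrary, cutoff

print('contaminated points flagged:', flagged[~clean].mean().round(3))
print('clean points flagged:       ', flagged[clean].mean().round(3))
```

Because the median and MAD are themselves resistant to the contamination, the false-positive rate on the clean points stays near its nominal level even when the contaminated points inflate the ordinary sample variance.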

The other scenario is to use inference techniques that are not sensitive to outliers. Classical nonparametrics would fall into this category.

A Bayesian method would specify a prior probability on the possible distributions outliers could come from.
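
As a rough sketch of that Bayesian idea (my illustration, assuming the two-component normal mixture from above with known component variances, and a Beta prior on the proportion of clean observations):

```python
# Sketch: Beta prior on the clean proportion p of the mixture
# p*N(0,1) + (1-p)*N(0,10), posterior computed on a simple grid.
# The mixture components and the prior are illustrative assumptions.
import numpy as np
from scipy.stats import norm, beta

rng = np.random.default_rng(1)
n, true_p = 500, 0.9
x = np.where(rng.random(n) < true_p,
             rng.normal(0, 1, n),
             rng.normal(0, np.sqrt(10), n))

grid = np.linspace(0.01, 0.99, 99)               # candidate values of p
log_prior = beta.logpdf(grid, 2, 2)              # mildly informative prior
log_lik = np.array([np.sum(np.log(g * norm.pdf(x, 0, 1) +
                                  (1 - g) * norm.pdf(x, 0, np.sqrt(10))))
                    for g in grid])
post = np.exp(log_prior + log_lik - (log_prior + log_lik).max())
post /= post.sum()

print('posterior mean of p:', np.sum(grid * post).round(3))
```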

Thanks for the reply! I am aware of some robust statistical techniques, which, as you say, will often be able to deal with influential outliers (those that specifically affect your model fit, standardized betas, etc.).

I have wanted to look at some semiparametric techniques, but yes, nonparametric techniques remain valuable.

I know too little about priors to make use of Bayesian methods at this stage.

In general, my lack of confidence concerns the actual sample responses: I want to know how good those are. I can fit a model and see whether I get results, but that does not answer my question about the quality of the sample. If the responses adhere to most of the assumptions required for parametric testing, then it is just that: I am able to run certain techniques, and in some cases prior literature or work should tell me what sort of relationships I should likely see.

However, I want to create scenarios where the sample is deliberately contaminated with false observations, to see how well existing techniques can differentiate between the actual sample and the false one. I understand that generating values uniformly at random between the possible min and max per variable will produce observations that could have been real (i.e., they won't necessarily be picked up by Mahalanobis distance, or by robust Mahalanobis in the case where the suspected underlying distribution is not normal), so I do want some idea of proportions. Thinking about this further, I wonder how much this "experiment" will end up testing the outlier technique rather than actually "verifying" the sample. That makes me wonder whether I should instead look at split-half reliability and similar techniques to gauge the sample, in addition to Mahalanobis and other appropriate outlier techniques.
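
To make the experiment concrete, here is a minimal sketch of it (my own illustration; the multivariate-normal stand-in for the real sample, the sample sizes, and the 97.5% chi-square cutoff are all assumptions):

```python
# Sketch of the proposed experiment: append uniformly generated "fake"
# respondents (within each variable's observed min/max) to a stand-in sample,
# then check how many the robust Mahalanobis distance flags as outliers.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(7)
n_real, n_fake, d = 400, 40, 5

# Stand-in for the real survey sample: correlated multivariate normal.
cov = 0.6 * np.ones((d, d)) + 0.4 * np.eye(d)
real = rng.multivariate_normal(np.zeros(d), cov, size=n_real)

# Fakes: independent uniforms within each variable's observed range, so each
# value is individually plausible but the correlation structure is lost.
fake = rng.uniform(real.min(axis=0), real.max(axis=0), size=(n_fake, d))

X = np.vstack([real, fake])
is_fake = np.r_[np.zeros(n_real, bool), np.ones(n_fake, bool)]

# Robust Mahalanobis distances via the Minimum Covariance Determinant fit.
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                      # squared distances
flagged = d2 > chi2.ppf(0.975, df=d)         # usual chi-square cutoff

print('fakes flagged:', flagged[is_fake].mean().round(3))
print('real flagged :', flagged[~is_fake].mean().round(3))
```

One thing the sketch suggests: uniform fakes are individually plausible on each variable, but they ignore the correlation structure, and that is precisely what Mahalanobis distance measures, so detection rates can be higher than the min-max reasoning alone would predict.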

This is just a thought, but if you have an idea of what the distribution of erroneous data looks like, and a model of what valid data looks like, can't you use likelihood methods to assess the validity of your sample, i.e. $\frac{p(X = x \mid \Theta_{valid})}{p(X = x \mid \Theta_{error})}$?

The likelihood is the data-based complement to a Bayesian prior. So even if you don't know exactly what your prior is, you might know that this LR would move a skeptical prior to a position of modest advocacy.

Applying this could be somewhat computationally complex, though.
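
For what it's worth, here is a minimal univariate sketch of that likelihood-ratio idea (my illustration; taking the "valid" model to be a normal fitted to the data and the "error" model to be a uniform over the observed range is an assumption, not a recommendation):

```python
# Sketch: per-observation likelihood ratio p(x | valid) / p(x | error).
# Assumed models (illustrative): valid = fitted univariate normal,
# error = uniform over the observed range.
import numpy as np
from scipy.stats import norm, uniform

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(50, 5, 200),      # plausible responses
                    rng.uniform(20, 80, 20)])    # erroneous responses

mu, sigma = x.mean(), x.std(ddof=1)              # fit the "valid" model
lo, scale = x.min(), x.max() - x.min()           # fit the "error" model

log_lr = norm.logpdf(x, mu, sigma) - uniform.logpdf(x, lo, scale)

# Observations with log LR < 0 look more like the error model than the valid one.
print('suspect observations:', np.sum(log_lr < 0))
print('total log LR for the sample:', log_lr.sum().round(1))
```

A caveat on the sketch: fitting the valid model to the contaminated data inflates its scale estimate, so in practice one would want robust or clean-subset estimates of $\Theta_{valid}$.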

When the variable in question is the dependent variable, semiparametric ordinal models are hard to beat, whether Bayesian or frequentist.
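
In case a concrete starting point helps, here is a minimal frequentist sketch (my illustration, using statsmodels' proportional-odds OrderedModel on simulated Likert-type data; the data-generating process is made up):

```python
# Sketch: frequentist proportional-odds model on simulated 5-point ratings.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# Latent-variable construction: logistic noise yields a proportional-odds model.
latent = 0.8 * x1 - 0.5 * x2 + rng.logistic(size=n)
rating = pd.cut(latent, bins=[-np.inf, -1, 0, 1, 2, np.inf],
                labels=[1, 2, 3, 4, 5]).astype(int)

model = OrderedModel(rating, np.column_stack([x1, x2]), distr='logit')
res = model.fit(method='bfgs', disp=False)
print(res.summary())
```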