Using MICE to impute missing data across join of two data sets

woodwards · August 30, 2022, 4:14am

Hi all, first time poster here. I’m a data scientist working in the New Zealand dairy industry.

I have two data sets. The first has about 140,000 rows with region, year and annual_quantity. The second has about 1,400 rows with region, year, annual_quantity, and a range of inputs by month. In both, each row is one business in one year. I do not have business IDs to match rows between the data sets.

I want to extrapolate the second data set to infer/impute monthly inputs across the entire first data set. I have to prepare a report of monthly inputs for the whole first data set.

What I have done so far is build a combined dataframe from both data sets (about 85% missing values) and use the MICE package in R to impute the missing values. The idea is that region x year x annual_quantity can be used to model the monthly inputs, and that the iterative nature of MICE will eventually find a stable relationship.

I have tried a range of methods in MICE. Many of them crash (due to a singular matrix, I think), but random forest (rf) at least runs and gives answers. Currently I am using 5 reps and 20 iterations, but this doesn’t quite converge, so I’m going to try 80 iterations tonight.

I also need to test somehow whether the Missing At Random (MAR) assumption is too badly violated. How would I do this?
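One note on this question: MAR cannot be confirmed from the observed data alone, because it is an assumption about the values that are missing. What can be checked is whether the small data set resembles the large one on the variables both share. Below is a minimal Python sketch of that check, using a standardized mean difference on annual_quantity. The numbers are made up and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd


# Made-up annual_quantity values, for illustration only.
large_set = [100.0, 120.0, 140.0, 160.0, 180.0]
small_set = [150.0, 170.0, 190.0, 210.0, 230.0]

smd = standardized_mean_difference(small_set, large_set)
print(f"standardized mean difference: {smd:.2f}")
```

A value far from zero, especially if it persists within region and year, would suggest the small data set is not a representative slice of the large one. In that case the imputation model needs to condition on those shared variables.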

Any suggestions on how to do this kind of imputation? Thanks so much for your help!

woodwards · August 31, 2022, 1:49am

Also, does rf (random forest) always give predictions within the bounds of the data, or can it extrapolate (linearly?) beyond the range of the data?
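On the mechanics: a regression tree predicts the mean of the training targets in a leaf, and a forest averages its trees, so a prediction cannot fall outside the range of the observed targets. The minimal Python sketch below uses a hand-rolled, depth-limited tree and a bagged forest to show predictions flattening out beyond the training range. It is not the mice implementation, and all names are hypothetical.

```python
import random
from statistics import mean


def fit_tree(points, depth=3):
    """Fit a 1-D regression tree; points is a list of (x, y) pairs."""
    ys = [y for _, y in points]
    xs = sorted({x for x, _ in points})
    if depth == 0 or len(xs) < 2:
        return mean(ys)  # leaf: mean of the training targets
    best = None
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [p for p in points if p[0] <= cut]
        right = [p for p in points if p[0] > cut]
        # Sum of squared errors around each side's own mean.
        sse = sum((y - mean([q[1] for q in side])) ** 2
                  for side in (left, right) for _, y in side)
        if best is None or sse < best[0]:
            best = (sse, cut, left, right)
    _, cut, left, right = best
    return (cut, fit_tree(left, depth - 1), fit_tree(right, depth - 1))


def predict_tree(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        cut, left, right = tree
        tree = left if x <= cut else right
    return tree


def fit_forest(points, n_trees=25, seed=0):
    """Bagging: each tree is fitted to a bootstrap resample of the data."""
    rng = random.Random(seed)
    return [fit_tree([rng.choice(points) for _ in points])
            for _ in range(n_trees)]


def predict_forest(forest, x):
    return mean(predict_tree(t, x) for t in forest)


# Linear toy data: y = 2x for x = 0..10, so y lies in [0, 20].
train = [(float(x), 2.0 * x) for x in range(11)]
forest = fit_forest(train)

inside = predict_forest(forest, 5.0)
outside = predict_forest(forest, 100.0)  # a linear model would give 200
print(inside, outside)
```

Even at x = 100 the forest's prediction stays at or below 20, the largest target it saw, whereas a linear model fitted to the same data would return 200.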

f2harrell · August 31, 2022, 12:09pm

Random forests are famous for extreme overfitting, which in your case translates to over-imputing. Not sure I would trust it. It always gives a “result” but that may not be a good “answer”.

woodwards · August 31, 2022, 6:48pm

True, but it’s the only method that is not crashing.

f2harrell · August 31, 2022, 7:37pm

I would not take ‘not crashing’ as a signal of validity.

woodwards · August 31, 2022, 8:57pm

How would you suggest I proceed?

woodwards · September 1, 2022, 3:30am

Problem solved: there was too much variation in scale between my data columns, which was causing a numerical error. Once I rescaled them, I was able to use the pmm method.
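For anyone landing here later, here is a minimal Python sketch of the two ideas involved. The first is putting columns on a common scale before modelling. The second is predictive mean matching (PMM), which fills each gap with an observed value from a donor whose predicted value is close. This is a simplified single-predictor version (a least-squares line with no parameter draw), not what mice does internally. The data and function names are made up.

```python
import random
from statistics import mean, pstdev


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y using donors with the closest predicted value."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in obs)
    my = mean(yi for _, yi in obs)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in obs)
             / sum((xi - mx) ** 2 for xi, _ in obs))

    def predict(xi):
        return my + slope * (xi - mx)

    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            # The k observed rows whose predictions are nearest to this row's.
            donors = sorted(
                obs, key=lambda d: abs(predict(d[0]) - predict(xi)))[:k]
            yi = rng.choice(donors)[1]  # copy the donor's observed value
        filled.append(yi)
    return filled


# Made-up data: annual quantity on a large scale, one monthly input with gaps.
annual_quantity = [120000.0, 250000.0, 310000.0, 90000.0, 400000.0, 275000.0]
monthly_input = [1.2, 2.4, None, 0.9, 4.1, None]

x = standardize(annual_quantity)
completed = pmm_impute(x, monthly_input)
print(completed)
```

Because PMM only ever copies observed values, the imputations stay within the observed range of each column.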