Using MICE to impute missing data across join of two data sets

Hi all, first time poster here. I’m a data scientist working in the New Zealand dairy industry.

I have two data sets. The first has about 140,000 rows with region, year and annual_quantity. The second has about 1,400 rows with region, year, annual_quantity and a range of inputs by month. Each row is one business in one year. I do not have business IDs to match the rows between the data sets.

I want to extrapolate the second data set to infer/impute monthly inputs across the entire first data set. I have to prepare a report of monthly inputs for the whole first data set.

What I have done so far is build a combined data frame from both data sets (about 85% missing values) and use the MICE package in R to impute the missing values. The idea is that region x year x annual_quantity can be used to model the monthly inputs, and that the iterative nature of MICE will eventually find a stable relationship.
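For anyone following along, the stacking step described above might look roughly like the sketch below. The data are simulated, and the region names, the two monthly columns (input_jan, input_feb) and all the numbers are made up for illustration; only the column names region, year and annual_quantity come from the post. It uses the mice defaults rather than any particular method.

```r
library(mice)

set.seed(1)
regions <- c("Canterbury", "Waikato", "Southland")  # hypothetical region labels

# Hypothetical stand-in for the small data set (monthly inputs observed)
small <- data.frame(
  region = factor(sample(regions, 60, replace = TRUE), levels = regions),
  year = sample(2018:2020, 60, replace = TRUE),
  annual_quantity = rlnorm(60, meanlog = 5, sdlog = 0.5)
)
small$input_jan <- 0.10 * small$annual_quantity + rnorm(60, sd = 5)
small$input_feb <- 0.08 * small$annual_quantity + rnorm(60, sd = 5)

# Hypothetical stand-in for the big data set (monthly inputs unknown)
big <- data.frame(
  region = factor(sample(regions, 300, replace = TRUE), levels = regions),
  year = sample(2018:2020, 300, replace = TRUE),
  annual_quantity = rlnorm(300, meanlog = 5, sdlog = 0.5)
)
big$input_jan <- NA_real_
big$input_feb <- NA_real_

# Stack the two sets; the monthly columns are NA for every row from `big`
combined <- rbind(small, big)
colMeans(is.na(combined))  # share of missing values per column

# Impute with the mice defaults (predictive mean matching for numeric columns)
imp <- mice(combined, m = 5, maxit = 5, printFlag = FALSE, seed = 1)
completed <- complete(imp, 1)  # first of the five completed data sets
```

Each of the m completed data sets is a full copy of the stacked data, so a monthly report could be computed on each one and the results compared across imputations.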

I have tried a range of methods in MICE. Many of them crash (due to a singular matrix, I think), but random forest (rf) works and at least gives answers. Currently I am using 5 imputations and 20 iterations, but this doesn’t quite converge, so I’m going to try 80 iterations tonight.
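A possible way to check convergence without restarting from scratch is sketched below. It uses the nhanes example data that ships with mice purely as a stand-in for the combined data frame, and the default imputation methods rather than rf. The trace plots show the mean and standard deviation of the imputed values per iteration, and mice.mids() continues the existing chains.

```r
library(mice)

# nhanes ships with mice; it is only a stand-in for the real combined data
imp <- mice(nhanes, m = 5, maxit = 20, printFlag = FALSE, seed = 1)

# Trace plots: one line per imputation. Lines that mix freely with no trend
# suggest convergence; drifting or separated lines suggest more iterations.
print(plot(imp))

# Continue the same chains for 60 more iterations (80 in total)
# instead of rerunning the first 20
imp_more <- mice.mids(imp, maxit = 60, printFlag = FALSE)
print(plot(imp_more))

imp_more$iteration  # total number of iterations run so far
```

Continuing the chains this way means a long run can be extended in steps, with the plots inspected in between.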

I also need to test somehow whether the Missing At Random (MAR) assumption is too badly violated. How would I do this?
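As far as I know, MAR cannot be confirmed from the observed data alone, because the check would need the values that are missing. What can be checked is whether the rows with monthly inputs look like the rows without them on the shared variables. The sketch below uses simulated data with a hypothetical source column marking which data set each row came from. Differences on region, year or annual_quantity are not a problem in themselves as long as those variables are in the imputation model. What cannot be checked this way is whether the small sample differs in its monthly inputs beyond what those variables explain.

```r
set.seed(2)

# Hypothetical stand-in for the stacked data; `source` marks the origin of each row
combined <- data.frame(
  source = rep(c("small", "big"), times = c(60, 300)),
  region = factor(sample(c("A", "B", "C"), 360, replace = TRUE)),
  year = sample(2018:2020, 360, replace = TRUE),
  annual_quantity = rlnorm(360, meanlog = 5, sdlog = 0.5)
)
in_small <- combined$source == "small"

# 1. Does annual_quantity have a similar distribution in both sources?
tapply(combined$annual_quantity, combined$source, summary)
ks <- ks.test(combined$annual_quantity[in_small],
              combined$annual_quantity[!in_small])
ks$p.value

# 2. Is the regional mix similar?
region_tab <- table(combined$source, combined$region)
prop.table(region_tab, margin = 1)

# 3. Model "row has monthly inputs" on the shared variables; large, clear
#    effects mean the small sample is not a random subset of the big one
combined$has_inputs <- as.integer(in_small)
fit <- glm(has_inputs ~ region + factor(year) + log(annual_quantity),
           family = binomial, data = combined)
summary(fit)$coefficients
```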

Any suggestions on how to do this kind of imputation? Thanks so much for your help!

Also, does rf (random forest) always give predictions within the bounds of the data, or can it extrapolate (linearly?) beyond the range of the data?
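One way to check this empirically is sketched below. As I understand the rf method in mice, each imputed value is drawn from observed values ("donors") that land in the same terminal nodes, so imputations should never fall outside the observed range. The simulated data are hypothetical: y is deliberately missing for the largest x values, where a linear model would have to extrapolate. The sketch assumes the ranger package is installed, which recent mice versions appear to use by default for rf.

```r
library(mice)  # method = "rf" also needs a forest package such as ranger

set.seed(3)
n <- 300
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n)

# y is missing exactly where x is largest, so the true missing values
# lie above most of the observed ones
dat <- data.frame(x = x, y = ifelse(x > 8, NA, y))

imp <- mice(dat, m = 5, maxit = 1, method = "rf", printFlag = FALSE, seed = 3)

imputed <- unlist(imp$imp$y)          # all imputed values across the 5 imputations
observed <- dat$y[!is.na(dat$y)]

all(imputed %in% observed)            # are imputations copies of observed values?
range(observed)
range(imputed)
range(y[x > 8])                       # the true values that were hidden
```

If the first check comes back TRUE, the method is not extrapolating, which would matter if the big data set contains businesses with larger annual quantities than any in the small one.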

Random forests are famous for extreme overfitting, which in your case translates to over-imputing. Not sure I would trust it. It always gives a “result” but that may not be a good “answer”.

True but it’s the only method that is not crashing.

I would not take ‘not crashing’ as a signal of validity.

How would you suggest I proceed?

Problem solved: there was too much variation in scale between my data columns, which was causing a numerical error. Once I rescaled them, I was able to use the pmm method.
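For anyone who hits the same problem, a minimal sketch of the rescaling idea is below. The data are simulated with deliberately mismatched column scales, and the column input_jan and all numbers are hypothetical. It standardises each column, imputes with pmm, then converts back to the original units. It is one way to rescale, not necessarily the one used above.

```r
library(mice)

set.seed(4)
n <- 200

# Hypothetical data with very different column scales
dat <- data.frame(
  year = sample(2018:2020, n, replace = TRUE),
  annual_quantity = rlnorm(n, meanlog = 12, sdlog = 0.5)
)
dat$input_jan <- 1e-5 * dat$annual_quantity + rnorm(n, sd = 0.3)
dat$input_jan[sample(n, 150)] <- NA  # most monthly values are missing

# Standardise every column using its observed values only
centers <- vapply(dat, mean, numeric(1), na.rm = TRUE)
sds <- vapply(dat, sd, numeric(1), na.rm = TRUE)
scaled <- dat
scaled[] <- Map(function(col, m, s) (col - m) / s, dat, centers, sds)

# Impute on the standardised scale with predictive mean matching
imp <- mice(scaled, m = 5, maxit = 10, method = "pmm",
            printFlag = FALSE, seed = 4)

# Convert one completed data set back to the original units
completed <- complete(imp, 1)
completed[] <- Map(function(col, m, s) col * s + m, completed, centers, sds)
summary(completed$input_jan)
```

Because pmm fills each gap with an observed donor value, the back-transformed imputations stay within the range of the observed monthly inputs.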
