Hi all, first time poster here. I’m a data scientist working in the New Zealand dairy industry.
I have a problem involving two data sets. The first has about 140,000 rows with region, year and annual_quantity. The second has about 1,400 rows with region, year, annual_quantity and a range of inputs by month. Each row is one business in one year. I do not have business IDs to match rows between the data sets.
I want to extrapolate the second data set to infer/impute monthly inputs across the entire first data set. I have to prepare a report of monthly inputs for the whole first data set.
What I have done so far is to build a combined dataframe from both data sets (about 85% missing values) and use the MICE package in R to impute the missing values. The idea is that region x year x annual_quantity can be used to model the monthly inputs, and that the iterative nature of MICE will eventually find a stable relationship.
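To make the setup concrete, here is a minimal sketch of what I mean, using small simulated data rather than the real data sets. The column names input_jan and input_feb, the row counts and the simulated relationship are all made up for illustration, and the chain settings are kept tiny so it runs quickly. As far as I know, recent versions of mice use the ranger package for method = "rf", so the imputation step is skipped if either package is missing.

```r
# Simulated stand-ins for the two data sets; all names and sizes are made up.
set.seed(1)
n_big <- 300    # stands in for the ~140,000-row data set
n_small <- 30   # stands in for the ~1,400-row data set

make_rows <- function(n) {
  data.frame(
    region = factor(sample(c("A", "B", "C"), n, replace = TRUE),
                    levels = c("A", "B", "C")),
    year = sample(2015:2020, n, replace = TRUE),
    annual_quantity = rlnorm(n, meanlog = 5, sdlog = 0.5)
  )
}

big <- make_rows(n_big)
small <- make_rows(n_small)

# Hypothetical monthly inputs, observed only in the small data set.
small$input_jan <- 0.10 * small$annual_quantity + rnorm(n_small, sd = 2)
small$input_feb <- 0.08 * small$annual_quantity + rnorm(n_small, sd = 2)

# Add the monthly columns to the big data set as NA, then stack the two.
monthly_cols <- setdiff(names(small), names(big))
big[monthly_cols] <- NA_real_
combined <- rbind(small, big[names(small)])

# Share of missing cells in the stacked data frame.
frac_missing <- mean(is.na(combined))
print(frac_missing)

# Random forest imputation; skipped if the packages are not installed.
if (requireNamespace("mice", quietly = TRUE) &&
    requireNamespace("ranger", quietly = TRUE)) {
  library(mice)
  imp <- mice(combined, m = 5, maxit = 5, method = "rf",
              printFlag = FALSE, seed = 123)
  completed <- complete(imp, action = 1)  # first of the 5 completed data sets
  print(head(completed))
}
```

In this toy version only the monthly columns have missing values, so region, year and annual_quantity act purely as predictors.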
I have tried a range of methods in MICE. Many of them crash (due to a singular matrix, I think), but random forest (rf) at least runs and gives answers. Currently I am using 5 reps and 20 iterations, but this doesn’t quite converge, so I’m going to try 80 iterations tonight.
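For the convergence question, one option I am aware of is to continue the existing chains with mice.mids() instead of restarting from scratch, and then look at the trace plots from plot(). The sketch below uses the nhanes example data that ships with mice and the default imputation methods, not my dairy data or rf, so it only illustrates the mechanics.

```r
have_mice <- requireNamespace("mice", quietly = TRUE)

if (have_mice) {
  library(mice)

  # Short initial run on the built-in nhanes example data.
  imp <- mice(nhanes, m = 5, maxit = 5, printFlag = FALSE, seed = 123)

  # Continue the same chains for 10 more iterations (15 in total).
  imp_more <- mice.mids(imp, maxit = 10, printFlag = FALSE)
  print(imp_more$iteration)

  # Trace plots of the chain means and standard deviations per variable.
  # Roughly speaking, the chains should intermingle without a visible trend.
  print(plot(imp_more))
}
```

If that works the same way on the real data, the 20 iterations already computed would not need to be thrown away before adding more.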
I also need to test whether the Missing At Random (MAR) assumption is too badly violated. How would I do this?
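The only check I can think of is a partial one: comparing the businesses that have monthly inputs with those that do not, on the variables observed for everyone. The sketch below does this with a logistic regression on simulated data. The has_monthly indicator and the selection mechanism are made up for illustration. My understanding is that this can only show whether missingness depends on the observed variables, and cannot distinguish MAR from missingness that depends on the unobserved monthly inputs themselves.

```r
# Simulated data; the selection mechanism below is invented for illustration.
set.seed(42)
n <- 2000
d <- data.frame(
  region = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  year = sample(2015:2020, n, replace = TRUE),
  annual_quantity = rlnorm(n, meanlog = 5, sdlog = 0.5)
)

# In this simulation, larger businesses are more likely to have monthly data.
z <- (log(d$annual_quantity) - 5) / 0.5
d$has_monthly <- rbinom(n, size = 1, prob = plogis(-2.5 + 1.0 * z))

# Does having monthly data depend on the fully observed variables?
fit <- glm(has_monthly ~ region + factor(year) + log(annual_quantity),
           family = binomial, data = d)
print(summary(fit)$coefficients)

# Compare the annual_quantity distribution between the two groups.
print(tapply(d$annual_quantity, d$has_monthly,
             quantile, probs = c(0.1, 0.5, 0.9)))
```

On the real data, has_monthly would simply flag which of the stacked rows came from the 1,400-row data set.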
Any suggestions on how to do this kind of imputation? Thanks so much for your help!