Imputation in multilevel data - categories missing completely

Neil_Lawrence1 · May 29, 2024, 12:49pm

Dear all,

I have biomarker data from a registry, but I am unsure which units some centres use for each biomarker. For those centres I am therefore missing the conversion factor needed to standardise the units across the whole dataset. The distribution of the values from a centre offers a clue as to which units it uses: the more readings there are from a centre, the more likely it is that I can correctly impute the unit, and therefore the conversion factor.

I am therefore left with clustered multilevel data in which I need to impute a categorical variable that equates to a numeric conversion factor. I can imagine that multiple imputation would impute the units differently in different imputed datasets, but I think that within any one imputed dataset the imputed units should be the same for all values from a given centre.

I am confused as to which package I can use to impute this variable, and how best to organise the data to obtain sensible imputations. Does anyone have any advice about which methodology or R package would be best for this task?

Sudhi_Upadhyaya · May 29, 2024, 1:04pm

Hi Neil,

I have two answers to your question.

mice: You can use the functions in the “mice” package to address this issue. Specifically,

“An entry of -2 in the predictor matrix signals the cluster variable, whereas an entry of 3 indicates that the cluster means of the covariates are added as a predictor to the imputation model.” See Chapter 7, Section 7.10.2.

aregImpute (in the Hmisc package): I have not used this yet, but it has more advanced features.

With either option, I would start with predictive mean matching (method = "pmm" in mice, type = "pmm" in aregImpute) and see whether you get reasonable results.

https://www.jstatsoft.org/index.php/jss/article/view/v045i03/550
https://www.jstatsoft.org/index.php/jss/article/view/v045i03/2357

Neil_Lawrence1 · May 30, 2024, 8:48am

Dear Sudhi,

Thank you very much for your response. Predictive mean matching from mice won’t quite work in this context, at least not in the straightforward way you suggest.

This is likely due to my clumsy way of explaining the problem, apologies. To put it a slightly different way: I have clustered data in which the measurements within each cluster are all complete. However, I do not know the units of measurement for some of the clusters. I do know that the units used by such a cluster will be the same as those of one of the other clusters; I’m just not certain which one. I need an algorithm that looks at the distribution of all of the data within each cluster, and imputes the units to be applied to all of the values within the clusters where I don’t know the units. I therefore cannot use predictive mean matching with donor values from within a cluster, because the target variable is entirely missing for some clusters and needs to be imputed from the similarity of that cluster to the other clusters.

I suppose another way to think of this is that I have three-level data. The first level is the values (which are complete), the second level is the centres (which are complete), and the third level is the units used by each centre (which are categorical and incomplete, and from which I can calculate a conversion factor to standardise the data across all the centres). Using predictive mean matching for this third level of my data would be very useful - I want to take the units from other clusters and apply them to the clusters that I don’t have units for. What is the best methodology for applying predictive mean matching to this third-level categorical variable?
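
To make the idea concrete, here is a minimal Python sketch of what such a centre-level imputation could look like outside of any imputation package. It is not predictive mean matching and not a mice method. It scores each candidate unit by how well a centre's converted readings match the readings from centres with known units, then draws one unit per centre for each imputed dataset. It rests on strong assumptions: readings are taken to be roughly log-normal in the standard unit, the candidate units and factors in `FACTORS` are hypothetical, all function names are made up for illustration, and uncertainty in the reference distribution is ignored.

```python
import numpy as np

# Hypothetical candidate units and factors that convert to the standard unit.
FACTORS = {"unit_a": 1.0, "unit_b": 10.0}

def unit_posterior(values, ref_mu, ref_sigma, factors=FACTORS):
    """Posterior probability of each candidate unit for one centre (flat prior).

    Assumes log(standardised reading) ~ Normal(ref_mu, ref_sigma).
    """
    logs = np.log(np.asarray(values, dtype=float))
    names = list(factors)
    # Log-likelihood of the readings under each candidate unit. Terms that
    # are identical for every candidate are dropped because they cancel.
    loglik = np.array([
        -0.5 * np.sum(((logs + np.log(factors[n]) - ref_mu) / ref_sigma) ** 2)
        for n in names
    ])
    w = np.exp(loglik - loglik.max())  # subtract the max for numerical stability
    return dict(zip(names, w / w.sum()))

def impute_units(data, known_units, m=5, seed=0, factors=FACTORS):
    """Draw m completed centre-to-unit mappings.

    data: {centre: readings}; known_units: {centre: unit} where the unit is known.
    Every reading from a centre shares that centre's unit within one imputation.
    """
    rng = np.random.default_rng(seed)
    # Reference distribution: log readings from known centres, standardised.
    ref = np.concatenate([
        np.log(np.asarray(data[c], dtype=float) * factors[u])
        for c, u in known_units.items()
    ])
    mu, sigma = ref.mean(), ref.std(ddof=1)
    post = {c: unit_posterior(v, mu, sigma, factors)
            for c, v in data.items() if c not in known_units}
    imputations = []
    for _ in range(m):
        units = dict(known_units)
        for c, p in post.items():
            names = list(p)
            units[c] = str(rng.choice(names, p=[p[n] for n in names]))
        imputations.append(units)
    return imputations
```

Because the log-likelihood sums over readings, a centre with more readings gets a sharper posterior, and a centre with few ambiguous readings can receive different units in different imputed datasets while staying consistent within each one.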

kiwiskiNZ · May 31, 2024, 10:28pm

Greetings Neil,
I guess you may have already tried this, and I have had similar issues in the past, but the clinical chemistry labs and the people in charge of EHRs at each institution must have a record of which assay was in use at which times. If you have not already exhausted that possibility, I would go back to them and try to fill in a few blanks first. Having said that, the distribution is sometimes a giveaway, but that depends a bit on how many assays are out there for the same analyte. Sometimes (e.g. troponin I) it is not merely a matter of different units: the specific assays can also have different distributions. You need to be confident that this is not the case with your data.
All the best!
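
A rough way to check the point about assays having different distributions: a change of units multiplies every reading by a constant, which shifts the log-transformed values but leaves their spread unchanged. A centre whose log-scale standard deviation is far from the others may therefore differ by more than units. The Python sketch below illustrates that check only; the function name and the threshold `tol` are hypothetical, and a similar spread does not prove that two assays are equivalent.

```python
import numpy as np

def flag_spread_outliers(data, tol=0.5):
    """Return centres whose SD of log readings differs from the median
    centre SD by more than a factor of (1 + tol) in either direction.

    data: {centre: readings}, with all readings > 0.
    """
    log_sd = {c: np.std(np.log(np.asarray(v, dtype=float)), ddof=1)
              for c, v in data.items()}
    med = np.median(list(log_sd.values()))
    return sorted(c for c, s in log_sd.items()
                  if s > med * (1 + tol) or s < med / (1 + tol))
```

A centre that reports the same readings in a different unit is not flagged, since rescaling does not change the log-scale spread; a centre with a much wider or narrower spread is.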