Creating composite variables with partial missing values

lorenzoFabbri · September 29, 2023, 9:42am

I have a dataset with p variables and n observations. The variables are endogenous metabolites. I was asked to compute some additional variables (scores, which are essentially ratios of the original variables) from the original ones. The original variables contain missing values, and part of my pipeline so far is to pre-process them, including imputation. I was wondering what the best approach is to generate these scores:

1. Use the raw original variables to generate the scores. If any of the original variables contains a missing value, the score will be NA. Then process the scores. The problem I see here is that if score_1 is the sum of score_2 to score_20, and only score_3 has a missing value, then score_1 is NA. So I would end up imputing the whole score even though most of its components are available.
2. Use the raw original variables to generate the scores, as before, but exclude the missing values when summing. In this case we would still obtain a numerical score. This method feels wrong, since the number of components that contribute to the score would differ from subject to subject.
3. Process the raw original variables, including missing-value imputation, and then create the scores.
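
To make the three options concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the variable names `m1` to `m3`, the toy values, and the score defined as a simple sum. The column-mean imputation is only a placeholder for whatever imputation method the real pipeline uses.

```python
from statistics import mean

# Hypothetical toy data: 3 subjects x 3 metabolites; None marks a missing value.
data = [
    {"m1": 1.0, "m2": 2.0, "m3": 3.0},
    {"m1": 2.0, "m2": None, "m3": 4.0},
    {"m1": 3.0, "m2": 4.0, "m3": 5.0},
]
VARS = ["m1", "m2", "m3"]


def score_strict(row):
    """Option 1: the score is missing if any component is missing."""
    values = [row[v] for v in VARS]
    return None if any(x is None for x in values) else sum(values)


def score_skip_missing(row):
    """Option 2: sum only the components that were observed."""
    return sum(row[v] for v in VARS if row[v] is not None)


def impute_column_means(rows):
    """Placeholder imputation (an assumption): fill gaps with the column mean."""
    means = {v: mean(r[v] for r in rows if r[v] is not None) for v in VARS}
    return [
        {v: (r[v] if r[v] is not None else means[v]) for v in VARS}
        for r in rows
    ]


option1 = [score_strict(r) for r in data]                       # score NA, imputed later
option2 = [score_skip_missing(r) for r in data]                 # missing components dropped
option3 = [score_strict(r) for r in impute_column_means(data)]  # impute first, then score

print(option1)  # [6.0, None, 12.0]
print(option2)  # [6.0, 6.0, 12.0]
print(option3)  # [6.0, 9.0, 12.0]
```

For the second subject, option 2 gives a score built from two components instead of three, which is the subject dependence I am worried about, while option 3 gives a score that depends on how well the single missing component was imputed.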

Suggestions are welcome.