I have a dataset with p variables and n observations. These variables are endogenous metabolites. I was asked to compute some additional variables (scores, which are essentially sums or ratios of the original variables) from the original ones. The original variables contain missing values, and part of my pipeline so far is to pre-process them, including imputing the missing data. I am wondering what the best approach is to generate these scores:
- Use the raw original variables to generate these scores. If any of the original variables contains a missing value, the score will be NA. Then process these scores. The problem I see here is that if score_1 is the sum of 20 variables, and only one of them has a missing value, then score_1 is NA. So I am going to impute the score even though the majority of the original variables are available.
- Use the raw original variables to generate these scores, as before, but exclude the missing values when summing. In this case we would still obtain a numerical score. This method feels wrong, since the number of variables entering the sum would differ from subject to subject, so the scores would not be comparable.
- Process the raw original variables, including missing-value imputation, and then create the scores.
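To make the three options concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration only: the values, the function names, the use of a plain sum as the score, and column-mean imputation as a stand-in for whatever imputation method the real pipeline uses.

```python
import math
from statistics import mean

NA = float("nan")  # stand-in for a missing value

# Hypothetical toy data: 3 subjects (rows) x 4 metabolites (columns).
data = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, NA, 3.0, 4.0],
    [2.0, 2.0, NA, NA],
]


def score_strict(row):
    """Option 1: sum the raw values; the score is NaN if any input is missing."""
    return sum(row)  # NaN propagates through addition


def score_skip_missing(row):
    """Option 2: sum only the observed values, ignoring missing ones."""
    return sum(x for x in row if not math.isnan(x))


def impute_column_means(rows):
    """Option 3, step 1: fill each missing value with its column mean.

    Mean imputation is only a placeholder for a real imputation method.
    """
    columns = list(zip(*rows))
    col_means = [mean(x for x in col if not math.isnan(x)) for col in columns]
    return [
        [m if math.isnan(x) else x for x, m in zip(row, col_means)]
        for row in rows
    ]


strict = [score_strict(row) for row in data]
skipped = [score_skip_missing(row) for row in data]
imputed = [sum(row) for row in impute_column_means(data)]

print("option 1 (NA if any input is missing):", strict)
print("option 2 (skip missing inputs):       ", skipped)
print("option 3 (impute, then score):        ", imputed)
```

For the second subject, which is missing one metabolite out of four, option 1 gives NA, option 2 gives 8 (a sum over three variables instead of four), and option 3 gives 10.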
Suggestions are welcome.