I have a dataset with p variables and n observations. These variables are endogenous metabolites. I was asked to compute some additional variables (scores, which are essentially ratios of the original variables) from the original ones. The original variables contain missing values, and part of my pipeline so far pre-processes these variables, including imputing the missing data. I am wondering what the best approach is to generate these scores:
- Use the raw original variables to generate these scores. If any of the original variables contains missing values, the score will be `NA`. Then process these scores. The problem I see here is that if `score_1` is the sum of `score_2` to `score_20`, and if only `score_3` has a missing value, then `score_1` is `NA`. So I am going to impute a missing value even if the majority of the original variables are available.
- Use the raw original variables to generate these scores, as before, but exclude the missing values when summing. In this case we would still obtain a numerical score. This method feels wrong, since the result depends on the subject.
- Process the raw original variables, including missing values imputation, and then create the scores.
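To make the three options concrete, here is a minimal sketch in plain Python on a single toy subject. The metabolite names (`m1` to `m3`), the values, and the imputation values are all made up for illustration, and the "imputation" is just a lookup of a per-variable fill value, not a recommendation for any particular imputation method.

```python
# Hypothetical toy data: one subject, three metabolites, one of them missing (None).
subject = {"m1": 2.0, "m2": None, "m3": 4.0}

# Hypothetical per-metabolite fill values, standing in for whatever the
# imputation step of the pipeline would produce.
imputed_values = {"m1": 2.5, "m2": 3.0, "m3": 3.5}


def score_propagate_na(values):
    """Option 1: the summed score is missing if any component is missing."""
    if any(v is None for v in values):
        return None
    return sum(values)


def score_skip_na(values):
    """Option 2: sum only the observed components, dropping missing ones."""
    return sum(v for v in values if v is not None)


def score_after_imputation(row, fill):
    """Option 3: impute each missing component first, then sum."""
    return sum(fill[name] if v is None else v for name, v in row.items())


values = list(subject.values())
option_1 = score_propagate_na(values)                       # None
option_2 = score_skip_na(values)                            # 6.0
option_3 = score_after_imputation(subject, imputed_values)  # 9.0

print(option_1, option_2, option_3)
```

With these numbers, option 2 effectively treats the missing metabolite as contributing 0 to the sum, which is why the score is not comparable between subjects with different missingness patterns.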
Suggestions are welcome.