I have a dataset with p variables and n observations. The variables are endogenous metabolites. I was asked to compute some additional variables (*scores*, which are essentially ratios of the original variables) from them. The original variables contain missing values, and part of my pipeline so far is to pre-process these variables, including imputation. I am wondering what the best approach is to generate these scores:

- Use the raw original variables to generate these scores. If any of the original variables contains a missing value, the score will be `NA`. Then process these scores. The problem I see here is that if `score_1` is the sum of `score_2` to `score_20`, and if only `score_3` has a missing value, then `score_1` is `NA`. So I am going to impute a missing value even if the majority of the original variables are available.
- Use the raw original variables to generate these scores, as before, but exclude the missing values when summing. In this case we would still obtain a numerical score. This method feels wrong, since the result depends on the subject.
- Process the raw original variables, including missing-value imputation, and then create the scores.
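
To make the three options concrete, here is a minimal Python sketch on made-up numbers (this is not my actual pipeline). It assumes a sum-type score over three hypothetical components, uses `math.nan` in place of `NA`, and uses column-mean imputation purely as a placeholder for whatever imputation method would really be applied.

```python
import math

def strict_sum(values):
    # Option 1: any missing component makes the whole score missing.
    return math.nan if any(math.isnan(v) for v in values) else sum(values)

def skip_missing_sum(values):
    # Option 2: drop missing components, sum whatever is left.
    return sum(v for v in values if not math.isnan(v))

def mean_impute(column):
    # Placeholder imputation (an assumption): fill gaps with the observed column mean.
    observed = [v for v in column if not math.isnan(v)]
    fill = sum(observed) / len(observed)
    return [fill if math.isnan(v) else v for v in column]

# Toy data: rows are subjects, columns are three hypothetical components.
data = [
    [1.0, 2.0, 3.0],
    [2.0, math.nan, 4.0],
    [3.0, 4.0, 5.0],
]

strict = [strict_sum(row) for row in data]
skipped = [skip_missing_sum(row) for row in data]

# Option 3: impute each column first, then compute the score per subject.
imputed_columns = [mean_impute(list(col)) for col in zip(*data)]
imputed_rows = [list(row) for row in zip(*imputed_columns)]
impute_first = [sum(row) for row in imputed_rows]

print(strict)        # [6.0, nan, 12.0]
print(skipped)       # [6.0, 6.0, 12.0]
print(impute_first)  # [6.0, 9.0, 12.0]
```

For the second subject the three options give `NA`, 6 and 9 respectively. Skipping the missing component is equivalent to treating it as zero, so that subject's score is built from fewer terms than everyone else's.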

Suggestions are welcome.