I don't know the right terminology for this, but basically I have a set of rules for imputing a value in, let's say, column B based on the value in column A, row-wise. It goes something like this:
if column A = 0, then impute Column B = 35
if column A = 1, then impute Column B = 36
...
if column A = 2, then impute Column B = 37
There are about 150 such rules for imputing values in Column B based on values in Column A. Now multiply that by two: one set of rules if gender = male, another set of rules/values if gender = female. Coding these rules in R means a lot of if/then/else or case_when branches, i.e. a lot of lines of code (see the sketch below).
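For context, the rules-based version I am trying to avoid would look roughly like this (a minimal sketch: the data frame df, the gender coding, and the column names are assumptions on my part):

library(dplyr)

# Sketch of the rules-based approach, using the example rules above.
# The full version would need roughly 150 branches per gender (~300 total).
df <- df %>%
  mutate(B = case_when(
    gender == "male" & A == 0 ~ 35,
    gender == "male" & A == 1 ~ 36,
    gender == "male" & A == 2 ~ 37,
    # ... ~300 more branches for the remaining values and for gender == "female" ...
    TRUE ~ NA_real_   # leave unmatched rows as missing
  ))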
So instead, I tried this: lm(ColumnB ~ ColumnA, data = df), and what I get is an intercept and a beta for ColumnA. The adjusted R2 is around 0.998.
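Spelled out, the regression approach would be something like this (a sketch; the column names are placeholders, and fitting a separate model per gender is my assumption about how the two rule sets would be handled):

# One simple linear model per gender (column names are placeholders).
fit_m <- lm(ColumnB ~ ColumnA, data = subset(df, gender == "male"))
fit_f <- lm(ColumnB ~ ColumnA, data = subset(df, gender == "female"))

summary(fit_m)$adj.r.squared   # adjusted R^2, about 0.998 in my data
coef(fit_m)                    # intercept and slope for ColumnA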
So my question is: can I use the intercept and slope from the linear model to impute future values in Column B from arbitrary values in Column A, i.e. replace the rules above with a linear model? That would save me a lot of time and effort by avoiding writing 500 lines of if/then/else code.
Please note that I am keeping the intercept term when imputing values in Column B from values in Column A, because there is a rule of the form "if Column A = 0, then Column B = 35", so the intercept handles the Column A = 0 cases. So: can I use this regression-based approach, given adjusted R2 = 0.998, instead of writing 500 lines of if-then-else or case_when statements?
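Concretely, the imputation I have in mind for future rows is just this (again a sketch, using fit_m from above; new_df stands for future data where only Column A is known, and the example values are made up):

# Impute Column B for new rows from the fitted model; predict() uses the
# intercept for ColumnA == 0 and intercept + slope * ColumnA otherwise.
new_df <- data.frame(ColumnA = c(0, 1, 2, 57, 120))
new_df$ColumnB_imputed <- predict(fit_m, newdata = new_df)

# Equivalent by hand:
coef(fit_m)[1] + coef(fit_m)[2] * new_df$ColumnA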