RMS Semiparametric Ordinal Longitudinal Model

This is enormously helpful; thanks for the further research and for the constraint example. As a side note, in the simplest linear model case, failure to include important covariates will inflate type II assertion probabilities (none of what we’re talking about is “errors”).

The inflations you are seeing are pretty minor. It would be helpful to separate bias from problematic standard errors. Often SEs are affected more than \hat{\beta} and can be fixed by robust sandwich covariance estimators (but not when doing Bayesian modeling). Also I want to check that the omitted variables are “simple”, i.e., that they don’t act in a non-proportional-odds fashion or interact with time or previous state. If, on the other hand, there is non-proportional odds that is nonlinear in time, I would expect more problems.

It would be interesting to know if random effects fix this particular problem. I see no direct reason they should, but they do absorb some unknowns that might happen to help.

I assume MCSE is the Monte Carlo SE of the estimated \alpha.

Yes, I just need to keep in mind that many people I’m working with take the notion of frequentist errors very seriously, so no way around that for me currently. I’m also clear on predictive covariates reducing the MSE (though in simulation 1 there are no covariates, so type II should not be affected).

I’ve added bias to the table below (mean of the treatment-parameter estimates on the log-odds scale minus the true value of 0). I don’t think bias is the problem. I have not yet figured out how to use robust sandwich covariance estimators with vglm; any resource on that would be much appreciated.

Table: Comparison of Type I Error Rates and Bias Across Simulation Scenarios

| Model Specification | Sample Size | Type I Error (MCSE) | Bias (MCSE) | Simulations |
|---|---|---|---|---|
| Correctly Specified | 100 | 0.050 (0.007) | -0.002 (0.005) | 1000 |
| Correctly Specified | 250 | 0.056 (0.007) | -0.004 (0.003) | 1000 |
| Correctly Specified | 500 | 0.060 (0.008) | -0.001 (0.002) | 1000 |
| Missing Covariates | 100 | 0.069 (0.008) | -0.004 (0.005) | 1000 |
| Missing Covariates | 250 | 0.072 (0.008) | -0.003 (0.003) | 1000 |
| Missing Covariates | 500 | 0.062 (0.008) | -0.001 (0.002) | 1000 |
| Missing Covariates | 1000 | 0.079 (0.009) | -0.002 (0.002) | 1000 |

Yes, that’s correct
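(For a rejection indicator with true rate \alpha, the Monte Carlo SE is \sqrt{\hat{\alpha}(1-\hat{\alpha})/n_{sim}}; e.g., \sqrt{0.05 \times 0.95 / 1000} \approx 0.007, matching the tables. A jackknife over the simulation replicates gives essentially the same number.)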

Given that the data are simulated from the model below, where we know the effects to be “simple”, this can’t be the reason, unless I’m misunderstanding your point.

fit_po <- vglm(
  y ~ yprev + time_linear + time_nonlinear_1 + time_nonlinear_2 + ecog_fstcnt + diagnosis + age + albumin,
  family = cumulative(parallel = TRUE, reverse = TRUE),
  data = d
)

I also don’t quite understand what unknowns those would be, given that we’re dealing with simulated data.

The use of the word error here points out a major misconception on the part of your colleagues. \alpha has nothing to do with error probabilities, which must be unconditional (on the truth). \alpha is an assertion trigger probability, nothing more.

The reason to check \alpha is not because of a noble goal such as optimizing the probability of making correct decisions (it doesn’t do that) but rather because if we are doing a frequentist analysis we have a right to expect p-values to have uniform distributions under the null. So the bottom line is that it’s still important when being fully frequentist. The question then is what magnitude of non-uniformity of p-values is “good enough” under the null, with a second question of why pure frequentists don’t worry about \alpha inflation of this magnitude, which occurs routinely because normal approximations to sampling distributions don’t hold.
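One simple diagnostic along these lines, sketched in R (assuming pvals holds the p-values from simulations under the null):

# Sketch: check the uniformity of p-values simulated under the null
hist(pvals, breaks = 20, freq = FALSE, main = "Null p-values")
abline(h = 1, lty = 2)      # Uniform(0,1) reference density
ks.test(pvals, "punif")     # crude formal check of uniformity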

Nice work on looking at bias; maybe adding the root mean square of the information-matrix-based standard errors alongside the simulated (empirical) SEs would also help. Sandwich estimators may fix a problem that comes solely from SEs. I don’t know if the sandwich package supports VGAM, but the first thing to check is whether a residuals method for VGAM will compute the entire score matrix (N \times p).
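To make the mechanics concrete, here is a minimal sketch of the cluster-robust sandwich once such a score matrix is in hand; sandwich_vcov() is a hypothetical helper, and producing the N \times p score matrix U is exactly the piece that may be missing for VGAM:

# Hedged sketch: cluster-robust (Huber-White) sandwich covariance.
# U is the N x p matrix of per-row score contributions (hypothetical;
# this is what a VGAM residuals method would need to supply).
sandwich_vcov <- function(fit, U, cluster) {
    bread <- vcov(fit)                   # inverse information matrix, p x p
    U_cl  <- rowsum(U, group = cluster)  # sum score rows within each cluster
    meat  <- crossprod(U_cl)             # sum over clusters of u_c' u_c
    bread %*% meat %*% bread             # sandwich: A^{-1} B A^{-1}
}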

Great points about the simulation setup and what we know to be true.

My only other suggestion right now is that if you concentrate on N = 250 and vary the strength of a single omitted covariate, that might shed some light.

I am second-guessing my previous response about random effects. Even though the data generating model does not include any random effects, adjusting for random effects may possibly be a way to account for omitted covariates when estimating standard errors.

One other follow-up thought: Whenever we see a problem we must examine whether there is a method available that does not have that problem. For example how would you translate this to a time-to-event analysis and what is the effect of strong unmeasured covariates on \alpha for a Cox PH model?


Yes, I was thinking along those lines! (Though this is just based on some vague ‘intuition’ rather than any theory, at which I’m pretty bad…)

I didn’t have time to get robust SEs to work with VGAM, but I did run some simulations as you suggested, just with naive SEs.

  • All effects are proportional and linear
  • 5 categories of Y
  • Intercepts: -6, -3, -2.5, -2 (category 1, reverse coded, is by far the most common initially)
  • 28 time points (1:28)
  • Proportional effect of time of log(0.95) on the log-odds scale
  • Effects of yprev similar to my historic data: yprev=2 = log(10), yprev=3 = log(3), yprev=4 = log(40)
  • Effect of omitted covariate X ranging from -4 to 4

Code:

Data Generation

Setup

library(tidyverse)
library(VGAM)
library(arrow)

Simulate baseline data

N <- 250000
x_effects <- seq(-4, 4, by = 0.5)
times <- 1:28

# intercept and parameters
ints <- c(-6, -3, -2.5, -2)

# treatment effect
parameter <- 0

Simulation Function

g <- function(yprev, t, tx_val, x_val, parameter = 0, extra_params) {
    # 1. Treatment Effect
    tx_effect <- parameter * tx_val
    # 2. Effect of covariate X
    cov_effect <- extra_params["X"] * x_val
    # 3. Effect of time
    time_effect <- extra_params["week"] * t
    # 4. Previous state effect
    yprev_effect <- 0
    # Create the name to look up in the 'extra' vector, e.g., "yprev=2"
    yprev_coef_name <- paste0('yprev=', yprev)
    if (yprev_coef_name %in% names(extra_params)) {
        yprev_effect <- extra_params[[yprev_coef_name]]
    }
    # Sum all effects to get the linear predictor
    lp <- as.numeric(tx_effect + cov_effect + time_effect + yprev_effect)
    return(lp)
}
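As a quick sanity check of g() (illustrative values only): for a treated patient in state 2 at week 3, with X = 1 and an X effect of 1.5, the linear predictor should be log(10) + 3*log(0.95) + 1.5:

extra_check <- c(`yprev=2` = log(10), `yprev=3` = log(3), `yprev=4` = log(40),
                 week = log(0.95), X = 1.5)
g(yprev = 2, t = 3, tx_val = 1, x_val = 1, parameter = 0, extra_params = extra_check)
#> 3.649  (treatment effect is 0, so only yprev, time, and X contribute)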

Apply simulation function

Create baseline data

X_base <- data.frame(
    tx = sample(c(0, 1), size = N, replace = TRUE),
    X = rnorm(N),
    yprev = sample(c("1", "2", "3", "4"), N, replace = TRUE,  prob = c(0.8, 0.1, 0.05, 0.05))
) |>
mutate(patient_id = row_number())

Generate datasets

for (current_x_effect in x_effects) {
    
    cat(paste0("--- Starting simulation for X effect = ", current_x_effect, " ---\n"))

    # a. Set the model parameters for the current scenario.
    extra <- c(
        `yprev=2` = log(10), 
        `yprev=3` = log(3), 
        `yprev=4` = log(40),
        week      = log(0.95)
    )
    extra["X"] <- current_x_effect

    # b. Initialize the matrix to store patient states over time.
    state_matrix <- matrix(
        as.numeric(X_base$yprev), 
        nrow = N, 
        ncol = length(times) + 1,
        dimnames = list(X_base$patient_id, 0:28)
    )

    # c. Run the simulation over the time points.
    for (t in times) {
        yprev <- state_matrix[, t]
        active_idx <- which(yprev != 5)
        
        if (length(active_idx) == 0) {
          if (t <= max(times)) {
            state_matrix[, (t + 1):ncol(state_matrix)] <- 5
          }
          break
        }
        
        yprev_active <- yprev[active_idx]
        X_active <- X_base[active_idx, ]
        
        lp <- mapply(
            FUN = g,
            yprev = yprev_active,
            tx_val = X_active$tx,
            x_val = X_active$X,
            MoreArgs = list(t = t, parameter = parameter, extra_params = extra)
        )
        
        thresholds_matrix <- outer(ints, lp, FUN = "+")
        cum_probs <- t(plogis(thresholds_matrix))
        prob_matrix <- cbind(cum_probs, 1) - cbind(0, cum_probs)
        
        y_new_active <- apply(prob_matrix, 1, function(p) {
            sample(5:1, size = 1, prob = p)
        })
        
        state_matrix[, t + 1] <- yprev
        state_matrix[active_idx, t + 1] <- y_new_active
    }

    # d. Format the results into a long data frame.
    final_data <- as.data.frame(state_matrix) |>
        rownames_to_column(var = "patient_id") |>
        pivot_longer(
            cols = -patient_id,
            names_to = "time",
            values_to = "state"
        ) |>
        mutate(
            patient_id = as.integer(patient_id),
            time = as.integer(time)
        ) |>
        left_join(X_base |> rename(yprev_initial = yprev), by = "patient_id") |>
        arrange(patient_id, time) |>
        group_by(patient_id) |>
        mutate(yprev = lag(state)) |>
        ungroup()

    # e. Save the final data to a Parquet file.
    file_name <- sprintf("simulation_x_effect_%.1f.parquet", current_x_effect)
    write_parquet(final_data, file_name)
    
    cat(paste0("Saved data to '", file_name, "'\n\n"))
}

cat("--- All simulations complete. ---\n")

Simulations

Setup

library(arrow)
library(tidyverse)
library(furrr)
library(glue)
library(fs)
library(here)
library(VGAM)
source(here("functions", "jackknife_mcse.R"))
source(here("functions", "check-completed-simulations.R"))
source(here("functions", "bayes-hosp-run-markov-simulation-iteration.R"))

Simulation function

run_simulation_iteration <- function(iter_num, base_path, full_data, model_formula, constraint = NULL, cumulative_arg, seed = 123, sample_size = 250) {
    # This function runs one full simulation iteration.
    
    # --- 1. Setup for the current iteration ---
    set.seed(seed + iter_num)

    # allocation ratio is currently hardcoded to 1:1
    tx_ids <- full_data |> 
        distinct(id) |> 
        pull(id) |> sample(round(0.5* sample_size), replace = FALSE)

    soc_ids <- full_data |> 
        distinct(id) |> 
        pull(id) |> 
        setdiff(tx_ids) |> 
        sample(round(0.5* sample_size), replace = FALSE)

    data_for_model <- full_data |> 
        filter(id %in% c(tx_ids, soc_ids)) |>
        mutate(
            tx = if_else(id %in% tx_ids, "1", "0"),
            tx = factor(tx, levels = c(0, 1)))
    
    rm(full_data)
    
    if (nrow(data_for_model) > 0) {
        model <- tryCatch({ 
            vglm(
                model_formula,
                cumulative_arg,
                constraints = constraint,
                data = data_for_model)
            
        }, error = function(e) {
            message(glue("ERROR in iteration {iter_num}: {e$message}"))
            
            return(NULL)
        })
        
        # --- 2. Save the results only if the model ran successfully ---
        if (!is.null(model)) {
            summary_df <- summary(model)@coef3 |> data.frame() |> 
                rownames_to_column("term")

            return(summary_df)
        }
    }
    return(NULL)
}

Sim Specs

N_SIMULATIONS <- 1000
N_CORES <- 8
future::plan(future.callr::callr, workers = N_CORES)

Run Simulations

OUTPUT_PATH <- here("9-markov-tests", "output")
DATA_PATH <- here("9-markov-tests", "data")

effect_sizes <- seq(-4.0, 4.0, by = 0.5)

# define model specs
main_formula <- ordered(y) ~ tx + time + yprev
cumulative_formula <- cumulative(reverse = TRUE, parallel = TRUE)

# Loop through each effect size, load the corresponding data, and run the simulation
for (effect in effect_sizes) {
    
    effect_str <- sprintf("%.1f", effect)
    
    message(glue("\n--- Running simulation for x_effect = {effect_str} ---\n"))
    
    INPUT_FILE <- path(DATA_PATH, glue("simulation_x_effect_{effect_str}.parquet"))
    RESULTS_FILE <- path(OUTPUT_PATH, glue("results_{effect_str}.parquet"))

    df_full <- read_parquet(INPUT_FILE) |> 
        filter(time > 0) |> 
        filter(yprev != 5) |> #df is with death carried forward, which we need to remove
        rename(
            id = patient_id,
            y = state) |> 
        mutate(
            y = factor(y, ordered = TRUE, levels = 1:5),
            yprev = factor(yprev, ordered = FALSE)
        )
    
    # --- Run Simulations in Parallel ---
    results <- furrr::future_map_dfr(
        .x = 1:N_SIMULATIONS,
        .f = ~ run_simulation_iteration(
            iter_num = .x,
            full_data = df_full,
            model_formula = main_formula,
            cumulative_arg = cumulative_formula,
            sample_size = 250,
            seed = 123
        ),
        .options = furrr_options(seed = TRUE),
        .progress = TRUE,
        .id = "iter"
    )

    write_parquet(results, RESULTS_FILE)
}

Summarise simulations results

result_files <- fs::dir_ls(OUTPUT_PATH, glob = "*.parquet")

process_simulation_results <- function(file_path) {
    filename <- fs::path_file(file_path)
    
     # Extract the omitted covariate's effect strength from the filename
    x_effect_strength <- as.numeric(stringr::str_extract(filename, "-?\\d+\\.\\d+"))
    
    # Read simulation data for the treatment term
    sim_data <- arrow::read_parquet(file_path) |>
        filter(term == "tx1")
    
    # Calculate Type I error and its MCSE
    type_i_indicators <- sim_data |> pull(`Pr...z..`) < 0.05
    type_i_error_rate <- mean(type_i_indicators, na.rm = TRUE)
    type_i_error_mcse <- jackknife_mcse(estimates = type_i_indicators)
    
    # Calculate Bias and its MCSE (true effect is 0)
    estimates <- sim_data |> pull(Estimate)
    bias <- mean(estimates, na.rm = TRUE)
    bias_mcse <- jackknife_mcse(estimates = estimates)

    # Return a tibble with formatted columns
    tibble::tibble(
        `X Effect Strength` = x_effect_strength,
        `Sample Size` = 250,
        `Type I Error (MCSE)` = sprintf("%.3f (%.3f)", type_i_error_rate, type_i_error_mcse),
        `Bias (MCSE)` = sprintf("%.3f (%.3f)", bias, bias_mcse),
        `Simulations` = nrow(sim_data)
    )
}

summary_table <- purrr::map_dfr(result_files, process_simulation_results) |>
    arrange(`X Effect Strength`)

knitr::kable(
    summary_table,
    caption = "Comparison of Type I Error and Bias for Varying Strengths of an Omitted Covariate"
)

Table: Comparison of Type I Error and Bias for Varying Strengths of an Omitted Covariate

| X Effect Strength | Sample Size | Type I Error (MCSE) | Bias (MCSE) | Simulations |
|---|---|---|---|---|
| -4.0 | 250 | 0.121 (0.010) | 0.001 (0.005) | 1000 |
| -3.5 | 250 | 0.121 (0.010) | -0.001 (0.005) | 1000 |
| -3.0 | 250 | 0.133 (0.011) | -0.003 (0.005) | 1000 |
| -2.5 | 250 | 0.116 (0.010) | 0.002 (0.004) | 1000 |
| -2.0 | 250 | 0.133 (0.011) | -0.000 (0.004) | 1000 |
| -1.5 | 250 | 0.128 (0.011) | -0.003 (0.004) | 1000 |
| -1.0 | 250 | 0.124 (0.010) | -0.004 (0.004) | 1000 |
| -0.5 | 250 | 0.078 (0.008) | -0.003 (0.003) | 1000 |
| 0.0 | 250 | 0.053 (0.007) | 0.010 (0.003) | 1000 |
| 0.5 | 250 | 0.103 (0.010) | 0.004 (0.003) | 1000 |
| 1.0 | 250 | 0.118 (0.010) | 0.003 (0.004) | 1000 |
| 1.5 | 250 | 0.133 (0.011) | 0.002 (0.004) | 1000 |
| 2.0 | 250 | 0.112 (0.010) | 0.004 (0.004) | 1000 |
| 2.5 | 250 | 0.130 (0.011) | 0.005 (0.004) | 1000 |
| 3.0 | 250 | 0.123 (0.010) | -0.001 (0.004) | 1000 |
| 3.5 | 250 | 0.131 (0.011) | -0.000 (0.005) | 1000 |
| 4.0 | 250 | 0.137 (0.011) | 0.001 (0.005) | 1000 |

So (if my code is correct) bias does not seem to be the issue; the problem is more likely SEs that are too small.

I’ll try to investigate whether random intercepts improve things, probably with the ordinal package (?). I will also try to obtain some robust SEs with the ordinal package.
It would probably also be interesting to look at the range between -1.5 and 1.5 log-odds for the omitted covariate with a bit more granularity and more iterations; that seems to be the range where most of the action is. I’m also interested in whether having no absorbing state changes things (I don’t see why it should).

Totally agree. Anecdotally, I ran some quick comparisons using negative binomial regression of days outside hospital with survival time as a log offset, and this was overly conservative. But nothing systematic so far. I hope to do more comparisons.

Edit: Just realized that since everything is proportional, I could just use rms for easy robust SEs. But I will try to get it to work with VGAM as well, because I also want to do some simulations with constrained non-PO effects.

Excellent work again! This is somewhat reassuring and paves the way for robust SEs to help, although I don’t know of a Bayesian counterpart. Also good thinking about seeing whether ordinal will handle robust SEs internally, and just using rms if PO holds everywhere.

Big picture: I re-simulated VIOLET after a lot of analyses to give me confidence in a first-order Markov process accurately representing the data. In general it would be helpful to simulate a variety of other datasets with other true models, for comparing methods. Max Rohde has done a wealth of simulations and may be joining this discussion. His simulations have not included the omitted predictor issue though. Perhaps he or someone else could take his code and extend it to that as this would give us performance relative to a few other analytical approaches.

As an aside I didn’t know that discourse implemented “click to expand” as you so nicely used above. How did you do that?

You can easily insert it like this:

[screenshot of the Discourse editor omitted]

Or just wrap the text in this:

[details="Summary"]
This text will be hidden
[/details]

After studying the manuals, ordinal does not seem to handle robust SEs. I’m also struggling to manually implement the ‘meat’ of the sandwich for VGAM. (VGAM really is the frequentist package with the most modelling options for our Markov models, so robust SEs for it would be a much-needed addition.)

Type I error rates

Naive SEs

The ordinal, VGAM, and rms packages give pretty much the same results when using naive SEs, which is what we expect. The minor differences are not due to simulation error, because all the models are fit on the same datasets; VGAM probably does some calculations slightly differently.

Table 1: Comparison of Type I Error and Bias for the Treatment indicator for Varying Strengths of an Omitted Covariate (rms::orm)

| X Effect Strength | Sample Size | Type I Error (MCSE) | Bias (MCSE) | Simulations |
|---|---|---|---|---|
| -1.5 | 250 | 0.127 (0.011) | -0.003 (0.004) | 1000 |
| -1.0 | 250 | 0.124 (0.010) | -0.004 (0.004) | 1000 |
| -0.5 | 250 | 0.077 (0.008) | -0.003 (0.003) | 1000 |
| 0.0 | 250 | 0.052 (0.007) | 0.010 (0.003) | 1000 |
| 0.5 | 250 | 0.103 (0.010) | 0.004 (0.003) | 1000 |
| 1.0 | 250 | 0.116 (0.010) | 0.003 (0.004) | 1000 |
| 1.5 | 250 | 0.133 (0.011) | 0.002 (0.004) | 1000 |

Table 2: Comparison of Type I Error and Bias for the Treatment indicator for Varying Strengths of an Omitted Covariate (ordinal::clm)

| X Effect Strength | Sample Size | Type I Error (MCSE) | Bias (MCSE) | Simulations |
|---|---|---|---|---|
| -1.5 | 250 | 0.127 (0.011) | -0.003 (0.004) | 1000 |
| -1.0 | 250 | 0.124 (0.010) | -0.004 (0.004) | 1000 |
| -0.5 | 250 | 0.077 (0.008) | -0.003 (0.003) | 1000 |
| 0.0 | 250 | 0.052 (0.007) | 0.010 (0.003) | 1000 |
| 0.5 | 250 | 0.103 (0.010) | 0.004 (0.003) | 1000 |
| 1.0 | 250 | 0.116 (0.010) | 0.003 (0.004) | 1000 |
| 1.5 | 250 | 0.133 (0.011) | 0.002 (0.004) | 1000 |

Table 3: Comparison of Type I Error and Bias for the Treatment indicator for Varying Strengths of an Omitted Covariate (VGAM::vglm)

| X Effect Strength | Sample Size | Type I Error (MCSE) | Bias (MCSE) | Simulations |
|---|---|---|---|---|
| -1.5 | 250 | 0.128 (0.011) | -0.003 (0.004) | 1000 |
| -1.0 | 250 | 0.124 (0.010) | -0.004 (0.004) | 1000 |
| -0.5 | 250 | 0.078 (0.008) | -0.003 (0.003) | 1000 |
| 0.0 | 250 | 0.053 (0.007) | 0.010 (0.003) | 1000 |
| 0.5 | 250 | 0.103 (0.010) | 0.004 (0.003) | 1000 |
| 1.0 | 250 | 0.118 (0.010) | 0.003 (0.004) | 1000 |
| 1.5 | 250 | 0.133 (0.011) | 0.002 (0.004) | 1000 |

Robust SEs

Using rms with robcov(orm_fit, cluster = data$id) for cluster-robust Huber-White SEs, the Type I error inflation seems to be eliminated. More iterations are needed to get more precise results, but any remaining inflation appears to be very modest.
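For reference, the fitting step looks roughly like this (a sketch; x = TRUE, y = TRUE is needed so that robcov() can access the design matrix and responses):

library(rms)
# Fit the PO model, then replace the information-matrix covariance with
# cluster-robust Huber-White SEs, clustering on patient id to account
# for the repeated transitions contributed by each patient.
orm_fit    <- orm(y ~ tx + time + yprev, data = data_for_model, x = TRUE, y = TRUE)
robust_fit <- robcov(orm_fit, cluster = data_for_model$id)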

Table 4: Comparison of Type I Error and Bias for Varying Strengths of an Omitted Covariate with cluster robust SEs (rms::orm with robcov)

| X Effect Strength | Sample Size | Type I Error (MCSE) | Bias (MCSE) | Simulations |
|---|---|---|---|---|
| -1.5 | 250 | 0.064 (0.008) | -0.003 (0.004) | 1000 |
| -1.0 | 250 | 0.062 (0.008) | -0.004 (0.004) | 1000 |
| -0.5 | 250 | 0.048 (0.007) | -0.003 (0.003) | 1000 |
| 0.0 | 250 | 0.055 (0.007) | 0.010 (0.003) | 1000 |
| 0.5 | 250 | 0.060 (0.008) | 0.004 (0.003) | 1000 |
| 1.0 | 250 | 0.047 (0.007) | 0.003 (0.004) | 1000 |
| 1.5 | 250 | 0.056 (0.007) | 0.002 (0.004) | 1000 |

Random Intercepts

Using ordinal’s clmm() with random intercepts (ordered(y) ~ tx + time + yprev + (1|id)) seems to absorb some of the effect of the unmeasured predictor. However, when the effect is small-ish (0.5 log-odds; I haven’t checked smaller non-zero values), convergence issues start to be a problem. I assume these convergence issues would be less of an issue for Bayesian models.
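For concreteness, the fit looks like this (a sketch; variable names follow the earlier simulation code):

library(ordinal)
# Proportional odds model with a per-patient random intercept; convergence
# was fragile when the omitted covariate's effect was small
clmm_fit <- clmm(ordered(y) ~ tx + time + yprev + (1 | id),
                 data = data_for_model, link = "logit")
summary(clmm_fit)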

Table 5: Ordinal Model (clmm with Random Intercept): Comparison of Type I Error and Bias

| X Effect Strength | Sample Size | Type I Error (MCSE) | Bias (MCSE) | Simulations |
|---|---|---|---|---|
| -1.0 | 250 | 0.075 (0.008) | -0.005 (0.005) | 997 |
| -0.5 | 250 | 0.068 (0.008) | -0.005 (0.004) | 982 |
| 0.5 | 250 | 0.081 (0.009) | 0.003 (0.004) | 983 |
| 1.0 | 250 | 0.060 (0.008) | 0.001 (0.005) | 994 |

For a null effect of the omitted predictor the model will not converge at all, so I’ve omitted that row from the table. I’ve also tested fewer X effect strengths, because computation took quite a while on my laptop. I removed non-converged iterations (probably not the best solution). In the simulated data X is standard normally distributed, so maybe that is one reason the random intercepts work well-ish; they might work less well if X were binary. However, I expect few truly discrete unmeasured predictors to exist in reality; we usually categorize for measurement. So I think random intercepts should work rather similarly for real data.

Bias

I decided to also look separately at the bias of all parameters, not just the treatment parameter. If we exclude the predictor, the estimates for yprev are biased. Intuitively this is not surprising: if the model ‘tries’ to predict the conditional thresholds well but lacks an important predictor, something else has to give for optimal predictions. What I don’t understand, though, is why it is just yprev that is affected and not time (or time only to a much lesser extent). Time might still become biased if the unmeasured predictor had a larger effect, but I haven’t had time to simulate that yet. What would happen if we interacted yprev with time? Would time also be more biased?
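If someone wants to probe that last question, the interacted model is a one-line change (sketch, using the orm fit from above):

# Sketch: let the effect of previous state vary with time
fit_int <- orm(y ~ tx + time * yprev, data = data_for_model, x = TRUE, y = TRUE)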

Table 6: Bias of all parameters for Varying Strengths of an Omitted Covariate (rms::orm)

| X Effect Strength | Sample Size | ‘tx’ Bias (MCSE) | ‘time’ Bias (MCSE) | ‘yprev=2’ Bias (MCSE) | ‘yprev=3’ Bias (MCSE) | ‘yprev=4’ Bias (MCSE) | Simulations |
|---|---|---|---|---|---|---|---|
| -1.5 | 250 | -0.003 (0.004) | -0.019 (0.000) | 0.400 (0.005) | 0.660 (0.006) | 0.400 (0.005) | 1000 |
| -1.0 | 250 | -0.004 (0.004) | -0.007 (0.000) | 0.284 (0.005) | 0.457 (0.006) | 0.284 (0.004) | 1000 |
| -0.5 | 250 | -0.003 (0.003) | -0.000 (0.000) | 0.111 (0.005) | 0.160 (0.007) | 0.111 (0.004) | 1000 |
| 0.0 | 250 | 0.010 (0.003) | -0.000 (0.000) | -0.010 (0.004) | -0.014 (0.007) | -0.010 (0.004) | 1000 |
| 0.5 | 250 | 0.004 (0.003) | -0.000 (0.000) | 0.110 (0.005) | 0.172 (0.006) | 0.110 (0.004) | 1000 |
| 1.0 | 250 | 0.003 (0.004) | -0.007 (0.000) | 0.277 (0.005) | 0.436 (0.006) | 0.277 (0.004) | 1000 |
| 1.5 | 250 | 0.002 (0.004) | -0.019 (0.000) | 0.404 (0.005) | 0.668 (0.007) | 0.404 (0.005) | 1000 |

The robust SEs will of course not affect bias in the parameters, but adding the random intercepts seems to debias all parameters. Again, I think this works especially well in this example because the unmeasured X is standard normal.

Table 7: Bias of all parameters for Varying Strengths of an Omitted Covariate when including a random intercept at the patient level (ordinal::clmm)

| X Effect Strength | Sample Size | ‘tx’ Bias (MCSE) | ‘time’ Bias (MCSE) | ‘yprev=2’ Bias (MCSE) | ‘yprev=3’ Bias (MCSE) | ‘yprev=4’ Bias (MCSE) | Simulations |
|---|---|---|---|---|---|---|---|
| -1.0 | 250 | -0.004 (0.005) | -0.000 (0.000) | 0.019 (0.005) | 0.016 (0.007) | 0.019 (0.004) | 1000 |
| -0.5 | 250 | -0.004 (0.004) | -0.000 (0.000) | -0.000 (0.005) | -0.021 (0.007) | -0.000 (0.004) | 993 |
| 0.5 | 250 | 0.004 (0.004) | 0.000 (0.000) | -0.001 (0.005) | -0.008 (0.007) | -0.001 (0.004) | 992 |
| 1.0 | 250 | 0.002 (0.005) | -0.001 (0.000) | 0.017 (0.005) | -0.006 (0.007) | 0.017 (0.004) | 1000 |

The three (tentative) conclusions I would draw from all this are:

  • When using the Markov model for a frequentist approach in an RCT, robust SEs seem to be a must, because there will always be unmeasured predictors.
  • When using a Bayesian approach, include random intercepts as the default (?).
    I’m not sure these results actually translate to Bayesian models, and my compute is too limited to check. However, I would assume that we’d end up overconfident if we left out the random intercepts, given that unmeasured predictors always exist.
  • Including predictive covariates is a must, not just for precision but also for frequentist ‘error’ control (unless using robust SEs).
    I think this is concerning, because IIRC I don’t need to do this for, e.g., OLS with a continuous outcome. The Markov model does not seem to be robust in that sense. Not sure about other models like Cox PH.

Sorry for the wall of text.

Completely agree. I would love Max to join the discussion. When I’m back from my holidays next week, I’ll set up a plan of more scenarios to simulate, and bring up possible collaboration in our research group.

In general, the scenarios that interest me (and probably also regulators) are:

  • How does omitting yyprev (the lag-2 state) or higher lags affect frequentist error rates and bias, depending on its predictive strength? Do robust SEs and/or random effects mitigate this?
  • How does ignoring non-proportional effects of time or other covariates affect the above?
  • Non-proportional effects of tx do not exist under a nil null hypothesis, so I doubt regulators care much about that.
  • And of course power, but IIRC Max has done a lot of work on that and it’s probably wise not to duplicate it.

All code can be found here.

Edit: I think it would also be valuable not only to focus on statistical properties but also on different clinical scenarios. E.g., in an influenza-like disease a treatment might make people leave the hospital earlier but lead to more relapses, resulting in more time spent in a worse state. While a Cox model with time to discharge or time to symptom resolution as the outcome might have perfect frequentist operating characteristics, it would still lead us to make the wrong decision.


Fabulous work!!

We still need to pursue the problem of what effect a missing predictive covariate has on currently used methods, and how to simulate data that can be used across different methods.

Bayesian methods don’t care about bias per se, but along these lines I’d like to see how a missing covariate affects Bayesian operating characteristics.

Bayes does run pretty well with random effects, but the effective MCMC sample size for the variance of the random effects is sometimes problematically low.


I had a couple of free minutes this evening. I took the Markov data and turned it into a survival dataset by taking the last (non-carried-forward) observation of each patient and fitting the simplest Cox model to it.

library(survival)

surv_data <- data |> 
  filter(time > 0, yprev != 5) |>
  group_by(patient_id) |> 
  arrange(desc(time)) |> 
  slice_head(n = 1) |>          # last non-carried-forward row per patient
  ungroup() |> 
  mutate(
    death    = if_else(state == 5, 1, 0),
    censored = if_else(death == 1, 0, 1))

m <- coxph(Surv(time, death) ~ tx, data = surv_data)

FYI, the omitted predictor had a log-odds effect of 1.5 for a higher (worse) category in the Markov data-generating model, which translated into a log hazard ratio of \sim 1.96 in the Cox model in this specific simulation. Omitting this predictor does not impact the ‘error’ rates.

Not sure this is the best way to test this…

Table: Cox Model: Comparison of Type I Error and Bias of Treatment Parameter with Missing Predictor X

| X Effect (log HR) | Sample Size | Type I Error (MCSE) | Bias (MCSE) | Simulations |
|---|---|---|---|---|
| -1.96 | 250 | 0.045 (0.007) | -0.002 (0.007) | 1000 |
| -1.55 | 250 | 0.041 (0.006) | -0.003 (0.008) | 1000 |
| -0.89 | 250 | 0.037 (0.006) | 0.001 (0.010) | 1000 |
| 0.0 | 250 | 0.041 (0.006) | 0.035 (0.011) | 1000 |
| 0.89 | 250 | 0.041 (0.006) | 0.003 (0.009) | 1000 |
| 1.55 | 250 | 0.053 (0.007) | 0.001 (0.008) | 1000 |
| 1.96 | 250 | 0.057 (0.007) | 0.006 (0.007) | 1000 |

This is very helpful. We also need to note any major power differences. Another kind of model whose \alpha accuracy we may be interested in is a longitudinal binary model. The simplest version of that is a non-longitudinal binary logistic model. I’m pretty sure this has been studied, but it would be good to know the effect on \alpha of a strong omitted covariate. This may be helpful.
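A minimal sketch of that check (illustrative only: a strong standard-normal covariate omitted from a logistic model with a truly null treatment effect):

# Sketch: empirical alpha for a binary logistic model omitting a strong covariate
set.seed(1)
pvals <- replicate(1000, {
    n  <- 250
    tx <- rbinom(n, 1, 0.5)
    x  <- rnorm(n)                           # the omitted covariate
    y  <- rbinom(n, 1, plogis(-1 + 2 * x))   # true tx effect = 0
    fit <- glm(y ~ tx, family = binomial)    # x left out of the model
    summary(fit)$coefficients["tx", "Pr(>|z|)"]
})
mean(pvals < 0.05)                           # empirical alpha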


I have been thinking a bit more about this. If we stick to a Bayesian framework with blrm(), wouldn’t it actually be more prudent/sensible to use Bayesian stacking as described here? (I haven’t tested whether this is currently technically possible with blrm(), just asking in general.) Otherwise we’d put a prior of 0 on all other specifications?

Are you meaning in the context of Bayesian model averaging? Why do the stacking?

Yes. I said stacking because I would assume that we are usually in an M-open setting, i.e., none of the models we fit to the data is the true model, and in that case Bayesian model averaging apparently can perform really badly. However, I’m just taking Gelman and co. at their word because the math here is somewhat beyond me. But yes, initially I was interested in model averaging!

Edit: I should have written averaging of predictive distributions using stacking.
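For what it’s worth, the mechanics in R usually go through the loo package; this sketch assumes the fits expose pointwise log-likelihood matrices (e.g., via log_lik() for Stan-based fits), which I have not verified for blrm():

library(loo)
# Hedged sketch: stacking weights for averaging predictive distributions
ll1 <- log_lik(fit1)   # pointwise log-likelihood matrix, model 1
ll2 <- log_lik(fit2)   # pointwise log-likelihood matrix, model 2
# (for MCMC output, loo() would also want r_eff; omitted in this sketch)
loo_model_weights(list(loo(ll1), loo(ll2)), method = "stacking")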


Hello Frank and Team,
We have a question about how to deal, in the semiparametric ordinal longitudinal model with an ordinal outcome scale, with levels of the scale that can occur at any time point (i.e., clinical events) when other levels are assessed only at discrete time points (12, 24, and 36 months).

We are re-analyzing with this model a clinical trial assessing the efficacy of beta-blockers on cirrhosis progression. In our data we have 5 states: no varices (y=0), small varices (y=1), high-risk varices (y=2), hepatic decompensation (y=3), and death (y=4). States 3 and 4 are our two absorbing states (end-of-trial events). Patients can only start in 0 or 1 (they have to be compensated, with no high-risk varices, at baseline). The first 3 states (y=0,1,2) are determined by endoscopies; however, these endoscopies are only routinely done every 12 months in the trial, while the absorbing states (y=3,4), which are clinical events, can occur at any time. Patients that progress in varices (develop small varices (y=1) or high-risk varices (y=2)) cannot regress, so the scale only moves in one direction.

This raises two questions for us:

  1. How do we handle the time variable? If everything were observed continuously we would model time with splines. How does that change when information on the 0–2 part of the scale is intermittent? Should we put knots at 12, 24, and 36?
  2. How do we handle the intermittence of the varices data and the “unknowns” between endoscopies? Between each 12-month period we filled in, in the long-format database, the value from the last known endoscopy; i.e., if a patient had y=1 at month 12 and y=2 at month 24, months 13–23 were given a value of y=1. However, in reality the patient could have progressed from y=1 to y=2 at any point between month 12 and month 24, so we cannot say for sure whether any of the months in between should be y=1 or y=2. Moreover, a patient with y=0 at month 12 and y=2 at month 24 could have y=0, y=1, or y=2 between months 12 and 24, leaving us with even more possibilities. We have seen in your videos that you put in [0-3] for those months to reflect the uncertainty, but we do not know how to input that in our Markov modelling process.

The problem is summarized in the comparison between the empirical occupancy states (big steps) and the Markov model SOPs (smooth transitions between the endoscopically defined states, even though endoscopy is done only every 12 months).


[Figure above: our empirical state occupancy data.]

Matthew and Juan


Hi Matthew, I’d expect some problems analysing your data with the Markov model.

Frank will correct me, but I think the transition model will estimate transition probabilities for 2 → 1, 1 → 0, and 2 → 0. If you coded those states as absorbing, patients couldn’t transition out of them at all, i.e., also not to a worse state like death, which is also not what you want. That’s the way the matrix multiplications for the SOPs are currently set up. I think the constraint ‘can transition to a worse state but not a better state’ is not supported / creates problems.

You can handle interval censoring using Ocens(); I used it in some code in earlier posts of this thread. However, you’ll run into issues if you have differential missingness in your outcome: e.g., you always know death, but not all of the varices data. The state occupancy probabilities for death (cumulative incidence) will then be overestimated; at least that’s what I expect. I think you’d probably fit the model only on the data at 12, 24, 36 months etc., where you have complete data, and make predictions for those time points only. Frank has done a ton of simulations which are linked throughout this thread. IIRC Max Rohde has also done some work exploring scenarios where you have data for some time points only and then linearly interpolate the SOPs. But that still leaves you with the issue of transitions being estimated that shouldn’t happen.

Edit: Actually, maybe if in your data a transition from 2 → 0 never happens, the model correctly assigns it a probability of 0. But I’d want to simulate that first.


These are good issues to deal with, and simulation is key. I believe that @Johannes_Schwenke’s last statement is true: the model estimates will assign probabilities close enough to 0.0 to impossible transitions, even when absorbing states are not involved.

Interval censoring (or possibly left censoring in this situation) is the most elegant way to handle partial information, and this is supported by rms::orm and rmsb::blrm. Don’t let the measurement resolution dictate how you model the effect of elapsed time. Since time needs to be discrete in the rows of data it is probably best for your situation to measure time in weeks.
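A sketch of how the interval-censored coding might look (the columns between_visits, last_seen, and next_seen are hypothetical bookkeeping for the bracketing endoscopies; see the rms documentation of Ocens() for the exact interface):

library(rms)
# Hedged sketch: between endoscopies, code y as the interval of possible
# states, e.g., state 1 at month 12 and state 2 at month 24 gives the
# interval [1, 2] for the weeks in between
d$ylo <- with(d, ifelse(between_visits, last_seen, y))
d$yhi <- with(d, ifelse(between_visits, next_seen, y))
fit <- orm(Ocens(ylo, yhi) ~ yprev + rcs(time, 4) + tx, data = d, x = TRUE, y = TRUE)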

There is a problem when a censored response is needed as the previous state for the next uncensored response. This is probably best handled by multiple imputation (making use of the known intervals to constrain the imputed values) but can be approximated by using modal categories: for example, compute the mode of y when y < 3.
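The modal-category approximation is a one-liner (sketch; assumes the censoring interval is y < 3):

# Sketch: modal category of y among rows with y < 3, to stand in
# for a censored previous state
yprev_mode <- names(which.max(table(d$y[d$y < 3])))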

@matthewtam I’d be interested to see the marginalized SOPs if you end up using interval censoring with the previous values (yprev) taken from multiple imputation. I bet that SOPs for death and hepatic decompensation will be overestimated, but I’ll happily be proven wrong.

Edit:

I don’t quite understand how one would implement this for the situation of @matthewtam. If someone had y = 0 at the first endoscopy and y = 1 at the second, they switched from 0 to 1 at some time point between the first and second endoscopy, and we’d want to impute plausible values for the ‘progression’, with all y = 0 before and all y = 1 after.

The last part sounds like constrained multiple imputation.

And a choice needs to be made between cross-sectional (usual) imputation and strictly within-person imputation working both forwards and backwards. Stata has a nice macro for doing that; not sure if that is implemented in R.