Relaxing Assumptions and Targeted Estimands with MOST

Hi all,

I would very much appreciate your thoughts on an idea before diving into the technical details and potentially getting stuck in tunnel vision. The idea is based on several points, so I’ll set the context first:

  1. I have a background in Generalized Pairwise Comparisons (Win Ratio and the like). I recently submitted a paper with Prof. Harrell and colleagues on the contrast between GPC, DOOR and MOST (Markov Ordinal State Transition model, discussed in the Datamethods thread “RMS Semiparametric Ordinal Longitudinal Model”). All that to say that I tend to think of Y in RCTs (what happens to patients) as inherently multi-component.

  2. I am drawn to MOST as the most faithful way of capturing Y (setting aside cases where ordering is not meaningful over short intervals). Two aspects appeal to me in particular: it retains more information than approaches such as the Win Ratio, and it does not impose a specific estimand.

  3. I have also recently spent some time on the baseline covariate adjustment literature, including TMLE. My question is not about TMLE per se, but rather about borrowing its philosophy. In my understanding, this consists of two steps:

    • first, modeling a component as flexibly and accurately as possible (in TMLE, the outcome regression for which they allow crazy ML methods if one thinks that helps),

    • then, a “targeting” step, which brings that flexible model back into a framework that yields valid inference for the estimand of interest. It essentially tweaks the initial model as little as possible while putting it back on target for the objective (usually the ATE in this literature).

Usually, that literature focuses solely on the X, trying to be “the best” on that part without discussing the Y anywhere.
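
To make the two steps concrete, here is a minimal sketch in Python. It assumes a binary outcome, a single baseline covariate, and 1:1 randomization (so the treatment probability g = 0.5 is known). A ridge-penalized logistic regression stands in for the flexible learner. The simulated data, the function names and the ridge value are illustrative assumptions and do not come from any TMLE package.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, offset=None, ridge=0.0, n_iter=50):
    """Newton-Raphson logistic regression with optional offset and ridge penalty."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta + offset)
        grad = X.T @ (y - p) - ridge * beta
        hess = X.T @ (X * (p * (1 - p))[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated RCT (hypothetical data-generating values)
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)
y = rng.binomial(1, expit(-0.5 + 0.8 * a + 1.0 * x))

# Step 1: initial outcome regression (stand-in for any flexible learner)
ones = np.ones(n)
beta = fit_logistic(np.column_stack([ones, a, x]), y, ridge=5.0)
Q_a = expit(np.column_stack([ones, a, x]) @ beta)          # at observed arm
Q_1 = expit(np.column_stack([ones, ones, x]) @ beta)       # everyone treated
Q_0 = expit(np.column_stack([ones, 0 * ones, x]) @ beta)   # everyone control

# Step 2: targeting, a one-parameter fluctuation along the "clever covariate"
g = 0.5                                  # known randomization probability
H = a / g - (1 - a) / (1 - g)
eps = fit_logistic(H[:, None], y, offset=logit(Q_a))[0]

Q_star_a = expit(logit(Q_a) + eps * H)
Q_1_star = expit(logit(Q_1) + eps / g)
Q_0_star = expit(logit(Q_0) - eps / (1 - g))
ate = np.mean(Q_1_star - Q_0_star)       # targeted risk difference
```

The fluctuation parameter `eps` is the only thing estimated in the second step, which is the sense in which the initial model is tweaked as little as possible. With an unpenalized logistic model that contains an intercept and the treatment term, `eps` would be essentially zero here, because the score equations of the first step already solve the targeting equation.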

  4. In practice, I operate in an environment where change has to be incremental. A pragmatic first step, as I see it, is to leave the standard estimands unchanged (e.g., a risk difference at a given time point). I am also working within a frequentist context, with its associated constraints such as multiplicity. While I am aware of Bayesians’ views on this, I am trying to move things forward gradually.

Against that background, here is the idea I would like to explore:

  • First, represent the data using MOST, to capture the patient trajectories as faithfully as possible. The components would be the ones that allow me to go back to the primary and secondary endpoints of usual trials (along with all relevant intercurrent events).

  • Second, fit an intermediate (“parent”) model that is optimized for these trajectories as a whole, and thus naturally introduces structural assumptions that allow information to be borrowed across time points and outcome levels.

  • Third, starting from that parent model, introduce an estimand-specific adjustment: one parent model, but different estimands (primary, secondary, etc.).

The key point is that I would like this adjustment to remain small: not a complete re-specification, but rather a minimal modification that nudges the model toward the estimand of interest. Conceptually, I imagine this as optimizing something akin to an MSE criterion: accepting a small increase in bias in exchange for a larger reduction in variance due to the structure imposed by the parent model.

There are two motivations behind this:

  • The assumptions that work well globally (e.g., proportional odds) may not be optimal for a specific estimand. For instance, if the estimand is tied to a cutoff, one might want to relax proportional odds except locally around that cutoff, so that behaviors far from it do not introduce unnecessary bias.

  • At the same time, maintaining a shared parent model and constraining how much it can vary across estimands could offer gains from a multiplicity perspective (which, again, reflects the practical constraints of my setting).

Lastly, an imperfect analogy that comes to mind is subgroup analysis: rather than analyzing each subgroup independently or pooling everything, the best approach is often somewhere in between. Here, I see a similar trade-off between a fully unbiased but inefficient approach, and a “targeted” version of MOST that introduces structure and accepts a small bias.

Stepping back, the ambition would be to say:

i. I aim to represent Y (what actually happens to patients) as faithfully as possible.
ii. I aim to make full use of what is known at baseline (X).
iii. I aim to preserve the estimands that stakeholders are used to, even if more informative alternatives may be possible in the future.
iv. The statistical machinery required to achieve this sits on my side.

Before taking this further, I would be very interested in your view: does this overall strategy make sense to you, or do you see fundamental issues that would make it untenable?

Many thanks in advance!

A warm welcome to datamethods Michael. I’m glad you are here.

This is an extremely important topic for which a full discussion will lead to our developing better statistical analysis plans.

I want to first state that the promise of targeted maximum likelihood estimation has not been fulfilled. Model uncertainty is always in play, and TMLE does not properly take it into account. TMLE is able to provide trustworthy point estimates, but as shown in this paper is not able to correctly estimate uncertainties in point estimates (it very much underestimates standard errors). The authors had to put the entire TMLE process inside a bootstrap loop to get valid SEs. One TMLE analysis is very computationally expensive, but doing it 200 times may be out of the question. This also points out a more fundamental problem with the usage of complex predictive methods based on adaptive (rather than pre-specified) methods when used in pivotal studies.

In addition to the challenge of TMLE in estimating uncertainties, it has not been adequately demonstrated that TMLE improves the mean squared error in estimating a targeted effect over what can be achieved by simpler methods. As an aside, the one time I tried to use the super learner used by TMLE, the random forest component of the super learner overfitted the data by a catastrophic amount, and the overall super learning gave far too much weight to this one model, resulting in a badly overfitted super learner ensemble.

When formulating an overall attack plan, it is usually important to consider all that is known about disease processes and their (possibly differential) effects on different outcome components. Sometimes a unified model (like the proportional odds model) works very well, and sometimes modeling is best done separately for different components. This is exemplified by the research of Phil Boonstra @philb discussed here. In his example, there is reason to model mortality separately from nonfatal outcome components. This is especially true if we expect risk factors for mortality to be much different from risk factors for nonfatal components.

In other cases, as shown here, it may be better to have a unified model but to allow pre-specified exceptions, here using partial proportional odds for treatment. This approach has the advantage of letting you specify how much borrowing you want to allow for learning the treatment effect on mortality vs. its other effects. This borrowing, through a skeptical prior on an X \times Y interaction effect, boosts the effective sample size (e.g., when it’s difficult to get a sample with enough deaths to learn about mortality effects of treatment on its own).
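
In symbols, a sketch of such a pre-specified exception could look as follows. The notation (\alpha_y, \beta, \beta_{tx}, \tau, \sigma) and the choice of a single exception at the death level are assumptions made for illustration, not taken from the linked analysis.

```latex
\operatorname{logit} P(Y \ge y \mid X, \mathrm{tx})
  = \alpha_y + X\beta + \beta_{tx}\,\mathrm{tx}
  + \tau\,\mathrm{tx}\,\mathbb{1}[\,y = y_{\text{death}}\,],
\qquad
\tau \sim N(0, \sigma^2)
```

Setting \tau = 0 gives proportional odds for treatment. A small \sigma keeps \tau close to zero, so the mortality effect borrows heavily from the other outcome levels. A large \sigma lets the mortality effect be learned almost entirely from the deaths themselves.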

A proportional odds model for which the PO assumption is strongly violated for treatment will still tell you which treatment is better, in terms of moving patients to better outcome levels. It may provide a wrong standard error of the log odds ratio—this needs to be studied. But the PO model will give you inaccurate estimates of outcome risks or treatment benefit for certain levels of the ordinal outcome Y.

I am trying to avoid specific targeting and instead spending more time on overall model specification.

Thanks a lot Frank, there are several interesting points here. Let me try to distinguish them and build on what you wrote:

  • TMLE: I am very interested in discussing the technique and the fact that it has not delivered on its promises. I think this is very important, as I have seen/heard suggestions that the usefulness of TMLE will lie not in the usual tabular covariates in RCTs but in its ability to incorporate unstructured raw baseline information (CT-scan, EEG, …), an idea that I find scientifically appealing if one is obsessed with being as faithful as possible to the X. However, I did not mean to focus on that now. Instead, I wondered whether one could borrow the philosophy at a high level: be as good as possible for an intermediate model, and then focus the model back on the estimand of interest (or multiple estimands in my case).

  • That being said, I guess your main point is that there is no shortcut to properly accounting for model uncertainty - a point that is very relevant for the two-model framework I had in mind (parent vs tailored to estimand of interest).

  • I agree with your point about no unique overall attack plan: there is no mathematical shortcut to translating all that is known about the disease process. So, to clarify, my intention is not to develop a one-size-fits-all type of thinking. I will provide an example below so that we can discuss more specifically, narrowing the scope of the topic.

  • I am very happy you highlighted the comment by Phil Boonstra. I think this merits quite some attention and I haven’t digested its consequences yet. I agree with your point that differences in the effect of risk factors on terminal events vs on other outcomes drive part of the choice between a single model and joint modeling. But is that all? If I do not expect a difference in the effect of risk factors, and I am interested in the terminal event as well, how would I decide to go for joint modeling rather than a single ordinal model that acknowledges that the terminal event is indeed worse? For the latter, is there any way to buy some protection against the highlighted issue of neighbor transfer from the worst state to the adjacent state rather than to all relevant states? I suppose this is less of a concern if the terminal event is incidental (e.g. death in an early neurology study) and I want it to be included in an “or worse” type of summary?

  • I take your point that global assumptions such as proportional odds are often not critical for the kinds of high-level summaries we are interested in in the first place (e.g. overall improvement across outcome levels). So I understand that from that point of view, my intention is not very convincing. However, recall my framework: stakeholders are usually very attached to some definitions of estimands. So I try to start from their familiar estimands, and embed them within a broader modeling framework (e.g. MOST) to stop the massacre about the estimand dictating how we record the Y, rather than asking them to redefine the problem from scratch.

Perhaps an example will make the debate more concrete. Suppose I have a 28-day study where patients are in one of four ordinal states each day:

  1. at home,
  2. hospitalized,
  3. on ventilator,
  4. dead.

The current approach in MOST is to say: “let’s model that longitudinal ordinal data, and then you pick the estimand of your choice - globally you should not overreact to assumptions like PO”. Now, I live in a world where usually the estimands are already defined. Suppose I have two:

  1. The probability of being at home at the end of the study,
  2. The mean time spent on ventilator or worse.
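
Both estimands are functionals of the same state occupancy probabilities (SOPs). As a minimal sketch in Python, assume a time-homogeneous first-order Markov chain with a made-up daily transition matrix and every patient starting in hospital. A fitted MOST model would instead supply covariate- and time-dependent transition probabilities. All numbers below are hypothetical.

```python
import numpy as np

states = ["home", "hospitalized", "ventilator", "dead"]

# Hypothetical daily transition matrix: rows = state today, columns = state tomorrow.
P = np.array([
    [0.970, 0.025, 0.003, 0.002],   # from home
    [0.100, 0.800, 0.080, 0.020],   # from hospitalized
    [0.000, 0.150, 0.780, 0.070],   # from ventilator
    [0.000, 0.000, 0.000, 1.000],   # death is absorbing
])

days = 28
start = np.array([0.0, 1.0, 0.0, 0.0])   # assumption: everyone starts hospitalized

# State occupancy probabilities, one row per day (day 0 = baseline)
sop = np.empty((days + 1, len(states)))
sop[0] = start
for t in range(1, days + 1):
    sop[t] = sop[t - 1] @ P

# Estimand 1: probability of being at home on day 28
p_home_day28 = sop[days, 0]

# Estimand 2: mean number of days (1..28) spent on ventilator or worse
mean_days_vent_or_worse = sop[1:, 2:].sum()
```

Computing `sop` once per treatment arm and differencing the two summaries gives both treatment contrasts from one model.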

What is typically done today:

  • the first is analyzed via dichotomization, usually only looking at the time point of interest (don’t throw rocks at me)

  • the second via counting days in certain states

  • with essentially no connection between the two analyses

How do we change that? My thinking is that if I promote MOST as such, I will get pushback because of the fear of the impact of global assumptions (eg PO) for the local estimands (risk difference at the specific cutoff “at home”). So I was wondering whether, in order to mitigate that, I should think about a layered approach where models would be “tweaked” (this should be another word that makes it look less frightening wrt pre-specification) for each estimand:

  • For the “at home at day 28” estimand: optimize more local structure, both in time and outcome value (e.g. borrowing more strongly between adjacent clinical states such as “home” and “hospitalized” than between “home” and “death”)
  • For the “time on ventilator or worse” estimand: as I look at something over the course of the whole study, I could imagine optimizing the time aspect differently than for the first estimand.

In the end, I suppose my question is “within a framework that tries to be as faithful as possible to the Y (eg MOST), how do we reconcile fixed, local estimands with a global model? Should we think about adapting/targeting the model to the several estimands?”

I am not used to developing a sequence of models and always try first to accommodate outcomes in a unified way. My goal is to have a unified model that is flexible enough to be accurate for all the main study goals and not so flexible as to overfit. There are two cohesive “grand strategies” to accomplish that:

  • Spend a lot of time eliciting priors for parameters representing departures from assumptions. Here this refers mainly to non-proportional odds (partial proportional odds) parameters.
  • Posit a model that has PO for treatment (but not for time) and a model that allows much different treatment effects for some of the major outcomes. Choose the model that is more likely to predict future patient outcomes better, i.e., the one of the two with the lower AIC.
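
The AIC comparison in the second strategy is simple bookkeeping once both models are fitted. A small Python sketch follows, in which the log-likelihoods and parameter counts are placeholder values rather than results from any real fit.

```python
def aic(loglik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * loglik

# Placeholder fit summaries (hypothetical numbers, for illustration only)
fits = {
    "po_for_treatment": {"loglik": -1500.0, "n_params": 12},
    "separate_treatment_effects": {"loglik": -1498.5, "n_params": 15},
}

aics = {name: aic(f["loglik"], f["n_params"]) for name, f in fits.items()}
best = min(aics, key=aics.get)
```

In this made-up case the richer model gains 1.5 log-likelihood units for three extra parameters, which is not enough to offset the penalty, so the PO model is retained.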

Thanks for this interesting thread @mdebacker.

I’m wondering when we/you would say a MOST is faithful enough to the raw trajectories / the raw SOPs. What criteria do we use?

E.g., in the example below, the marginal probability of being in state 1 is underestimated by the model at all time points, sometimes quite substantially, and the probability of being in state 2 is overestimated. (If you look closely at the plots in @f2harrell’s and Max Rohde’s paper on MOST, this happened in the ACTT-2 data as well, just not as much.)

I haven’t managed to relax the model in a way that still converges but results in substantially less bias in the SOPs (or there’s a mistake in my code):

If we are interested in a difference between groups in time spent in state k, would you abandon MOST?

Thanks for the follow-up, Johannes.

I think this is exactly my point (although my point is still a question, to be fair).
I don’t have a criterion in mind yet for MOST being “faithful enough” to the Y, but I’m sure this can be further debated on its own. What I had in mind is your last concern: should we leave aside MOST if we are only interested in something local (e.g., time spent in a specific state)? My gut feeling is no (even though your example challenges that): I still believe that in general there is interest in having an overall model (the one Frank would like to be flexible enough, yet not so flexible as to overfit) even if your question revolves around something local. I can see two reasons for wanting a global model:

  1. We are never interested in something local only; we are interested in several things that could all be deduced from a general MOST model. This does not directly answer the question, but it puts things in perspective.
  2. Even if we looked at one question only, I believe the structure that MOST brings through the model is in general beneficial in terms of the bias-variance trade-off. Though, again, your example challenges this.

To put a bit of “meat” on my thinking, though this is debatable of course, I was thinking about having one parent MOST model (fine-tuned from a global perspective) that would also be (hyper)parametrized by a quantity (or several) that I could then fine-tune differently for the different local questions I would be asking. Off the top of my head, something in the spirit of a fused Lasso for the partial PO wrt outcome levels (a penalty that pulls the coefficients of adjacent levels toward each other). Perhaps optimization of the hyperparameter(s) would then lead to different child MOST models that would be more tailored to local questions (e.g. for time in state k, I will likely end up with PO for states around that one and then no PO for states further away). But, to complicate things and have fun, I would also constrain the amount of fine-tuning wrt the parent model (because of overfitting concerns, and multiplicity considerations if several child models end up being quite similar).
Note that this is very similar to Frank’s second suggestion in his last message; I’m just being stubborn by sticking to a ‘parent’ model rather than using two models, as he suggested, and choosing in a principled manner the one that is most appropriate.
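
To spell out what I mean by the fused Lasso idea, here is a sketch for the non-longitudinal case with K ordered outcome levels and level-specific treatment effects \beta_y. The weights w_y and the exact form of the criterion are my assumptions for illustration, not an established method.

```latex
\begin{aligned}
\operatorname{logit} P(Y \ge y \mid \mathrm{tx}) &= \alpha_y + \beta_y\,\mathrm{tx},
  \qquad y = 2, \dots, K, \\
(\hat\alpha, \hat\beta)(\lambda) &= \arg\max_{\alpha,\,\beta}
  \Big\{ \ell(\alpha, \beta)
  - \lambda \sum_{y=2}^{K-1} w_y \,\lvert \beta_{y+1} - \beta_y \rvert \Big\}
\end{aligned}
```

Here \ell is the log-likelihood. As \lambda \to \infty all \beta_y are forced to be equal, which is plain PO for treatment, and \lambda = 0 leaves every level free. Intermediate values fuse some adjacent levels and not others. Choosing large w_y near the outcome level of interest and small w_y far from it would express “PO locally, relaxed elsewhere”.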

Now, I’m intrigued by your example. Could you specify what is partial in the PPO model in the third column?

Note: to simplify a bit, my suggestion can first be put in a non-longitudinal setting. If we have an ordinal outcome, how can we approach several local questions (each about one specific outcome level) while still trying to recognize similarity in the treatment effect across outcome levels whenever that is appropriate?

I have to confess that a fused Lasso is somewhat above my comprehension.

I fit the model with rcs(time, 6), so five basis functions to model time. The coefficient of each of these basis functions is allowed to deviate linearly from PO across the thresholds, so two parameters per basis function. In the second column, only the first basis function (which, when using a left truncated power basis, is the linear effect of time IIRC) is allowed to deviate linearly from PO.
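
Written out, as a sketch in which the symbols B_j, \beta_j, \gamma_j and f are notation introduced here and the remaining model terms are left out:

```latex
\operatorname{logit} P(Y_t \ge y \mid \cdot)
  = \alpha_y + \dots
  + \sum_{j=1}^{5} \big( \beta_j + \gamma_j\, f(y) \big)\, B_j(t),
\qquad f(y) \text{ linear in } y
```

B_1, …, B_5 are the spline basis functions of time. In the third column all \gamma_j are free (10 time parameters). In the second column only \gamma_1 is free and \gamma_2 = \dots = \gamma_5 = 0 (6 time parameters). PO in time would mean all \gamma_j = 0.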

My intuition, after having fit quite a few Markov models on various datasets by now, is that the models will end up very similar, but let’s see.

Oh, I thought you were aiming for partial PO for outcome levels in the third column, but that’s not exactly what you are doing, right? Is there any flexibility in your model for a different treatment effect on state 1?

Along those lines, @Johannes_Schwenke I need to know an exact source of the lack of fit, i.e., identify a specific parameter that is missing from the model.

This is a simulated null scenario, so PO for \beta_{tx} is true. The simulation is pretty convoluted, but essentially it’s a hidden Markov model (with random slopes) that defines sampling probabilities for each state, for each day, for each patient + an exponential model on top that chooses days each patient is allowed to change states, except for transition to the absorbing state (death), which can always happen.

But I might be misunderstanding your question: allowing for a separate parameter for the first threshold only (between State 1 and State 2) for all basis functions of time also has only minimal impact on marginalized SOPs. So one common OR for all thresholds + one deviation from that OR for threshold one, for all basis functions. In fact the impact is so small that I’m having doubts about the parameters being correctly used in the calculations for the SOPs. I’ll check.

Edit: The code seems to work correctly. Happy to share the dataset (or you can also simulate it yourselves using the package), if someone else wants to experiment.

Edit: Of course the first-order Markov assumption is being violated to a pretty strong degree with that setup. That’s on purpose. I was hoping that by relaxing the first-order model in other ways you could compensate for this, but maybe not.

I went back to Rohde et al 2024. Marginalized SOPs for state 1 are underestimated to a similar degree as in the simulated data above, if compared to the empirical SOPs. But I’m not sure if this is even a good check for model fit, compared to variograms, correlation plots, etc.

Also, I doubt this would matter very much for most conclusions drawn from the model, at least not directionally. Not sure what a regulator’s perspective would be…

Very interesting, once again.

My feeling is that we are mixing two elements: a local behavior (what happens for state 1) vs a global behavior (variograms, etc.). I’d argue that global fit is what we would be interested in initially, with the reassuring element that our assumptions like PO across outcome levels can be grossly violated without altering at least some information (the direction of the treatment effect). So, to your point about addressing fit, I would not diverge from the global checks first. But if we are now looking at things locally, e.g. landmark probabilities in a specific state, then I think we need to be careful that the null is not a global null: there can be no effect for my state of interest while there is an effect for all other states that will ‘contaminate’ my local effect if I’m not careful about that. If I recall, you had something similar in another study but in the reverse: only an effect in one state vs no effect in the other ones, so including the latter dilutes your capacity to detect what you are interested in if you allow in MOST for some structure across outcome values.

I keep circling around this concern that what is appropriate globally may not be appropriate locally, so that we need to pay attention to the way we introduce MOST. If we are in a complex setting (e.g. rare disease) and we would already be thrilled to say ‘there is something’, then I would not think about being picky with local questions. But in other settings, my feeling is that we need to think about a principled way to use global structure whenever appropriate, while buying some protection for local questions when it is not (be it because it’s going to ‘inflate’ alpha or dilute ‘power’ - I know this is not sufficiently well-formulated, but I’m sure you’ll understand what I mean).

As I wrote before, this is surely not only mathematical, as we need to have information about the disease process to be able to best anticipate when structure (borrowing across outcome values) may be appropriate. But I also anticipate we lack some methodological tools to do that well (in reference to the comment by Phil Boonstra mentioned earlier in this thread).

What I lack most (MOST) are some case studies to refine this and make it concrete. More to follow soon, hopefully.

Random slopes induce a pretty exotic correlation pattern. What I would love to have is a simulated dataset from a Markov-1 model (with possibly random intercepts) where we know we can model that part, but that we still can’t get SOPs right for one state. This may be covered by @philb’s work, and perhaps you have to relax the PO assumption with regard to Y=y-1 to get Y=y right?
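
A minimal sketch of such a simulator in Python, assuming four states coded 1 to 4 with higher being worse, death (4) absorbing, a proportional odds transition model given the previous state, and an optional patient-level random intercept. Every parameter value and function name below is a hypothetical choice for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_markov1_po(n=1000, days=28, beta_tx=0.0, sigma_u=0.0, seed=1):
    """First-order Markov PO model: logit P(Y_t >= y) = alpha_y + gamma[prev] + beta_tx*tx + u."""
    rng = np.random.default_rng(seed)
    alpha = np.array([1.0, -1.5, -4.0])   # intercepts for Y >= 2, 3, 4 (decreasing)
    gamma = np.array([-3.0, 0.0, 2.0])    # effect of previous state 1, 2, 3
    tx = rng.binomial(1, 0.5, size=n)     # 1:1 randomization
    u = rng.normal(0.0, sigma_u, size=n)  # random intercepts (all zero if sigma_u = 0)
    y = np.empty((n, days + 1), dtype=int)
    y[:, 0] = 2                           # assumption: everyone starts in state 2
    for t in range(1, days + 1):
        prev = y[:, t - 1]
        lp = gamma[np.minimum(prev, 3) - 1] + beta_tx * tx + u
        exceed = expit(alpha[None, :] + lp[:, None])     # P(Y_t >= 2), (>= 3), (>= 4)
        draw = rng.uniform(size=n)
        new = 1 + (draw[:, None] < exceed).sum(axis=1)   # inverse-CDF draw of the state
        y[:, t] = np.where(prev == 4, 4, new)            # death is absorbing
    return tx, y

def empirical_sops(y, n_states=4):
    """Proportion of patients in each state on each day; shape (days + 1, n_states)."""
    return np.stack([(y == k).mean(axis=0) for k in range(1, n_states + 1)], axis=1)

tx, y = simulate_markov1_po(sigma_u=0.5)
sop = empirical_sops(y)
```

With `sigma_u=0.0` the data are exactly Markov-1 and PO holds for treatment, so any mismatch between model-based and empirical SOPs for one state could not be blamed on the correlation structure.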

@mdebacker - great discussion.