Using trajectory analysis and G-formula in observational data

I have a data set of about 5,000 participants that includes their physical examination data from 1990 to 2018. Cardiovascular disease (CVD) incidence has been followed up to 2018. The data come from observational studies. I want to design a study to analyze the trajectory of triglycerides and the related risk of CVD.

My plan is: 1. analyze the trajectory of triglycerides using the Jointlcmm function in the lcmm package in R; 2. use the G-formula, via the gfoRmula R package, to calculate hazard ratios for CVD incidence according to the classes from the trajectory analysis. Because these are observational data, I want to use the G-formula to estimate the effects, which seems to be closer to causal effects.

I’m not a professional statistician, and I’m not sure if this is a good idea (or an idea that doesn’t make sense for clinical practice due to poor statistics), so I am here for some help. I can only get observational data, and I want to come up with acceptable results from these data if possible.

First of all, I would like to hear a professional statistician’s opinion.

I also have some specific questions.

  1. It seems that I need multiple physical examination measurements (triglycerides) to fit trajectories, but some participants attended only one or two examinations during follow-up. Should I exclude the ones with fewer examinations? If so, what should the minimum number of examinations be?

  2. I am reading the paper cited below. I could not understand the part “Generating Covariate Histories”. There are three pre-coded functions of history for a covariate: lagged, cumavg, lagavg. I don’t quite understand how to choose among them.
     gfoRmula: An R Package for Estimating the Effects of Sustained Treatment Strategies via the Parametric g-formula
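
For readers unfamiliar with the three history options named in question 2, here is a minimal sketch of what each one computes, as I understand the paper's description: lagged is the covariate value from an earlier visit, cumavg is the running average up to the current visit, and lagavg is the lagged running average. This is plain Python written for illustration, not gfoRmula code. The function bodies, the example values in `tg`, and the use of `None` for visits with no earlier value are assumptions (the package handles pre-baseline times in its own way, which is not shown here).

```python
def lagged(values, lag=1):
    """Covariate value `lag` visits earlier; None where no earlier visit exists."""
    return [values[t - lag] if t >= lag else None for t in range(len(values))]


def cumavg(values):
    """Running mean of the covariate from the first visit through visit t."""
    out, total = [], 0.0
    for t, v in enumerate(values):
        total += v
        out.append(total / (t + 1))
    return out


def lagavg(values, lag=1):
    """Running mean through visit t - lag (the lagged cumulative average)."""
    return lagged(cumavg(values), lag)


# Hypothetical triglyceride values for one participant at four visits.
tg = [1.0, 2.0, 3.0, 6.0]

print(lagged(tg))  # [None, 1.0, 2.0, 3.0]
print(cumavg(tg))  # [1.0, 1.5, 2.0, 3.0]
print(lagavg(tg))  # [None, 1.0, 1.5, 2.0]
```

Roughly, a lagged term suits a covariate whose most recent value is thought to drive the next treatment or outcome, while a cumulative average suits a covariate whose long-run level is thought to matter. That choice is a subject-matter judgment rather than something the software decides.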

Besides, if there are any other good methods for handling observational data with time-varying variables, please tell me.

Thank you very much.

This is not a good place for “classes” of trajectories but is a good setting for smooth nonlinear modeling of trajectories in continuous time.

It wasn’t clear why you made the jump from standard longitudinal analysis (using all available data, with variable number of observations per patient) to very complex methods.

Take a look at Regression Modeling Strategies - 7  Modeling Longitudinal Responses using Generalized Least Squares and Biostatistics for Biomedical Research - 15  Serial Data

Thank you very much, professor.
Our research group has barely used these longitudinal data (most of the time we use only a single baseline measurement), so I want to make use of them. I found some papers (for example, An introduction to g methods) that recommend G methods to evaluate the effects of exposure in observational data. Some well-known observational studies, such as the ARIC study, also use this approach; for example, Accounting for Time-Varying Confounding in the Relationship Between Obesity and Coronary Heart Disease: Analysis With G-Estimation: The ARIC Study.
(I know this may not be good practice, but we usually choose the analysis method by reading the papers of other good research groups.)

I want to confirm which one you consider complex: G methods alone, or trajectory analysis + G-formula? Because the G-formula needs a treatment strategy variable, I tried to get the classes from trajectory analysis and use them as the treatment strategy variable. Sorry for my limited knowledge; is this a strange idea?

I am very glad to have your suggestions. I want to confirm whether the standard longitudinal analysis you mentioned means a mixed model (or GEE, or a similar model). Also, could you give me some information about smooth nonlinear modeling of trajectories?

Don’t rely on “classes” in any way. These are ultimately arbitrary and hide a good deal of within-class heterogeneity. For complexity I was referring to g-methods. Analysis of trajectories in and of itself does not need that complexity. When you want to do causal inference with time-dependent confounding, the g-formula comes more into play.
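
For reference, the g-formula mentioned here can be written out in its standard textbook form for a fixed treatment strategy. The notation is an assumption, since none of it appears in this thread: $\bar a$ is the treatment history under the strategy, $\bar l$ the history of time-varying confounders, $Y$ the outcome, and $k = 0, \dots, K$ indexes visits. It identifies a causal quantity only under the usual conditions of exchangeability, positivity, and consistency.

```latex
E\left[ Y^{\bar a} \right]
  = \sum_{\bar l}
    E\left[ Y \mid \bar A_K = \bar a_K,\ \bar L_K = \bar l_K \right]
    \prod_{k=0}^{K} f\left( l_k \mid \bar l_{k-1},\ \bar a_{k-1} \right)
```

The product term is where time-dependent confounding enters: each confounder value is modeled given past treatment and past confounders, which an ordinary outcome regression adjusting for $\bar L$ does not do.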

Thank you very much, professor.

I will avoid using “classes” in future studies. I want to confirm whether you think g-methods are an accepted approach in this situation (sorry for asking again).

As I’ve stated previously, it depends entirely on your goal. If your primary goal is not causal inference from observational data, then I fail to understand the use of g-methods in this context.

Yes. I am trying to do causal inference.