I have a data set of about 5,000 participants that includes their physical examination data from 1990 to 2018. Cardiovascular disease (CVD) incidence has been followed up to 2018. These data come from observational studies. I want to design a study to analyze the trajectory of triglycerides and the related risk of CVD.

My plan is to (1) analyze the trajectory of triglycerides using the Jointlcmm function in the lcmm package in R, and (2) use the G-formula, via the gfoRmula R package, to calculate hazard ratios for CVD incidence according to the classes from the trajectory analysis. Because these are observational data, I want to use the G-formula to estimate the effects, which seems to come closer to causal effects.

I’m not a professional statistician, and I’m not sure whether this is a good idea (or one that makes no sense for clinical practice because of poor statistics), so I am here for some help. I can only get observational data, and I want to produce acceptable results from these data if possible.

First of all, I would like to hear a professional statistician’s opinion.

I also have some questions.

It seems that I need multiple triglyceride measurements from physical examinations to fit trajectories, but some participants attended only one or two examinations during follow-up. Should I exclude those with fewer examinations? If so, what should the minimum number of examinations be?

This is not a good place for “classes” of trajectories but is a good setting for smooth nonlinear modeling of trajectories in continuous time.

It wasn’t clear why you made the jump from standard longitudinal analysis (using all available data, with a variable number of observations per patient) to very complex methods.

I want to confirm which part you consider complex: g-methods alone, or trajectory analysis plus the G-formula? Because the G-formula needs a treatment strategy variable, I tried to get classes from the trajectory analysis and use them as the treatment strategy variable. Sorry for my limited knowledge; is this a strange idea?

I am very glad to have your suggestions. I want to confirm whether the standard longitudinal analysis you mentioned is a mixed model (or GEE or another similar model). Also, could you give me some information about smooth nonlinear modeling of trajectories?

Don’t rely on “classes” in any way. These are ultimately arbitrary and hide a good deal of within-class heterogeneity. By complexity I was referring to g-methods. Analysis of trajectories in and of themselves does not need that complexity. When you want to do causal inference with time-dependent confounding, the g-formula comes more into play.
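
On the request for information about smooth nonlinear modeling of trajectories, here is a minimal sketch in Python (using only NumPy) of one common approach: a restricted cubic spline in continuous time, fitted by pooled least squares to all available measurements. Everything in it is an assumption for illustration: the data are simulated, the names `rcs_basis`, `years`, and `tg` are hypothetical, and the knot positions and the shape of the true curve are arbitrary. It is not the method anyone in this thread has committed to, and it deliberately ignores within-person correlation.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis: cubic between knots, linear in the tails."""
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)
    pos3 = lambda u: np.clip(u, 0.0, None) ** 3   # truncated cubic (u)+^3
    span = k[-1] - k[-2]
    scale = (k[-1] - k[0]) ** 2                   # keeps columns on a similar scale
    cols = [x]                                    # linear term
    for kj in k[:-2]:
        term = (pos3(x - kj)
                - pos3(x - k[-2]) * (k[-1] - kj) / span
                + pos3(x - k[-1]) * (k[-2] - kj) / span)
        cols.append(term / scale)
    return np.column_stack(cols)

# Simulated (hypothetical) data: a variable number of examinations per person.
rng = np.random.default_rng(0)
n_people = 200
visits = rng.integers(1, 9, size=n_people)        # 1 to 8 examinations each
person = np.repeat(np.arange(n_people), visits)   # participant index per row
years = rng.uniform(0.0, 28.0, size=person.size)  # years since 1990
intercepts = rng.normal(0.0, 0.3, size=n_people)  # between-person variation
noise = rng.normal(0.0, 0.2, size=person.size)
tg = 1.5 + 0.4 * np.sin(years / 6.0) + intercepts[person] + noise

# Five knots at quantiles of the observed measurement times (an arbitrary choice).
knots = np.quantile(years, [0.05, 0.275, 0.5, 0.725, 0.95])
X = np.column_stack([np.ones(years.size), rcs_basis(years, knots)])

# Pooled least squares: every measurement is used, whatever the visit count.
beta, *_ = np.linalg.lstsq(X, tg, rcond=None)

# Evaluate the fitted mean trajectory on a grid of follow-up times.
grid = np.linspace(0.0, 28.0, 8)
fitted = np.column_stack([np.ones(grid.size), rcs_basis(grid, knots)]) @ beta
for t, f in zip(grid, fitted):
    print(f"year {t:5.1f}: fitted mean triglycerides {f:.2f}")
```

Participants with only one or two examinations still contribute rows to `X`, so nobody has to be dropped for having few measurements. In a real analysis the same spline terms would more plausibly go into a mixed model or GEE, so that the correlation among repeated measurements on the same person is accounted for.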

I will avoid using “classes” in future studies. I want to confirm whether you think g-methods are an accepted approach in this situation (sorry for asking again).

As I’ve stated previously, it depends entirely on your goal. If your primary goal is not causal inference from observational data, then I fail to understand the use of g-methods in this context.