I have a cohort of 2000 cancer patients with a thrombosis incidence of 7%. We want to investigate genetic factors associated with thrombosis. Since transcriptomic studies are very expensive, we have only been able to study the tumours of 50 patients with thrombosis and 50 patients without thrombosis. It is therefore a nested case-control design over the entire cohort, with a 1:1 ratio. The study shows that the expression of several genes is significantly associated with thrombosis (the algorithm uses t-tests adjusted for FDR). I could simply report box-and-whisker plots of each expression level, or even report odds ratios. However, a time-to-event analysis would be cooler, with the peculiarity that death is a competing event. The main problem for plotting this figure is that the prevalence in the cohort is 7%, but my sampling has brought it to 50%. Is it possible to estimate the cumulative incidence at fixed time points with this sampling?
My problem must be very common, since applying omics studies to entire cohorts is impossible. Any ideas?
I have done it with less, but it might mean a tabulation of CI estimates rather than a nice-looking figure (e.g. if you do a stacked plot including competing risk, event, and no event, the "no event" band takes up all the space). I guess there are bigger issues, e.g. how are you defining time-to-event?
Perhaps you can use inverse probability weighting techniques for this, which seem to be supported by some packages from a quick search.
But I was wondering what exactly it is that you want to show/calculate? Just the cumulative incidence rates at one or a few time points? Or do you want to generate some graphical results, e.g. a sort of Kaplan-Meier-like plot with curves stratified by the genetic variant?
Intuitively, I am somewhat worried about the power/sample size and the uncertainty in your estimates after weighting. I'm not an expert in the topic, but I know weighting costs statistical power and your sample size is not that large, which could be further compounded by the frequency of your genetic variants (i.e. although the cases have a higher frequency of the variants, this frequency might still be comparatively low).
You were interested in discovering genetic variants related to thrombosis risk in cancer patients, and you've succeeded in this. Maybe that's enough for your current study/project: you could now measure the specific variants you discovered in more people (hopefully at lower cost than measuring all the variants) and postpone the sort of calculations/visualizations you want to do to a future project?
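If it helps make the weighting idea concrete: under simple (unmatched) case-control sampling from a fully enumerated cohort, the weights are just inverse inclusion probabilities. The numbers below are taken from the thread (2000 patients, 7% incidence, 50 + 50 assayed), but the design assumption (all cases eligible, controls drawn once, no time-matching) is mine; with incidence-density (time-matched) sampling, the inclusion probabilities are more involved and this simple calculation no longer applies. A minimal sketch in Python:

```python
# Sampling weights for a simple (unmatched) nested case-control design.
# Cohort numbers are taken from the thread; the sampling-design
# assumption (no time-matching) is an illustration, not the OP's design.
n_cohort = 2000
n_cases = round(0.07 * n_cohort)       # ~140 patients with thrombosis
n_controls = n_cohort - n_cases        # 1860 without

sampled_cases, sampled_controls = 50, 50

# Each sampled subject stands in for 1/p cohort members,
# where p is that subject's probability of being included.
w_case = n_cases / sampled_cases           # 140 / 50 = 2.8
w_control = n_controls / sampled_controls  # 1860 / 50 = 37.2

# Sanity check: the weighted sample reconstructs the cohort size.
assert abs(sampled_cases * w_case + sampled_controls * w_control - n_cohort) < 1e-9
```

The point is that a sampled control carries a much larger weight (~37) than a sampled case (~2.8), which is exactly what undoes the artificial 50% prevalence, and also why the weighted estimates for the control stratum are noisy.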
I think that's if you wish to make causal claims, according to a paper @f2harrell tweeted some time ago: Analysis of time-to-event for observational studies: Guidance to the use of intensity models. Inverse probability weighting is discussed in section 5 regarding causal inference, where they suggest "causal questions are most natural and relevant for modifiable variables for which a hypothetical randomized study could, in principle, be done".
Thank you for your answers. The aim is mainly causal inference: to select promising genes associated with thrombosis to study in depth later with other techniques. For that I have used t-tests with FDR-adjusted p-values. For this purpose I could simply make a box-and-whisker plot of the expression levels of each gene, grouping cases (thrombosis) and controls (no thrombosis), with the p-values on top. However, for descriptive purposes, it would be cool to plot the estimated cumulative incidence curves over time, stratified by expression level. For example:
If I had the entire cohort, it would be simple to plot something similar, taking death into account as a competing event. However, since the sampling of the case-control study has altered the true incidence of thrombosis, the curve would have to be estimated accordingly. Perhaps stacked bars could be made at time points of interest, showing both events, but that would probably be uglier than an estimated incidence curve. In response to the other proposal, inverse probability weighting, which R library does that?
Inverse probability weighting methods can be used to estimate the absolute risk of the outcome in nested case-control studies, provided that you can correctly calculate the sampling probability.
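As a concrete sketch of what such a weighted estimator does (not a substitute for an established implementation, and written in Python/NumPy purely for illustration even though R packages exist for this), here is a weighted Aalen-Johansen cumulative incidence estimate. The weights would be the inverse sampling probabilities, and the event coding (0 = censored, 1 = thrombosis, 2 = death) is an assumption for the example:

```python
import numpy as np

def weighted_cif(time, event, weights, event_of_interest=1):
    """Weighted Aalen-Johansen estimate of the cumulative incidence
    function with competing risks.

    Event codes (assumed): 0 = censored, 1 = event of interest
    (thrombosis), 2 = competing event (death). `weights` are inverse
    sampling probabilities, e.g. ~2.8 for cases and ~37.2 for controls
    in the design discussed in this thread.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    weights = np.asarray(weights, dtype=float)

    n_at_risk = weights.sum()  # weighted number at risk just before t
    surv = 1.0                 # weighted all-cause event-free survival S(t-)
    cif = 0.0
    grid, values = [], []
    for t in np.unique(time):              # np.unique returns sorted times
        at_t = time == t
        d_interest = weights[at_t & (event == event_of_interest)].sum()
        d_any = weights[at_t & (event != 0)].sum()
        if n_at_risk > 0:
            cif += surv * d_interest / n_at_risk  # Aalen-Johansen increment
            surv *= 1.0 - d_any / n_at_risk       # KM step for any event
        grid.append(t)
        values.append(cif)
        n_at_risk -= weights[at_t].sum()   # subjects at t leave the risk set
    return np.array(grid), np.array(values)

# Toy data: with unit weights this reduces to the ordinary
# Aalen-Johansen estimator.
t, cif = weighted_cif([1, 2, 3, 4], [1, 1, 0, 2], [1, 1, 1, 1])
# cif -> [0.25, 0.5, 0.5, 0.5]
```

A subject with weight 3 behaves exactly like three identical weight-1 subjects, which is what lets the 50/50 sample mimic the 7%-incidence cohort. The hard part, as noted above, is getting the sampling probabilities right, especially under matched designs.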
However, you should be aware that in case-control studies controls are selected for one outcome. So, when you want to account for the competing event(s), or study two or more outcomes, this becomes more complex. At the time when I was reading extensively on this subject, I found the following two papers very useful:
- Nested case-control data utilized for multiple outcomes: a likelihood approach and alternatives
- Nested case-control studies: should one break the matching?

Or, if you can afford to change the sampling method:

- Nested Case-Control Studies in Cohorts with Competing Events
Of note, while this sampling would allow estimation of the cumulative incidence function, the estimated odds ratios approximate subdistribution hazard ratios. So, if your aim is causal inference, this method would not be optimal.