Analysis of data with various complications

I have a dataset to analyse and I am not sure how to deal with some of the complications in the data.

The data comprise about 4000 observations of patients (and nurses), with indicators of whether certain events E1-E15 occurred, and a number of other variables.

The Powers That Be want an estimate of the number of events per observation, overall and in various subgroups, as well as some analysis of which factors (e.g. patient age, nurse experience) might be associated with events. There are no specific hypotheses; this is descriptive/exploratory.


1. Observations have different types.

For type 1, only events E1-E9 and E15 are possible; the others are not applicable.

For type 2, only events E1-E5 and E7-E15 are possible; E6 is not applicable.

So type 1 observations have a maximum of 10 events, while type 2 observations have up to 14 (I think the actual maximum in the data is 4).

When possible, events E10-E12 occur much more frequently than other event types.
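
To make the bookkeeping concrete, here is a minimal Python sketch of counting events only over the event types that are applicable to each observation. The record layout (`type`, `events`) and the example rows are assumptions made up for illustration, not the real data.

```python
# Applicable event types per observation type, as described above.
APPLICABLE = {
    1: [f"E{i}" for i in range(1, 10)] + ["E15"],   # E1-E9 and E15 (10 types)
    2: [f"E{i}" for i in range(1, 16) if i != 6],   # E1-E5 and E7-E15 (14 types)
}

def summarise(obs):
    """Return (events observed, number of applicable event types).

    `obs` is a hypothetical record: {"type": 1 or 2, "events": set of names}.
    Events that are not applicable to the observation's type are ignored.
    """
    applicable = APPLICABLE[obs["type"]]
    n_events = sum(1 for e in applicable if e in obs["events"])
    return n_events, len(applicable)

def mean_events_by_type(rows):
    """Mean number of events per observation, separately for each type."""
    totals, counts = {}, {}
    for obs in rows:
        n_events, _ = summarise(obs)
        totals[obs["type"]] = totals.get(obs["type"], 0) + n_events
        counts[obs["type"]] = counts.get(obs["type"], 0) + 1
    return {t: totals[t] / counts[t] for t in totals}

# Made-up rows, purely illustrative.
rows = [
    {"type": 1, "events": {"E2"}},
    {"type": 1, "events": set()},
    {"type": 2, "events": {"E10", "E11", "E3"}},
    {"type": 2, "events": {"E12"}},
]

print(mean_events_by_type(rows))
```

Keeping the denominator (number of applicable event types) alongside the count makes it possible to report events per applicable event type as well as raw events per observation.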

2. Relative frequencies of type 1 and type 2 probably vary between subgroups of interest.

For example, type 2 is extremely rare in young patients, but is much more common with older patients.


Is an overall summary of events/observation meaningful?

Can I make valid comparisons of events/observation between subgroups where relative frequencies of type 1 and type 2 observations vary? How? Some sort of stratified analysis?
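
One form a stratified analysis could take is direct standardisation: compute the mean events/observation within each type, then average those type-specific means using the same reference type mix for every subgroup. The sketch below is only an illustration under assumptions: the field names (`group`, `type`, `n_events`), the 50/50 reference weights, and the rows are all made up.

```python
def stratum_means(rows):
    """Mean events per observation for each (subgroup, type) cell."""
    sums, counts = {}, {}
    for r in rows:
        key = (r["group"], r["type"])
        sums[key] = sums.get(key, 0) + r["n_events"]
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def crude_mean(rows, group):
    """Unadjusted mean events per observation in one subgroup."""
    values = [r["n_events"] for r in rows if r["group"] == group]
    return sum(values) / len(values)

def standardised_mean(rows, group, weights):
    """Average the type-specific means using a fixed reference type mix.

    Raises KeyError if the subgroup has no observations of some type,
    which is a real risk when one type is very rare in a subgroup.
    """
    means = stratum_means(rows)
    return sum(w * means[(group, t)] for t, w in weights.items())

# Made-up data: same type-specific means in both groups, different type mix.
rows = (
    [{"group": "young", "type": 1, "n_events": n} for n in (0, 1, 0, 1)]
    + [{"group": "young", "type": 2, "n_events": 2}]
    + [{"group": "old", "type": 1, "n_events": n} for n in (1, 0)]
    + [{"group": "old", "type": 2, "n_events": n} for n in (2, 2, 2)]
)
reference_mix = {1: 0.5, 2: 0.5}   # assumed reference weights

for g in ("young", "old"):
    print(g, crude_mean(rows, g), standardised_mean(rows, g, reference_mix))
```

In this toy example the crude means differ between the groups only because the type mix differs; the standardised means coincide. With very sparse cells (e.g. type 2 in young patients) the standardised estimate would be unstable, so reporting type-specific results separately may be the safer option.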

Additional complications

  • observations are not really independent: the ~4000 observations involve about 1300 patients and 300 nurses
  • event types may not be independent. I strongly suspect certain events are more likely to occur together
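
For the non-independence of observations, one rough option for descriptive summaries is a cluster bootstrap that resamples whole patients rather than single observations. The sketch below is a simplification under stated assumptions: it clusters on patient only and ignores the crossed nurse clustering, and the field names (`patient`, `n_events`) and rows are hypothetical.

```python
import random

def cluster_bootstrap_ci(rows, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for mean events/observation, resampling patients.

    Each bootstrap replicate draws patients with replacement and keeps
    all observations belonging to every drawn patient.
    """
    rng = random.Random(seed)
    by_patient = {}
    for r in rows:
        by_patient.setdefault(r["patient"], []).append(r["n_events"])
    patients = list(by_patient)

    stats = []
    for _ in range(n_boot):
        drawn = [rng.choice(patients) for _ in patients]
        values = [v for p in drawn for v in by_patient[p]]
        stats.append(sum(values) / len(values))
    stats.sort()

    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]

# Made-up rows: several observations per patient.
rows = [
    {"patient": "p1", "n_events": 0}, {"patient": "p1", "n_events": 1},
    {"patient": "p2", "n_events": 2}, {"patient": "p2", "n_events": 2},
    {"patient": "p3", "n_events": 0}, {"patient": "p4", "n_events": 1},
    {"patient": "p4", "n_events": 0}, {"patient": "p5", "n_events": 3},
]

low, high = cluster_bootstrap_ci(rows)
print(low, high)
```

A model with random effects for both patients and nurses would handle the crossed structure more directly; the bootstrap here is only meant to show why the interval should be built from patients, not from the ~4000 rows.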

Any advice welcome! So far I have been analysing as binary (1 or more events vs no events), but people are asking for total events.



I’m inclined to explore the connections between “no specific hypotheses” and “just asking questions”. @f2harrell sometimes warns against “torturing the data”; might pretending to approach data “without hypotheses” amount to gaslighting them?

As David alluded to, there are just too many questions here. I suggest getting the clinical experts to come up with a hierarchical ranking scheme for all the events and to create an ordinal outcome scale that records the worst category of event that occurred. Then use a single model to analyze outcome severity tendencies.
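
As a rough illustration of the worst-category idea, the sketch below derives an ordinal outcome from the set of events recorded for an observation. The severity ranks are entirely hypothetical placeholders; the real ranking would have to come from the clinical experts and cover all of E1-E15.

```python
# HYPOTHETICAL severity ranks (higher = worse), for illustration only.
# The actual ranking must be supplied by clinical experts.
SEVERITY = {"E1": 1, "E2": 1, "E3": 2, "E10": 3, "E11": 3, "E12": 4}

def worst_category(events):
    """Ordinal outcome: rank of the most severe event, or 0 if no event.

    Raises KeyError for an event with no assigned rank, so that gaps in
    the ranking scheme are noticed rather than silently ignored.
    """
    return max((SEVERITY[e] for e in events), default=0)

# Made-up observations.
observations = [set(), {"E1"}, {"E2", "E10"}, {"E3", "E12", "E1"}]
print([worst_category(ev) for ev in observations])
```

The resulting 0-to-4 score could then serve as the response in a single ordinal regression model, with observation type, patient age, and nurse experience as candidate covariates.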