Appropriate statistical tests for my study?

Background: Data analyst with an economics BA. The last time I performed a real statistical analysis was over a decade ago. My highest level of understanding of statistics is multiple regression/ANOVA.

Issue: I am uncertain which statistical test is appropriate, as many issues are in play. I believe the appropriate tests are beyond what I learned in college, so I would love verification that I am on the right track. One issue is that I believe there may be a lag between the predictor and its effect: the predictor may have happened in an earlier time frame than the effect I observe. Also, I am working with names and visit records, where the number of visits per person varies greatly. Because mental health is very random, as my boss argues, I am unsure whether I should perform the study at all. I am also unsure whether I should convert my data into rates, such as events/TotalVisits, so that it might fit other statistical tests.

Circumstance: The proposal for my study is due Friday, and my boss wants the methodology complete. I work for a crisis center.

Goal: Find out the relationship between two events: occurrence of dysregulation and participation in a workshop. My guess is that more workshop participation would lead to fewer occurrences of dysregulation. I also believe that workshops may have a lagged effect on dysregulation.

Data: VisitDate/Name/Dysregulation Event/WorkshopParticipation are the fields I believe would be used. Both dysregulation and workshop would be true or false for that date and person. Bob may visit only 2 times and Sally may visit 60+ times. Workshops are not always available. It is possible for both a dysregulation event and a workshop to happen on the same visit. I have over a decade’s worth of data, so for the sake of argument, let’s go with 1000 people.
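
To make the lag idea concrete, here is a minimal sketch of how visit-level records in this layout could be turned into "workshop exposure before this visit" features. The toy records, the function name `add_lagged_exposure`, and the output keys are all hypothetical and only illustrate the idea; it uses the Python standard library only.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy records: (Name, VisitDate, Dysregulation, Workshop)
visits = [
    ("Bob",   date(2020, 1, 5),  True,  False),
    ("Bob",   date(2020, 3, 1),  False, True),
    ("Sally", date(2020, 1, 2),  True,  True),
    ("Sally", date(2020, 1, 9),  False, False),
    ("Sally", date(2020, 1, 16), False, True),
    ("Sally", date(2020, 1, 23), True,  False),
]

def add_lagged_exposure(records):
    """Return one dict per visit, with workshop exposure measured strictly before that visit."""
    prior_count = defaultdict(int)  # workshops attended so far, per person
    last_workshop = {}              # date of the most recent workshop, per person
    rows = []
    # Process each person's visits in date order
    for name, day, dysreg, workshop in sorted(records, key=lambda r: (r[0], r[1])):
        last = last_workshop.get(name)
        rows.append({
            "name": name,
            "visit_date": day,
            "dysregulation": dysreg,
            "prior_workshops": prior_count[name],
            "days_since_workshop": (day - last).days if last is not None else None,
        })
        # Update after recording the row, so a same-day workshop
        # does not count as exposure for the same visit.
        if workshop:
            prior_count[name] += 1
            last_workshop[name] = day
    return rows

rows = add_lagged_exposure(visits)
for r in rows:
    print(r["name"], r["visit_date"], r["prior_workshops"], r["days_since_workshop"])
```

Whether a same-day workshop should count as exposure for that visit is a design choice; this sketch assumes it should not, because the order of the two events within a visit is not recorded.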

The tests I believe I could conduct are chi-squared and linear regression. The tests I’m guessing are actually more appropriate, but which I have not yet studied, are: cross-sectional regression, time series regression, meta-analysis, logistic regression, and longitudinal analysis. Once the study is approved, I will be able to self-study the appropriate statistical tests as needed. Any help is greatly appreciated.

I would recommend studying Paul Rosenbaum’s work on using an observational study to approximate an RCT. It would complement the recommendations of @f2harrell noted below.

Since you do not control the treatment allocation mechanism, you bear the burden of proof to demonstrate that no explanation other than the hypothesized treatment is plausible. You will need to think very hard about the potential biases and confounding variables in order to create plausibly exchangeable groups for comparison with the data you have available.

As for actual regression models, I’d give strong consideration to logistic models, with log linear probability models also worth considering. See this thread:

Some related threads and posts

  1. Regression Modelling Strategies by Frank Harrell
  2. Ordinal Models for clinical trial design by Frank Harrell
  3. Links to recent papers on design of observational studies from statistical POV:

Rosenbaum, P. R. (2015). How to see more in observational studies: Some new quasi-experimental devices. Annual Review of Statistics and Its Application, 2, 21-48.

Greenland, S. (2005). Multiple-bias modelling for analysis of observational data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 168, 267-306.

Branson, Z. (2021). Randomization Tests to Assess Covariate Balance When Designing and Analyzing Matched Datasets. Observational Studies, 7(2), 1-36. doi:10.1353/obs.2021.0031.
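
As a rough sketch of the logistic option mentioned earlier in this reply, and not a recommended final specification, a visit-level model with a lagged workshop term could be written as below. The symbols are assumptions introduced here for illustration: Y_ij is 1 if person i has a dysregulation event at visit j, W_i,j-1 is 1 if that person took part in a workshop at their previous visit, and x_ij stands for whatever confounders are available.

```latex
\operatorname{logit} \Pr(Y_{ij} = 1) = \beta_0 + \beta_1 W_{i,j-1} + \boldsymbol{\gamma}^{\top} x_{ij}
```

Under this sketch, a negative \beta_1 would correspond to lower odds of dysregulation at the visit after a workshop. As written, it ignores the fact that repeated visits by the same person are not independent.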



As for covariance adjustments, I think they might be over my head to execute, but you are absolutely right. I think I will just list this as a known flaw until I fully grasp how to execute the adjustments.

On a second note, thank you again for your advice on logistic or log-linear models. I am now wondering whether I should use log-log instead of log-linear, because each patient can have at most 1 workshop event and 1 dysregulation event per visit, so I don’t quite understand why I should sum the count of workshops per patient rather than converting it into a rate as well. I will definitely ponder this more, as I am very rusty on statistics.
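
One way to think through the rate question is a small sketch of what a bare rate keeps and what it drops. The tallies and the helper `rate_with_se` are hypothetical, and the binomial standard error assumes visits are independent, which is probably not true for repeat visitors.

```python
# Hypothetical per-person tallies: (name, total visits, dysregulation events)
people = [("Bob", 2, 1), ("Sally", 60, 30)]

def rate_with_se(events, visits):
    """Event rate per visit and its binomial standard error (assumes independent visits)."""
    p = events / visits
    se = (p * (1 - p) / visits) ** 0.5
    return p, se

summary = {name: rate_with_se(events, visits) for name, visits, events in people}
for name, (p, se) in summary.items():
    print(f"{name}: rate={p:.2f}, se={se:.3f}")
```

Both people end up with the same rate, but Bob's is far less certain because it rests on 2 visits rather than 60. A rate on its own drops that information, whereas keeping the visit count alongside it, or keeping one row per visit with a true/false outcome, retains it.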