Statistics Agent Skills for Large Language Models

I’ve recently been experimenting with AI ‘agents’, e.g., Claude Code or other LLMs via Opencode. In my opinion, they are very good at automating boilerplate code, especially data wrangling, plotting, and cleaning up code for packages, and in general far superior to copying and pasting from chat interfaces. Of course there are various downsides, but I assume that most researchers will adopt this in one form or another.

A very useful feature is ‘skills’: local instruction files (markdown) that the LLM accesses depending on the context of the current chat. These ‘skills’ can be very sophisticated and cover quite complex workflows. One could, for example, instruct the LLM to prefer certain analysis/modeling strategies over others, with very concrete examples, code, validation strategies, etc. However, the current statistics ‘skills’ that one can find online (here, here) are very generic, or even read like AI-generated slop, and mostly or exclusively use Python libraries, which for most non-ML applications are nowhere close to their R counterparts. This will of course not stop people from using them and increasing the output of subpar research.
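For readers unfamiliar with the mechanism: a skill is typically a folder containing a `SKILL.md` file whose short description the agent uses to decide when to load the full instructions. A minimal sketch of what a statistics skill along these lines could contain (the frontmatter fields and wording are illustrative, not an official template):

```markdown
---
name: regression-modeling
description: Use when fitting or validating regression models for inference or prediction in R.
---

# Regression modeling strategy

- Model continuous predictors flexibly (e.g., restricted cubic splines) rather than assuming linearity.
- Do not use stepwise variable selection; prespecify the model where possible.
- Validate models with bootstrap resampling and report optimism-corrected performance.
- Default to R (e.g., `rms`, `Hmisc`) rather than Python for non-ML analyses.
```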

I think it would be very worthwhile to create such ‘skills’, but ones that actually support the implementation of best practices, e.g., that suggest and implement strategies found in RMS by @f2harrell. It is probably also a good idea to provide such a ‘skill’ when releasing an R package, to make correct and widespread application more likely. Of course there are downsides: people will read the original sources less often, leading to worse understanding, less traffic to the websites hosting these books, etc. However, the alternative is that people use LLMs to produce worse analyses and still don’t read up on things.

These better ‘skills’ would of course not stop the increasing flood of articles that should not be published, but they might increase the quality somewhat. Curious to hear your thoughts.

6 Likes

I’ve also been experimenting with AI skills. I am currently focused on a complex clinical prediction model (CPM) project, so I created a skill to simulate a “CPM specialist”. I created a references sub-folder within the skill folder, in which I put the most important papers separated by topic (e.g., TRIPOD guidelines, missing data imputation, sample size calculation, etc.). I also imported R code shared in actual CPM papers into another folder, “code_references”. Finally, I created a SKILL.md describing what I expect from this skill, though I’m still improving it.
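For illustration, the folder layout described above might look something like this (the folder names are from the post; the individual files and sub-topics are hypothetical):

```
cpm-specialist/
├── SKILL.md              # what the agent should and should not do
├── references/
│   ├── reporting/        # e.g., TRIPOD guidelines
│   ├── missing_data/     # imputation papers
│   └── sample_size/      # sample size calculation papers
└── code_references/      # R code shared in published CPM papers
```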

It’s been an interesting experience, but I noticed it’s very imprudent to use such a skill without in-depth knowledge of the subject. You must always take everything the agent says with a grain of salt.

Assuming you have in-depth knowledge of the analysis at hand, I think it’s a good resource: it acts like a second brain and can provide interesting ideas.

5 Likes

I am glad to see this subject opened up. It saddens me to say this, but the statistical education of statisticians, and even more so of epidemiologists, is on average quite weak, to the point of still seeing stepwise regression used as a primary tool by many, and still seeing a belief in linear effects of predictors. So I think that developing trustworthy “skills” may be worth the trouble, as it is likely to improve researchers’ statistical practice almost by accident and painlessly, subject to the key caveat that, as @arthur_albuquerque said, “You must always take everything the agent says with a grain of salt.” On the other hand, LLMs are already above the average of statistical practice, so less-skilled researchers will benefit even without the “grain of salt.”

I’ve been saying that LLMs are ahead of someone with an MS in Statistics who does not engage in lifelong learning. I still believe that.

Another angle on this that is perhaps worth developing is a skills-acquisition set, i.e., an LLM applied-statistics teacher.

4 Likes

I just wanted to highlight the paper below, relevant to the discussion, in case you haven’t seen it. And I fully agree with @arthur_albuquerque on the need to be an expert in order to use these tools.

Dobler, D., Binder, H., Boulesteix, A., Igelmann, J., Köhler, D., Mansmann, U., Pauly, M., Scherag, A., Schmid, M., Al Tawil, A., & Weber, S. (2025). ChatGPT as a Tool for Biostatisticians: A Tutorial on Applications, Opportunities, and Limitations. Statistics in Medicine, 44(23–24), e70263. https://doi.org/10.1002/sim.70263

I have had mixed feelings about leveraging LLMs for statistical work. On one hand, they have improved my speed/efficiency (and, sometimes, laziness) on the more mundane things, often related to code writing, repetitive tasks, etc. Troubleshooting code/output has admittedly been easier with LLMs.

On the other hand, they fall short on statistical and logical reasoning. Suggestions for statistical approaches feel dated when they are not outright wrong. Sometimes, supplying relevant papers for context and then asking questions is more effective, but that mostly works in narrow situations. I’d welcome expert-curated content the LLM can draw from, though I don’t know whether that would necessarily result in better output without re-training the entire LLM from scratch.

I’m increasingly convinced (my evidence remains anecdotal) that LLM training corpora are saturated with statistical content that is outdated, flawed, or simply wrong. I also feel that these corpora over-represent machine learning and pure prediction applications at the expense of explanatory and inferential modelling, causal questions, study design, interpretation, etc. When we query an LLM about modelling, for example, we may be retrieving the accumulation of millions of tutorials, notebooks, posts, competitions, GitHub projects, etc. that never distinguished among those aims (or assumed the goal is always an ML prediction model). Often, when one does not provide any context, the output is Python, and that already says a lot about the implied approach.

9 Likes

I think this is the way to go.

As an independent researcher, my workflow for AI in biostats has been:

  1. Extensively read papers on topic XYZ
  2. Upload the most relevant ones to NotebookLM and discuss the topic with the LLM to better understand some details
  3. Write an initial SAP draft with help from NotebookLM
  4. Start writing analysis code on my PC
  5. Create an AI agent skill specialized in topic XYZ. One must create a “reference” folder specifically for each skill, to limit the scope of technical details to references you trust.
  6. Adjust analysis code in collaboration with the agent.
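The skill-creation step of this workflow can be sketched as shell commands (the paths and filenames are hypothetical; where skills live depends on the agent you use):

```shell
# Create a skill folder for topic XYZ; the location depends on your agent
# (Claude Code, for example, discovers project skills under .claude/skills/)
mkdir -p .claude/skills/xyz-analysis/reference

# At this point, copy only trusted papers and code into reference/ .

# Describe the expected behaviour in SKILL.md (illustrative content)
cat > .claude/skills/xyz-analysis/SKILL.md <<'EOF'
---
name: xyz-analysis
description: Use when writing or reviewing analysis code for topic XYZ.
---
Ground all methodological suggestions in the documents under reference/.
EOF
```

Keeping one narrowly scoped reference folder per skill, rather than one large shared pile of papers, is what limits the agent to sources you have actually vetted.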
2 Likes

In all seriousness, though: back in the old days, long before the current LLM bubble, it was a commonplace that “any doctor who can be replaced by a computer deserves to be.” Apart from @tspeidel, I’m not detecting any grave concern for the profession to temper all this enthusiasm.

It’s my vague recollection that Fisher and Box (e.g.) emphasized the role of the Statistician as a scientific consultant. Surely this aspect of statistical practice can’t be mechanized?

5 Likes

I think it’s more about coming to terms with the fact that the technology will not go away, and trying to steer its use in a way that might improve statistical practice. LLMs often suggest analyses that I think are very suboptimal or downright bad, and it’s a huge issue that everyone can now let some LLM-coded ML algorithm loose on their data without any understanding. But I highly doubt that pointing this out is going to change anything.

It’s also just very useful to have an LLM write code for you when you already know exactly what you want and it adheres to one’s preferred style, etc. In any case, I think the more you already know, the more you will profit, because you will easily spot the (frequent and sometimes severe) mistakes.

3 Likes

Tobacco once seemed utterly entrenched here in the U.S., backed by powerful interests and even with the complicity of Hollywood. Yet now it’s remarkable to see someone smoking in public.

1 Like

On the other hand, I posed two complex questions to ChatGPT and got pretty darn good answers. One was to survey unsupervised learning methods, and the other dealt with non-random dropout in a clinical trial. My experience with code writing, matrix calculus, and especially code translation is here.

2 Likes