Standards/guideline for data handling

Has anyone developed a standard operating procedure or guideline for handling data at their ARO that they could share? Or maybe you have particular advice. I’m currently developing a guideline for ensuring data integrity. There are many common-sense things, e.g.:

-version control, making datasets read-only
-folder structure? This varies from one ARO to another but the motivation is the same, i.e. don’t accidentally pick up old data or code: archiving datasets/code
-saving SAS log files for documentation
-validating import/export of data (importing is not always straightforward)
-basic data checks, e.g. look for duplicates, missing data (create a flow chart to summarise N -> n), comparing against old data if they exist
-reviewing variable formats and labels
-code that follows programming standards (e.g. points to sources/references), programs are run in order
-limit access to data
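
A minimal sketch of the duplicate and missing-data checks from the list above, in Python using only the standard library. The column names, the sample rows and the `basic_checks` helper are all made up for illustration; a real SOP would define its own rules for what counts as a duplicate or a missing value.

```python
import csv
import io
from collections import Counter


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def basic_checks(rows, id_field):
    """Summarise duplicate IDs, missing values and the N -> n record flow."""
    id_counts = Counter(row[id_field] for row in rows)
    duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)

    # Count missing cells per column across all records read.
    missing_by_field = Counter()
    for row in rows:
        for field, value in row.items():
            if is_missing(value):
                missing_by_field[field] += 1

    # Keep the first record per ID, then keep only complete records.
    seen = set()
    first_occurrences = []
    for row in rows:
        if row[id_field] not in seen:
            seen.add(row[id_field])
            first_occurrences.append(row)
    complete = [
        r for r in first_occurrences
        if not any(is_missing(v) for v in r.values())
    ]

    return {
        "n_read": len(rows),
        "n_unique_ids": len(first_occurrences),
        "n_complete": len(complete),
        "duplicate_ids": duplicate_ids,
        "missing_by_field": dict(missing_by_field),
    }


# Hypothetical export; everything is read as text so "001" keeps its zeros.
SAMPLE_CSV = """patient_id,age,weight_kg
001,34,70.5
002,,82.0
002,,82.0
003,51,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = basic_checks(rows, "patient_id")
print(report)
```

The three counts (`n_read`, `n_unique_ids`, `n_complete`) are the numbers that would go into the N -> n flow chart. Saving the printed report alongside the dataset gives something to compare against when a new data cut arrives.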

I cannot find example guidelines online. There are guidelines for analysis, but I’m concerned only with data integrity: getting the data into the software and keeping it intact. I’d like something that would survive a hypothetical audit by a finicky ‘worst case scenario’ type QA expert. The main concern is that the statistician is asked to return to an analysis after years have passed (which often happens) and cannot easily discern the state of things or even reproduce their own numbers.
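
One way to make "the state of things" checkable years later is to freeze a folder: record a checksum for every file, make the files read-only, and re-verify before any re-analysis. The sketch below is only an illustration of that idea in Python; the function names `freeze` and `verify` and the file name `MANIFEST.json` are assumptions, not part of any existing guideline or tool.

```python
import hashlib
import json
import stat
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def freeze(folder, manifest_name="MANIFEST.json"):
    """Record a checksum for every file under folder and make each read-only."""
    folder = Path(folder)
    manifest = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            manifest[path.relative_to(folder).as_posix()] = sha256_of_file(path)
            # Drop write permission for user, group and others.
            mode = path.stat().st_mode
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    (folder / manifest_name).write_text(json.dumps(manifest, indent=2))
    return manifest


def verify(folder, manifest_name="MANIFEST.json"):
    """Return the recorded files that are now missing or have changed."""
    folder = Path(folder)
    manifest = json.loads((folder / manifest_name).read_text())
    changed = []
    for rel_path, expected in manifest.items():
        path = folder / rel_path
        if not path.is_file() or sha256_of_file(path) != expected:
            changed.append(rel_path)
    return changed
```

Usage would be something like `freeze("study_x/data_lock")` at the time of the analysis and `verify("study_x/data_lock")` when returning to it (the path is hypothetical). An empty list from `verify` means every recorded dataset, program and log is byte-for-byte what it was. The read-only flag only guards against accidents; it does not replace proper access control.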

Excellent question, which I’ll follow for the insights it provokes. I can’t point you to complete guidelines, but here are a few resources:

  1. A cautionary tale that might be useful in promoting the local adoption of your SOP may be provided by this JAMA Retraction & Replacement, necessitated by “errors … due to failure to update results from an earlier set of models.” SAS (which you mention) was used for that analysis.

  2. Jenny Bryan has unsettled my own thinking (originating in my software-engineering background—a factor she discusses in this tweet BTW) with a section of her online book Happy Git and GitHub for the useR, titled Get over your hang ups re: committing derived products.

  3. I’m going to guess that (as a rule) pharmacometricians have toolchains with a few more software products than statisticians. Thus, they ‘have it worse’ with regard to this reproducibility problem, and the solutions they have developed might be worth comparing and learning from. There’s even an SxP SIG (jointly ASA & ISoP) that attempts to achieve some ecumenicism between these ‘camps’. Pharmacometricians to connect with on Twitter on such matters include @JustinWilkins, @MikeKSmith and @VijayIvaturi.

Some things I am aware of off the top of my head:

The American Statistical Association Reproducibility Guidelines (not much so far but it is a start)
https://www.amstat.org/asa/News/ASA-Develops-Reproducible-Research-Recommendations.aspx

You might find the materials here useful for training

If I come across any more, I’ll update this post.

See also http://www.nature.com/news/how-quality-control-could-save-your-science-1.19223?WT.mc_id=SFB_NNEWS_1508_RHBox

Excellent! I will look at these resources and report back. Cheers

We can speculate on the level of code review they did for that JAMA analysis; my guess is: as good as none. It’s shocking and depressing. The scientific method is a method; it promotes scepticism and diligence etc. Some people have no knack for it, no instinct or inclination for it, and the qualifications mean nothing. Who doesn’t have a PhD in science nowadays?

The book “Statistical Data Cleaning with Applications in R” (Wiley, 2018) by Mark van der Loo and Edwin de Jonge discusses the “statistical value chain”, and has several relevant chapters.

There is an online companion website for the book: www.data-cleaning.org with R code.

Possible considerations:
W3C provenance
Digital signature
Databases with journaling (all previous states of a record are kept)
FDA guidelines on data handling for medical devices (although not specific to your purpose)
ISACA or IIA (audit associations) guidelines on data integrity (although not specific to your purpose)
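
To illustrate the journaling idea (all previous states of a record are kept), here is a rough sketch using SQLite through Python's built-in `sqlite3` module. The table and column names are hypothetical, and a history table filled by a trigger is only one of several ways a database can do this; it is not a description of any particular product's journaling feature.

```python
import sqlite3

# Hypothetical schema: one table of current values and one history table.
SCHEMA = """
CREATE TABLE measurement (
    patient_id TEXT PRIMARY KEY,
    weight_kg  REAL
);
CREATE TABLE measurement_history (
    patient_id  TEXT,
    weight_kg   REAL,
    replaced_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Whenever a row is updated, copy its previous state into the history table.
CREATE TRIGGER keep_old_state AFTER UPDATE ON measurement
BEGIN
    INSERT INTO measurement_history (patient_id, weight_kg)
    VALUES (OLD.patient_id, OLD.weight_kg);
END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Enter a value, then correct it twice.
conn.execute("INSERT INTO measurement VALUES ('001', 70.5)")
conn.execute("UPDATE measurement SET weight_kg = 75.0 WHERE patient_id = '001'")
conn.execute("UPDATE measurement SET weight_kg = 74.8 WHERE patient_id = '001'")

current = conn.execute("SELECT weight_kg FROM measurement").fetchall()
history = conn.execute(
    "SELECT patient_id, weight_kg FROM measurement_history ORDER BY rowid"
).fetchall()
print(current)  # the latest value only
print(history)  # every earlier value, oldest first
```

This sketch covers updates only; deletions would need a similar `AFTER DELETE` trigger, and a real audit trail would normally also record who made the change and why.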

Have you heard about the FAIR data principles: https://www.dtls.nl/fair-data/fair-data/

I’m not sure if this is what you want, as it is focussed on open reusable data, but it probably ticks a lot of your boxes above too.

This looks like a great initiative. I’ll look into it immediately, cheers
