Database for outcome data from RCTs in multiple conditions

Hi everyone, I’m a new member of the forum and a recently qualified medical doctor with a passion for research, looking to build a career in clinical psychiatric research.

I am currently working on multiple systematic reviews and meta-analyses of pharmacotherapy for various anxiety-related conditions.

We would like to build a comprehensive database of all RCTs conducted to date in anxiety spectrum disorders; however, I am struggling to get my head around the best way to do this. Ideally, one would extract all relevant outcome data for all treatments in all conditions and store it in a master database, both for future data mining and for ongoing updating as new RCTs are conducted and published.

The issue is that each condition may have 4 or 5 different outcome measures, sometimes up to 15. There are about 5–10 conditions and in excess of 50 different drugs, spanning roughly 10 or more drug classes. A spreadsheet containing every permutation (for example, columns for the active-group n, mean and sd of a single outcome for a single drug in a single condition) would need number of columns = conditions × drugs × outcomes × 9 (n, mean and sd for treatment and control, plus the overall effect size and the upper and lower CI bounds). So I would require tens of thousands of columns to keep all of the information separated in a way that might be useful for future data mining and analysis.
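For what it's worth, the column explosion described above usually disappears if each (study, condition, drug, outcome) combination becomes a *row* rather than a block of columns. A minimal sketch in Python below illustrates this — all study names, drugs and numbers are invented placeholders, and `subset` is a hypothetical helper, not an existing library function:

```python
# Long-format layout: one row per study/condition/drug/outcome combination,
# with the nine summary statistics as fixed fields. The column count stays
# at ~13 no matter how many drugs or scales are added later.
rows = [
    {"study": "Smith2010", "condition": "GAD", "drug": "sertraline",
     "scale": "HAM-A",
     "n_tx": 52, "mean_tx": 14.2, "sd_tx": 5.1,
     "n_ctl": 50, "mean_ctl": 17.8, "sd_ctl": 5.6,
     "es": -0.67, "ci_low": -1.07, "ci_high": -0.27},
    {"study": "Jones2014", "condition": "PTSD", "drug": "paroxetine",
     "scale": "CAPS",
     "n_tx": 40, "mean_tx": 55.0, "sd_tx": 12.0,
     "n_ctl": 41, "mean_ctl": 62.0, "sd_ctl": 13.0,
     "es": -0.55, "ci_low": -0.99, "ci_high": -0.11},
]

def subset(rows, **criteria):
    """Return all rows matching the given field values,
    e.g. every study of drug x in condition y on scale z."""
    return [r for r in rows
            if all(r[k] == v for k, v in criteria.items())]

gad_sertraline = subset(rows, condition="GAD", drug="sertraline")
```

The same filter works for any future condition/drug/scale combination without adding a single column — which is exactly the query pattern described later in the thread.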

I was wondering if there is another software option available, such as a database manager or something along those lines that I could perhaps look into?

Alternatively if anyone has some advice or suggestions around building this database it would be much appreciated.

Another alternative is to not include outcome data in the database and instead record only study identifying data, the conditions studied, the treatments used and a risk-of-bias assessment. This would require far fewer fields, but would ultimately be less useful than a database that collated outcome data.

Thank you for reading and for any input.


Hi Jacob. Welcome to datamethods. I hope you get some useful responses from those in the know. An initial thought is that it will be very tough to amalgamate data across such studies because of the lack of concordance in how outcomes are measured. It may be the case that getting access to raw data on two of the better, larger studies would be more valuable than meta-analysis. One of many advantages of that is that you can do proper analysis. Many studies in psychiatry use misleading change-from-baseline patient scores. Problems with this are detailed at .


Hadley Wickham has a well-conceived and explained view of issues of structuring data for analysis, which he calls Tidy Data. If you have Tidy Data, then you can easily use the Tidyverse collection of data manipulation tools in R. HIGHLY RECOMMENDED.
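To make the Tidy Data idea concrete, here is a toy sketch (my own invented example, not from the paper) of "melting" a wide record with one column per scale/statistic into long form, where each row is one observation:

```python
# Wide record: one column per outcome-scale x statistic pair.
wide = {"study": "Smith2010",
        "HAM-A_mean": 14.2, "HAM-A_sd": 5.1,
        "CGI_mean": 3.1, "CGI_sd": 1.0}

# Tidy/long form: columns are variables (study, scale, stat, value),
# rows are observations.
long_rows = []
for key, value in wide.items():
    if key == "study":
        continue
    scale, stat = key.rsplit("_", 1)   # e.g. "HAM-A", "mean"
    long_rows.append({"study": wide["study"],
                      "scale": scale, "stat": stat, "value": value})
```

In R this is what `tidyr::pivot_longer` does for you; the point is the target shape, not the tool.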


link to paper:


Unless I misunderstand your description, what you’re asking for is clearly impossible, i.e. you can never access “all RCTs conducted to date in anxiety spectrum disorders” to build a “comprehensive database”. For so many reasons that is completely impossible. What am I missing?

This is a worthwhile read. It assumes that you have a unified set of variables across observations though.


Thank you for the useful resources and comments. I will respond in full tomorrow.


Hi @pau13rown

Thank you for your comment/concerns. Perhaps I was not explaining myself clearly, possibly because I don’t fully understand what we’re trying to achieve or the limitations of the software that we’re working with. Let me try to explain our aims and rationale in more detail.

The research team I am working with has already conducted a number of meta-analyses of the efficacy and acceptability of pharmacotherapy for several anxiety spectrum disorders. Each of these meta-analyses required pooling outcome data from a number of RCTs, identified through systematic database searches and screened for inclusion or exclusion on a number of criteria.

The database that we are trying to build would include the information from all of the previous meta-analyses as well as from future RCTs as they are published. Ultimately we would like to expand it to include psychotherapeutic interventions as well. In psychiatric research this gets quite complicated due to the overlap between conditions, changes to diagnostic criteria over the years, the many different drugs and dosing strategies across RCTs, the multiple outcome scales, etc.

When conducting a single meta-analysis, data extraction is relatively straightforward and can easily be done in Excel. However, as the number of conditions, drugs and outcome scales grows, the spreadsheet becomes bloated with “placeholder” columns, because every combination of condition, drug and outcome scale must be kept separate in order to run statistics across, for example, all studies that investigated drug x in condition y on outcome scale z. Perhaps I am missing something and there is a way of doing this in Excel. Perhaps, too, the aim of combining all of this data in one dataset/database is too ambitious or impractical.

Forming a database of the RCTs that doesn’t include specific outcome data would also be fairly straightforward: simply create a spreadsheet with one study per row that includes study identifying data, which drugs were studied, which outcome scales were used, etc., without the n, mean, sd for each outcome. However, I question the utility of such a database, as it would essentially be a subset of the large biomedical research databases anyway, and I struggle to see how it would add value without the outcome data.

I spoke to a few friends with a background in IT, and they suggested I look at storing the trial data as JSON objects and using SQL to query the database. So I am looking into that and working on a way to represent each trial as a single JSON object. If that is possible, then perhaps an intuitive user interface could be developed for entering and retrieving the information, perhaps exporting subsets of the database to an Excel spreadsheet or CSV for further analysis. For these future analyses there would be methodological concerns regarding heterogeneity across studies, and ultimately the validity of any further analysis would depend on the consistency of the parameters set for entering an RCT into the database in the first place.
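For anyone curious what the JSON-plus-SQL suggestion might look like in practice, here is a minimal sketch using SQLite's built-in JSON functions (available in any recent SQLite build). Every trial field and value is an invented placeholder:

```python
import json
import sqlite3

# Each trial stored as one JSON document; SQL filters on fields inside it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (id INTEGER PRIMARY KEY, doc TEXT)")

trial = {
    "study": "Smith2010",
    "condition": "GAD",
    "arms": [
        {"drug": "sertraline", "n": 52,
         "outcomes": [{"scale": "HAM-A", "mean": 14.2, "sd": 5.1}]},
        {"drug": "placebo", "n": 50,
         "outcomes": [{"scale": "HAM-A", "mean": 17.8, "sd": 5.6}]},
    ],
}
conn.execute("INSERT INTO trials (doc) VALUES (?)", (json.dumps(trial),))

# Pull every GAD trial; the nested structure survives the round trip.
found = conn.execute(
    "SELECT doc FROM trials WHERE json_extract(doc, '$.condition') = 'GAD'"
).fetchall()
gad_trials = [json.loads(r[0]) for r in found]
```

One design caveat: a document store like this makes whole-trial entry easy, but cross-trial queries ("all HAM-A arms across all studies") are simpler in a normalised relational layout, which others in the thread recommend.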

We are only 1 month into this while also writing an updated review, so we are essentially still in a brainstorming phase and will value any input or criticism. I would be interested to hear your specific concerns and what your “many reasons” are as to why this is completely impossible. I’m always on the lookout for more factors to consider.

Thank you for your input and hope to hear back soon.

Hi @f2harrell

Thank you for your comments and suggestion. One of the main problems we are facing as you have pointed out is the sheer number and lack of consistency in outcome scales used and the manner in which they are reported. For many conditions there is a paucity of large, well-designed trials, which makes your suggestion of analysing data from two larger studies quite difficult. Conducting meta-analyses on a number of small trials is not without methodological concerns, however we are explicit about the shortcomings and limitations of the body of evidence and address this with recommendations for future research.

Thank you as well for the link, I will have a look at it today.


Hi @Doug_Dame

Thank you for your comments, I will have a look at the paper today! I haven’t used R before but am keen to learn, as I know it’s a powerful tool.


I see. When you say “outcome data” you mean summary statistics? I thought you meant the outcome for every patient in the trial.

In that case it is more manageable. I’m not much of an IT person, but how about REDCap? It used to be free; unfortunately, I don’t think it is anymore.

It sounds like what you need is a relational database, so I think the Tidy Data paper is an appropriate introduction. There are approaches to data management that are designed to be flexible to the kinds of problems you describe (e.g. unanticipated outcomes or designs).

Don’t focus too much on the software; that paper is really about the structure of data. You can apply its principles in any database software (REDCap, Access), and any of these can be supplemented with a nice interface, forms, etc. Excel is not, strictly speaking, appropriate, but you could force it to work. JSON is just a way to store data about objects; it isn’t dedicated to database work, but it could be the workhorse of a custom solution. I would highly recommend:

  1. Using a dedicated database software
  2. Having it built by someone who understands how to build a database. Our clinical trial center has a service for this and I imagine you could find something similar. It won’t be cheap but that’s because it is hard to get right.
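To give a feel for what a relational layout buys you, here is a minimal sketch of one possible normalised schema (studies, arms and outcome measurements in separate tables linked by keys). The table names, column names and data are all illustrative assumptions, not a prescribed design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE studies  (study_id INTEGER PRIMARY KEY,
                       label TEXT, condition TEXT);
CREATE TABLE arms     (arm_id INTEGER PRIMARY KEY,
                       study_id INTEGER REFERENCES studies,
                       drug TEXT, n INTEGER);
CREATE TABLE outcomes (arm_id INTEGER REFERENCES arms,
                       scale TEXT, mean REAL, sd REAL);
""")
conn.execute("INSERT INTO studies VALUES (1, 'Smith2010', 'GAD')")
conn.execute("INSERT INTO arms VALUES (1, 1, 'sertraline', 52)")
conn.execute("INSERT INTO arms VALUES (2, 1, 'placebo', 50)")
conn.execute("INSERT INTO outcomes VALUES (1, 'HAM-A', 14.2, 5.1)")
conn.execute("INSERT INTO outcomes VALUES (2, 'HAM-A', 17.8, 5.6)")

# "All arms measured on HAM-A in GAD studies" is a simple join — no
# placeholder columns, however many drugs or scales are added later.
result = conn.execute("""
    SELECT s.label, a.drug, o.mean, o.sd
    FROM studies s JOIN arms a USING (study_id)
                   JOIN outcomes o USING (arm_id)
    WHERE s.condition = 'GAD' AND o.scale = 'HAM-A'
""").fetchall()
```

Adding a new drug, scale or condition is just new rows, which is why this shape scales where the wide spreadsheet does not.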

Also, just a word of caution that what you’re proposing is a lot more difficult in practice than it sounds when you’re sitting around a table. You’ll need to approach it as seriously as any other program of research if you want it to be useful.

From an analysis standpoint, I would also caution you against falling into the trap of leveraging this database in the hopes of providing constantly updating meta-analyses unless you consider all the problems for inference that come along with that. Consider looking at papers discussing methodological considerations for living meta-analysis if you anticipate this to be part of your plan.


I’ve been thinking quite a bit about the challenges of meta-analysis for clinically heterogeneous conditions in the context of neurological rehabilitation (with CVA as a prime example), which I am organizing into a separate post for expert input. Much of it should apply to your project.

Pay particular attention to the problem Dr. Harrell mentioned regarding change scores. At least in that case, you should be able to recover the data that the primary researchers reported and correct it.

But the rate of statistical errors in the published literature is much higher than you might expect. These errors can even make a paper more publishable, because alternative methods of analysis are less well known. This is an important source of bias in the studies you do find.

The Missing Medians: Exclusion of Ordinal Data from Meta-Analyses

These reporting considerations have important implications for meta-analysis. Where ordinal data are reported appropriately in individual studies, they are often excluded from meta-analysis due to the difficulty in pooling them. Alternatively, where study authors report means and standard deviations, often inappropriately, these data can be included in meta-analysis but the validity of the pooled results is questionable. Meta-analytical results are heavily influenced by treatment of outliers and by parametric versus non-parametric estimation [5]. The Cochrane collaboration acknowledge the problem with meta-analysis of ordinal or non-parametric data in their handbook (“difficulties will be encountered if studies have summarised their results using medians”, section 9.2.4[2]), but do not propose a solution. In practice, investigators often dichotomise data from shorter ordinal scales, and treat data from longer ordinal scales as continuous. Both of these approaches are sub-optimal. Dichotomising scales necessitates a loss of detail, and participants close to but on opposite sides of the split are characterised as very different rather than very similar. Statistical power is lost: a median split has been equated to discarding one-third of the data [6]. Treating data as continuous implies a consistent relationship between each level of the scale, which is not true of ordinal scales, and assumptions of normality are often violated. In the context of meta-analysis, it may be argued that, due to central limit theorem, mean values across a group of studies (and hence mean differences) will be approximately normally distributed, rendering any concerns about violation of normality invalid. Although this may be true, [for estimates with a finite variance – my emphasis] it fails to acknowledge that it is inappropriate to use means as a measure of central tendency for scales where we know only the order of levels on the scale, and not the distance between them.

This poses a tough question for scholars without access to the individual patient data: to what extent can the clinical observations (summarized by inappropriate use of parametric effect sizes) be converted to a form in which a defensible analysis can be performed?
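One commonly used (and imperfect) partial answer is to approximate a mean and SD from a reported median and quartiles so the study can at least enter a conventional meta-analysis, e.g. the estimators of Wan et al. (2014, BMC Medical Research Methodology). A sketch, with the caveat that this is an approximation whose validity degrades for skewed ordinal data — it does not resolve the objection quoted above:

```python
from statistics import NormalDist

def mean_sd_from_quartiles(q1, median, q3, n):
    """Approximate a mean and SD from a reported median and quartiles
    (Wan et al. 2014). For large n the SD estimate approaches IQR/1.35."""
    mean = (q1 + median + q3) / 3.0
    # Quartile-based SD estimator with a small-sample correction.
    z = NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    sd = (q3 - q1) / (2.0 * z)
    return mean, sd
```

Even where this conversion is defensible numerically, the quoted passage's deeper objection stands: a mean is still the wrong summary when only the ordering of scale levels, not the distances between them, is meaningful.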

While all of the references in a thread I’ve started on this topic should be helpful, the links in this post should be examined first, so you know what you are facing. A good meta-analysis is not easy.


Yes, sorry, summary statistics is correct. Thanks, I will look into REDCap as well. I haven’t used database software before, so I’m looking forward to making a start!

Hi @R_cubed. Thanks for your comments. I will certainly look through your other threads.