Analysing large registry databases and "insufficient memory"

the task: analysing patient registry data (millions of patients)

the problem: the data are so large that the stats software complains of “insufficient” memory when attempting to fit ‘complex’ models (i’ll leave out the detail)

the solution: buy more RAM, but i’m already up to 64 GB. Anything else?

some background info:
-there are plenty of tips online re making code more efficient, limiting CPU time, data storage etc.

-this guy complains of the same problem: “PROC PHREG : avec fragilité (effet aléatoire) + processus de comptage” [i.e. PROC PHREG with frailty (random effect) + counting process; ultimately they suggest buying more RAM]

-a potential solution?: Biom J, Analyzing Large Datasets with Bootstrap Penalization [they claim to make their code available but, as usual, the link doesn’t work]

-maybe the best/simplest option: Analyzing Big Data in Psychology: A Split/Analyze/Meta-Analyze Approach (SAM)

has anyone implemented this before? i guess this is a problem statisticians encounter, unless we limit ourselves to the crudest models? although it is not raised when discussing issues re registry data: Analysis, Interpretation, and Reporting of Registry Data To Evaluate Outcomes. i cannot see any references to the SAM approach in the medical literature, and i would have to let my computer run for weeks

edit: i would have to monitor progress and flag those sub-samples where the model failed to converge, a further complication, especially if it takes weeks to run
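
edit 2: to make the SAM/chunking idea concrete, this is roughly what i have in mind, as an untested R sketch (i’d translate it to sas; the cox frailty model, the variable names, the number of sub-samples and the simple fixed-effect pooling are placeholders only):

```r
# split / analyse / meta-analyse, flagging sub-samples that fail to converge
library(survival)

k <- 20                                               # number of sub-samples
reg$chunk <- sample(rep_len(seq_len(k), nrow(reg)))   # random non-overlapping split

fit_one <- function(d) {
  flagged <- FALSE
  fit <- tryCatch(
    withCallingHandlers(
      coxph(Surv(time, status) ~ age + sex + frailty(centre), data = d),
      warning = function(w) { flagged <<- TRUE; invokeRestart("muffleWarning") }
    ),
    error = function(e) { flagged <<- TRUE; NULL }
  )
  list(fit = fit, flagged = flagged)
}

res <- lapply(split(reg, reg$chunk), fit_one)   # parallel::mclapply() could run chunks in parallel

# inverse-variance (fixed-effect) pooling of one coefficient, dropping the flagged sub-samples
ok   <- !vapply(res, function(r) r$flagged, logical(1))
beta <- vapply(res[ok], function(r) coef(r$fit)[["age"]], numeric(1))
se   <- vapply(res[ok], function(r) sqrt(vcov(r$fit)["age", "age"]), numeric(1))
w    <- 1 / se^2
c(pooled = sum(w * beta) / sum(w), se = sqrt(1 / sum(w)), failed = sum(!ok))
```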

cheers

1 Like

You don’t say what software you’re using, but I think there are six general strategies for dealing with Big Data problems, which I define as “more data and/or model than your normal computing platform can handle, without funky/painful workarounds.”

In approximate order of ease:

  1. reduce dataset width
  2. reduce dataset height
  3. chunk in pieces
  4. expand hardware capacity
  5. simplify computations
  6. find more efficient software (absolute last resort, for me)

These are not mutually exclusive, of course.

Some more specific ideas:

  1. reduce dataset width

    • consult with SME(s) (subject-matter experts)
    • drop large groups of variables included just because they’re available, if there’s no history supporting their usefulness
    • if categorical variables have many levels, consider
      • combining or ignoring low frequency levels … I’ve seen 1% and 0.5% used as thresholds
      • “impact coding” - basically assigning Naive Bayes weights so one categorical variable feeds into the models as only one numeric variable (rough sketch after this list)
        • see Zumel & Mount on the Win-Vector Blog
    • 2-step analysis
      • Step 1: find top ## predictors, possibly using a sample of the data
      • Step 2: fit a good model on the full dataset with the reduced set of variables, using your normal approach
      • Random Forests, because they randomly sample both rows and columns, are pretty good at identifying the top 20-50 predictors in a dataset
        • look at the variable importance measures (a rough screening sketch follows this list)
      • it helps here to have a mindset that there isn’t ONE (for sure) BEST model; there are many possible models that fundamentally are very similar. (Especially once you get past the Top 20 Predictors.)
    • PCA
    • recursive feature elimination (“RFE”)
      • in R, the caret package by Max Kuhn does this (also covered in the screening sketch after this list)
      • I confess I’m fuzzy on why RFE is less bad than traditional (discredited) stepwise approaches
  2. reduce dataset height

    • you have a lot of data … can you run on a 70% / 50% / 25% etc sample?
    • Mo’ Data is not (necessarily) the same as More Information
    • stratify … are there natural subgroups that may have different data relationships?
    • e.g., females vs males - if you were going to interact sex with some variables, and you have a lot of data, maybe running Sex as a By-group is easier
    • in SAS, by-group processing is easy. (And you can INDEX on the by-var, rather than physically sorting a huge dataset.)
      • depending on your software of choice, may need to create physical or virtual subset copies of your BigDataset.
  3. chunk in pieces

    • stratification as listed above also fits here
    • model averaging - two main types
      (1) “bagging”, aka bootstrap aggregating
        • most uses of bootstrapping for model/test-statistic evaluation replicate the size of the original dataset … but that isn’t necessarily a requirement if your intent is to do model averaging
      (2) random non-overlapping partitions … 2/3/4/5/10 chunks, depending on the height and width of your data (i.e., be aware of what you’re doing to EPV, events per variable)
      • For either of these, you could do feature selection/reduction either OUTSIDE THE LOOP or INSIDE
        • however, if the core reason you’re doing model averaging is just too much data, OUTSIDE the loop variable selection makes more sense
  4. expand hardware capacity

    • is it feasible to port the problem to the cloud, where you can rent a huge virtual-server for only as long as you need it?
    • I’ll mention parallel processing here … if you can chunk the problem into discrete, manageable parts, you can probably cut the wall-clock time substantially by running the chunks in parallel
  5. simplify computations

    • I’ve seen people run OLS on logistic problems, and argue (sometimes with data) that the distortions weren’t large for the specific thing they were doing
      • hmmmm … but what comes to mind is that all models are flawed, yet some are useful … caveat emptor
    • but different algorithms can place very different memory burdens on the machine, so it’s worth checking whether an alternative fitting routine eases the load
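
For the “impact coding” bullet in item 1, here is a hand-rolled sketch of the idea for a binary outcome; Zumel & Mount’s vtreat package is, I believe, the polished version of this. The data frame `dat`, the outcome `y` and the factor `hospital` are made-up names, and in practice you would estimate the codes on a training split (or out-of-fold) to avoid leakage.

```r
# hand-rolled impact coding: replace a high-cardinality categorical variable with
# the per-level log-odds shift from the overall log-odds (a small pseudo-count
# shrinks rare levels toward the overall rate)
impact_code <- function(x, y, pseudo = 1) {
  p0   <- mean(y)                                    # overall event rate
  tab  <- tapply(y, x, function(v) (sum(v) + pseudo * p0) / (length(v) + pseudo))
  code <- log(tab / (1 - tab)) - log(p0 / (1 - p0))  # level log-odds minus overall log-odds
  unname(code[as.character(x)])                      # one numeric column replaces the factor
}

dat$hospital_impact <- impact_code(dat$hospital, dat$y)
# then drop `hospital` and feed `hospital_impact` into the model as a single numeric predictor
```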
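
And for the 2-step / RFE bullets in item 1, a quick screening sketch. `X` (a data frame of candidate predictors), `y` (the outcome), the 50,000-row sample and the sizes tried are all placeholders; treat it as a starting point rather than a recipe.

```r
# step 1 of the 2-step idea: screen for a manageable set of predictors,
# possibly on a random sample of rows
library(randomForest)
library(caret)

idx <- sample(nrow(X), 50000)                 # screening on a sample is usually enough
rf  <- randomForest(x = X[idx, ], y = y[idx], ntree = 500, importance = TRUE)
varImpPlot(rf, n.var = 30)                    # eyeball the top 20-50 predictors

# or recursive feature elimination via caret (cross-validated, random-forest based)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
sel  <- rfe(x = X[idx, ], y = y[idx], sizes = c(10, 20, 50), rfeControl = ctrl)
predictors(sel)                               # the retained variable names
```

Step 2 is then your normal modelling approach on the full dataset, restricted to the variables that survive the screen.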

Hope this gives you some more ideas.

3 Likes

This would be better for http://stackoverflow.com with carefully chosen tags such as r, sas, etc.

1 Like

Working link: https://github.com/guhjy/bootpenal

Although if you are having this issue in SAS, you’re probably better off just translating the code rather than running it in R, which is probably even worse for memory.

1 Like

thanks for the link! yeah, i’d definitely translate to sas, which would help me understand the approach at the same time

actually, it’s less a Q about computing power (despite the title of the post), because i’m resigned to that limitation, and more a Q of stats, ie the meta-analysis approach, or bootstrapping… I’m just surprised others haven’t been confronted with the issue (there is never any comment in the papers i’m reading re how it was handled). The worry then is that the limited computing power is dictating the analysis, ie we are favouring crude models, and that seems to be the case in the literature

My intuition is that ML folks should have some insight here since they are actually fitting complex models to huge datasets on a semi-regular basis.

i saw some posts on medium from the ML folks but it’s all technical, incidental detail and no substance, pandas and who-knows-what eg https://medium.com/analytics-vidhya/split-apply-combine-strategy-for-data-mining-4fd6e2a0cc99

Have you seen:

Rhodes, KM, Turner, RM, Payne, RA, White, IR.
Computationally efficient methods for fitting mixed models to electronic health records data.
Statistics in Medicine. 2018; 37: 4557–4570.
https://doi.org/10.1002/sim.7944

?

Might be what you’re looking for.

1 Like

My first idea would be the “expand hardware capacity” solution, but with a cloud approach (so that you don’t have to buy hardware that would be impossible and/or uneconomical to own for a few such calculations).

E.g. with Google Compute Engine, you can get almost 1 TB of memory for $6.3/hour.

An R package is available with which you can set up the whole environment in a minute and use the familiar RStudio through the browser to carry out the calculations.
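
One such package is googleComputeEngineR; a minimal sketch of the workflow (project, zone, machine type, names and credentials below are placeholders - pick whatever memory size you actually need):

```r
# spin up a large-memory VM running RStudio Server, work through the browser, then remove it.
# assumes GCE_AUTH_FILE, GCE_DEFAULT_PROJECT_ID and GCE_DEFAULT_ZONE are set as environment variables
library(googleComputeEngineR)

vm <- gce_vm(name            = "bigmem-rstudio",
             template        = "rstudio",          # pre-built RStudio Server image
             predefined_type = "n1-highmem-96",    # ~624 GB RAM; n1-ultramem-40 gives ~961 GB
             username        = "me",
             password        = "change-me")

# ...upload the data, fit the models in RStudio at the instance's external IP...

gce_vm_stop(vm)       # stop paying for compute while keeping the disk
gce_vm_delete(vm)     # or remove the instance (and its data) entirely when finished
```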

And, for GCE, you can get $300 of free credit (valid for a year) just by registering.

(Why would this be my first attempt? Because if it succeeds, you don’t have to make any compromise, unlike with the other solutions.)

1 Like

looking forward to reading this, thanks!

Re the cloud approach, Google etc.

Is that an option if you are using individual patient data? Thankfully I don’t have this big data problem in the first place, but if I did there is no way I could upload our data to a service like that, for data protection reasons.

Good question. I can’t give you a legal answer, but practically, you could create the virtual machine instance, upload the data, carry out the calculations, download the results, and then delete the whole instance (in contrast to having a permanent instance which you use from time to time). That is the surest way to get rid of any data in the cloud, as the instance itself is eliminated.

Now, while I think it’ll indeed delete every trace of the data, whether you can have a legal guarantee for this is beyond my knowledge… (A quick search gives pages like this or this; you’d perhaps use these as a starting point to find out.)