the task: analysing patient registry data (millions of patients)
the problem: the data are so large that the stats software complains of “insufficient” memory when attempting to fit ‘complex’ models (I’ll leave out the details)
the solution: buy more RAM, but I’m already up to 64 GB. Anything else?
some background info:
- there are plenty of tips online re making code more efficient, limiting CPU time, data storage, etc.
has anyone implemented this before? I guess this is a problem statisticians encounter, unless we limit ourselves to the crudest models? Yet it is not raised when issues with registry data are discussed, e.g. in Analysis, Interpretation, and Reporting of Registry Data To Evaluate Outcomes. I cannot see any references to the SAM approach in the medical literature, and I would have to let my computer run for weeks.
edit: I would also have to monitor progress and flag those sub-samples where the model failed to converge, a further complication, especially if it takes weeks to run.
You don’t say what software you’re using, but I think there are six general strategies for dealing with Big Data problems, which I define as “more data and/or model than your normal computing platform can handle, without funky/painful workarounds.”
In approximate order of ease:
1. reduce dataset width
2. reduce dataset height
3. chunk in pieces
4. expand hardware capacity
5. simplify computations
6. find more efficient software (absolute last resort, for me)
These are not mutually exclusive, of course.
Some more specific ideas:
reduce dataset width
consult with SME(s)
drop large groups of variables included just because they’re available, if there’s no history supporting their usefulness
if categorical variables have many levels, consider
combining or ignoring low-frequency levels … I’ve seen 1% and 0.5% used as thresholds (a short sketch follows this sub-list)
“impact coding” - basically assigning Naive Bayes weights so one categorical variable feeds into the models as only one numeric variable
see Zumel & Mount in Win-Vector Blog
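A minimal base-R sketch of the level-collapsing idea, assuming a data frame `dat` with a hypothetical many-level factor `diagnosis_code` (both names are placeholders, not from the post):

```r
## Collapse factor levels that occur in fewer than 1% of rows into an "Other" bucket.
collapse_rare <- function(x, threshold = 0.01) {
  x <- as.factor(x)
  freq <- prop.table(table(x))                 # relative frequency of each level
  rare <- names(freq)[freq < threshold]        # levels below the threshold
  levels(x)[levels(x) %in% rare] <- "Other"    # merge them into one level
  x
}

dat$diagnosis_code <- collapse_rare(dat$diagnosis_code, threshold = 0.01)
```

For impact coding itself, the vtreat package by Zumel & Mount implements it; the sketch above only covers the simpler collapsing step.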
2-step analysis
Step 1: find top ## predictors, possibly using a sample of the data
Step 2: fit a good model on the full dataset, using the reduced set of variables, using your normal approach
Random Forests, because they randomly sample both rows and columns, are pretty good at identifying the top 20-50 predictors in a dataset (a sketch of this 2-step approach appears at the end of this section)
look at the variable importance measures the forest produces (not to be confused with VIF, the variance inflation factor)
it helps here to have the mindset that there isn’t ONE (for sure) BEST model; there are many possible models that are fundamentally very similar (especially once you get past the top 20 predictors)
PCA
recursive feature elimination (“RFE”)
in R, the caret package by Max Kuhn does this
I confess I’m fuzzy on why RFE is less bad than traditional (discredited) stepwise approaches
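To make the 2-step idea concrete, here is a sketch using the ranger random-forest package for Step 1 and an ordinary logistic model for Step 2; `dat`, `outcome`, the 200k subsample size, and the “top 20” cut-off are all placeholders, not recommendations:

```r
library(ranger)

## Step 1: rank predictors by permutation importance on a manageable subsample.
sub <- dat[sample(nrow(dat), 2e5), ]
rf  <- ranger(outcome ~ ., data = sub,
              importance = "permutation", num.trees = 500)
top <- names(sort(rf$variable.importance, decreasing = TRUE))[1:20]

## Step 2: fit the model you actually want on the FULL dataset,
## restricted to the reduced set of predictors.
form <- reformulate(top, response = "outcome")
fit  <- glm(form, data = dat, family = binomial())
summary(fit)
```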
reduce dataset height
you have a lot of data … can you run on a 70% / 50% / 25% etc sample?
Mo’ Data is not (necessarily) the same as More Information
stratify … are there natural subgroups that may have different data relationships?
e.g., females vs males - if you were going to interact sex with some other variables, and you have a lot of data, maybe running Sex as a BY-group is easier (see the sketch at the end of this section)
in SAS, by-group processing is easy. (And you can INDEX on the by-var, rather than physically sorting a huge dataset.)
depending on your software of choice, you may need to create physical or virtual subset copies of your BigDataset
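If your software is R rather than SAS, the BY-group idea can be approximated with split() plus lapply(); a minimal sketch with hypothetical variable names:

```r
## Fit the same model separately within each level of sex, instead of one
## huge model with sex interaction terms. All variable names are placeholders.
fits_by_sex <- lapply(split(dat, dat$sex), function(d)
  glm(outcome ~ age + comorbidity_count, data = d, family = binomial()))

lapply(fits_by_sex, coef)
```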
chunk in pieces
stratification as listed above also fits here
model averaging - two main types
(1) “bagging”, aka bootstrap aggregating
* most uses of bootstrapping for model/test-statistic evaluation replicate the size of the original dataset … but that isn’t necessarily a requirement if your intent is to do model averaging
(2) random non-overlapping partitions … 2/3/4/5/10 depending on the height and width of your data (i.e., be aware of what you’re doing to EPV, events per variable); a sketch follows at the end of this section
For either of these, you could do feature selection/reduction either OUTSIDE THE LOOP or INSIDE
however, if the core reason you’re doing model averaging is just too much data, OUTSIDE the loop variable selection makes more sense
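A rough sketch of option (2), random non-overlapping partitions with a simple coefficient average (variable names are placeholders; an inverse-variance-weighted average of the per-partition estimates would be the meta-analysis flavour mentioned elsewhere in this thread):

```r
## Split rows into K non-overlapping random partitions, fit the same model
## on each, then average the coefficients across partitions.
## Assumes every partition contains all factor levels, so the fits align.
K    <- 5
part <- sample(rep(seq_len(K), length.out = nrow(dat)))
fits <- lapply(split(dat, part), function(d)
  glm(outcome ~ age + sex + comorbidity_count, data = d, family = binomial()))

coefs    <- sapply(fits, coef)   # one column of coefficients per partition
avg_coef <- rowMeans(coefs)
avg_coef
```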
expand hardware capacity
is it feasible to port the problem to the cloud, where you can rent a huge virtual-server for only as long as you need it?
I’ll mention parallel processing here … if you can chunk the problem into discrete manageable parts, you can probably cut the wall-clock time substantially by executing the parts in parallel (see the sketch below)
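Reusing the partitioned data from the sketch above, a minimal example of running the per-partition fits in parallel with base R’s parallel package (core count and model formula are placeholders):

```r
library(parallel)

## Fit each partition on its own core; 'dat' and 'part' come from the
## earlier partitioning sketch.
cl   <- makeCluster(max(1, detectCores() - 1))
fits <- parLapply(cl, split(dat, part), function(d)
  glm(outcome ~ age + sex + comorbidity_count, data = d, family = binomial()))
stopCluster(cl)
```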
simplify computations
I’ve seen people run OLS on logistic problems, and argue (sometimes with data) that the distortions weren’t large for the specific thing they were doing
hmmmm … but what comes to mind is that all models are flawed, yet some are useful … caveat emptor
but there may be other algorithms that carry less of a memory burden than your current one (one R example is sketched below)
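As one concrete R example of the “lighter memory burden” point (arguably closer to the “more efficient software” option), the biglm package fits GLMs by updating the model a chunk of rows at a time, so the whole dataset never has to sit in a single model matrix; the formula and chunk size below are placeholders:

```r
library(biglm)

## bigglm() processes the data in chunks, so memory use is roughly bounded
## by the chunk size rather than by the full dataset.
fit <- bigglm(outcome ~ age + sex + comorbidity_count,
              data = dat, family = binomial(), chunksize = 10000)
summary(fit)
```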
actually, it’s less a question about computer power (despite the title of the post), because I’m resigned to that limitation, and more a question of stats, i.e. the meta-analysis approach, or bootstrapping… I’m just surprised others haven’t been confronted with the issue (there is never any comment in the papers I’m reading about how it was handled). The worry then is that limited computer power is dictating the analysis, i.e. we are favouring crude models, and that seems to be the case in the literature.
Rhodes KM, Turner RM, Payne RA, White IR. Computationally efficient methods for fitting mixed models to electronic health records data. Statistics in Medicine. 2018;37:4557–4570. https://doi.org/10.1002/sim.7944
My first idea would be the “expand hardware capacity” solution, but with a cloud approach (so that you don’t have to buy something which would be impossible and/or uneconomical for a few such calculations).
E.g. with Google Compute Engine, you can get almost 1 TB of memory for $6.30/hour.
An R package is available with which you can set up the whole environment in a minute and use the familiar RStudio through the web to carry out the calculations (a sketch is at the end of this post).
And, for GCE, you can get $300 and 1 year free by registering.
(Why would this be my first attempt? Because if it succeeds, you don’t have to make any of the compromises the other solutions involve.)
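I assume the package meant above is googleComputeEngineR; if so, a minimal sketch of spinning up (and later deleting) a high-memory RStudio Server instance could look like this (the instance name, machine type, and credentials are placeholders, and the GCE auth file / project / zone environment variables must already be configured):

```r
library(googleComputeEngineR)

## Launch a high-memory VM with RStudio Server pre-installed.
vm <- gce_vm(name            = "registry-analysis",
             template        = "rstudio",
             predefined_type = "n1-highmem-96",   # ~624 GB RAM; choose to suit
             username        = "analyst",
             password        = "change-me")

## ... upload the data and fit the models through the RStudio web interface ...

## Delete the instance when finished.
gce_vm_delete(vm)
```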
Is that an option if you are using individual patient data? Thankfully I don’t have this big data problem in the first place, but if I did, there is no way I could upload our data to a service like that, for data protection reasons.
Good question. I can’t give you a legal answer, but practically, you could create the virtual machine instance, upload the data, carry out the calculations, download the results, and then delete the whole instance (in contrast to having a permanent instance which you use from time to time). This is the surest way to get rid of any data in the cloud, as the instance itself is eliminated.
Now, while I think it will indeed delete every trace of the data, whether you can have a legal guarantee for this is beyond my knowledge… (A quick search turns up some relevant pages; perhaps use those as a starting point to find out.)