Observations on Big Data, Precision Health, and Machine Learning

Examples: objective physical activity data collected continuously to augment what occasional 6-minute walk tests tell us; continuous monitoring of cigarette smoking.

1 Like

From a medical standpoint, I’m pretty skeptical of the potential value of most types of physiologic monitoring, beyond what is already available.

I can see how monitoring could be quite valuable in certain research areas (e.g., objective measurement of physical activity level like #steps/day in those being treated for progressive/chronic neurologic conditions or in rehabilitation science), but activity level can be measured fairly easily/inexpensively/non-invasively. Maybe some type of outpatient monitoring in patients with certain types of epilepsy could also prove useful?

It would be great if we could somehow monitor BP continuously in patients over the long course of antihypertensive clinical trials, to better characterize the relationship between BP and CV outcomes. But unless we can then afford to outfit every hypertensive patient in daily practice with some type of continuous BP monitor after the RCT is done (so as to better characterize his own BP response to treatment), the value of that more granular RCT knowledge will be limited.

We can detect atrial fibrillation with a watch, but, to the best of my knowledge, we really have very limited understanding of when the risk/benefit ratio for anticoagulation becomes favourable for a given patient as a function of overall a.fib burden. The patients enrolled in the old anticoagulation trials for a.fib did not have their a.fib diagnosed based on a 2-minute run of a.fib detected on a watch…

Continuous glucose monitoring has been liberating for many patients with diabetes who are treated with insulin. The ability to promptly identify, and therefore treat, hypoglycaemia, and to adjust insulin doses accordingly, has been a life-changing medical advance for many patients.

Aside from the above examples in specific disease areas, though, my impression is that most of those who extol the “untapped potential” of other types of physiologic monitoring are divorced from the realities of medical diagnosis/practice and the needs of patients and are motivated by greed rather than altruism. There’s a lot of money to be made by convincing the affluent “worried well” that there’s inherent value in obsessive monitoring of every bodily function, but precious little evidence that such monitoring changes outcomes.

3 Likes

Excellent points, Erin. Instead of efficacy, I wonder about the potential of personal monitoring of some sort to improve dosing of meds.

2 Likes

The crucial element of such monitoring is, I think, whether it gets folded into control loops or whether it merely “captures” information for the extractive industry of what you nicely termed “precision capitalism.”

There’s certainly a place for retrospective, higher-order learning about the control algorithms themselves. But I posit that any purely extractive “capturing” of information solely to support promised inference in the future isn’t sincerely motivated by patients’ needs.

3 Likes

Thinking “head-to-toe” of all drug classes in common use, there are actually very few for which drug dose would plausibly be adjusted based on some type of physiologic monitoring, either for efficacy or safety reasons. For symptomatic conditions, we usually dose based on symptomatic improvement (e.g., inhalers for asthma, analgesics for nerve pain, beta blockers for angina); toxicity is gauged not by blood level, but by reported dose-limiting side effects or office findings (e.g., sedation with gabapentinoids; bradycardia with beta blockers). These dosing decisions would not likely be meaningfully enhanced by more detailed physiologic monitoring. For asymptomatic conditions (e.g., asymptomatic CAD), we tailor dose to level of risk, often in combination with periodic blood testing to guide dose adjustment (e.g., aiming for a substantial decrease in LDL); decisions to stop a statin are largely clinical, not lab-based.

In addition to ambulatory BP monitoring and blood glucose monitoring for patients on insulin, home INR monitoring for patients on warfarin who have labile INRs and antiepileptic drug level monitoring for patients with very brittle epilepsy (who also quickly develop side effects at supra-therapeutic drug levels) are the only other examples I could think of for which more frequent physiologic monitoring could plausibly be helpful.

3 Likes

One thing that frustrates me (well… one of the many things that frustrates me) is how “generalizability” is used as a cudgel to diminish trials and promote so-called RWD. However, just as nobody steps into the same river twice, there is no assurance that a prediction or effect estimate calculated from RWD is similarly useful outside the specific context in which it was developed. Yet seemingly zero attention is paid to the possibility of changing or different distributions of major drivers of risk or effect modifiers, while the first thing any trial gets called out for is how its effect estimates can’t possibly be transportable, even though trialists seem to talk about this challenge much, much more than the RWD set.

7 Likes

Extremely well said, Darren. This reminds me of my belief that what RWD excels at is describing prevailing outcomes of prevailing treatment strategies. Not much more than that (other than doing a good job of describing who gets which treatment), and not causal.

1 Like

I share your pessimism @f2harrell about the ability of high-dimensional data, electronic health record research, precision medicine, and so-called heterogeneity of treatment effect to provide useful clinical information. Traditionally, clinicians like me use diagnostic tests on their own to do this, but admittedly with limited success. Simplistically, a test result within the normal range predicts a low probability of an adverse disease outcome without treatment, and that probability remains much the same if treatment is given. If the result is outside the normal range, however, it is assumed that there will be a difference between the probability of the outcome on treatment and on no treatment. In reality, the probabilities of an outcome on treatment and on no treatment change with the ‘severity’ of the test result, not in a ‘cliff edge’ fashion inside and outside the normal range.

It should be possible, with care, to take covariates into consideration to increase the differences between the probabilities of an outcome with and without treatment. For example, if the albumin excretion rate (AER) is low (e.g., 20 mcg/min), this suggests that there is little renal glomerular damage and therefore little scope for improvement by treatment with an angiotensin receptor blocker (ARB). The probability of developing ‘nephropathy’ within 2 years is therefore about the same on control (e.g., 0.02) and on treatment (e.g., 0.01) in figure 5 of a previous post: Should one derive risk difference from the odds ratio? - #340 by HuwLlewelyn . However, at an AER of 100 mcg/min, the probability of nephropathy on control is 0.24 and on treatment it is 0.1, a risk difference of 0.14. In the RCT, the covariates HbA1c and BP were kept to a minimum by treatment before randomisation, so the baseline risk was very low. However, if there was poor diabetic control in an individual, as evidenced by a high HbA1c, this high risk should not be improved by treatment with an ARB, as the latter does not improve diabetic control, so the risk reduction at an AER of 100 mcg/min would remain about 0.14. Another source of ‘heterogeneity of treatment effect’ would of course be the drug dosage, the expected difference being zero at a dose of 0 mg per day and increasing as the dose is increased.
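As a rough illustration of interpolating between those quoted AER points, here is a minimal sketch that assumes a logit-linear relationship in log(AER) for each arm and solves it exactly through the two probabilities given above. The functional form (and the helper names) are assumptions for illustration only, not the model used in the post or in the RCT.

```python
import numpy as np

def fit_logistic_through(points):
    """Solve logit(p) = a + b*log(AER) exactly from two (AER, p) pairs.
    The log-linear logit form is an assumption for illustration."""
    (x1, p1), (x2, p2) = points
    logit = lambda p: np.log(p / (1 - p))
    b = (logit(p2) - logit(p1)) / (np.log(x2) - np.log(x1))
    a = logit(p1) - b * np.log(x1)
    return a, b

def risk(aer, a, b):
    """Predicted probability of nephropathy at a given AER (mcg/min)."""
    return 1 / (1 + np.exp(-(a + b * np.log(aer))))

# Quoted points: control 0.02 at AER 20 and 0.24 at 100; ARB 0.01 at 20 and 0.10 at 100
a_c, b_c = fit_logistic_through([(20, 0.02), (100, 0.24)])
a_t, b_t = fit_logistic_through([(20, 0.01), (100, 0.10)])

for aer in (20, 50, 100):
    rd = risk(aer, a_c, b_c) - risk(aer, a_t, b_t)
    print(f"AER {aer:>3} mcg/min: risk difference ~ {rd:.3f}")
```

By construction this reproduces the quoted risk differences of 0.01 at AER 20 and 0.14 at AER 100, and fills in intermediate values only under the assumed curve.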

The above reasoning represents a hypothesis based on the results of an RCT. I agree therefore that it has to be tested by setting up calibration curves. I would be optimistic that a model built on this type of reasoning from an RCT result would provide helpful clinical predictions, unlike more speculative approaches.

5 Likes

Huw - this is the kind of reasoning that should take place much more frequently than the “wing and a prayer” high-dimensional analysis that seems to get most of the research funding. Thanks for your thoughtful post, as always. And there is a lot of unharvested information in clinical lab data.

3 Likes

I have read the above paper. It does not seem to say so explicitly, but is it possible to use the fitted loess curves to arrive at a function that ‘calibrates’ the model’s estimated probabilities, so that when a very large number of these ‘calibrated probabilities’ are plotted against the frequency of correct predictions, they fall on the line of identity?

I think that’s right. It’s just that we don’t have a validation of the corrected calibration curve.
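For concreteness, here is a minimal sketch of that idea on simulated data: a lowess smooth of the outcome against the raw predicted probabilities serves as the calibration mapping. The simulated miscalibration, the lowess span, and the decile check are arbitrary assumptions, not taken from the paper, and as noted above the corrected curve would still need its own validation.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)

# Simulated stand-in for a miscalibrated model: true event probabilities,
# observed 0/1 outcomes, and distorted "raw" predictions (assumed distortion).
true_p = rng.uniform(0.05, 0.95, 5000)
y = rng.binomial(1, true_p)
raw_p = np.clip(true_p ** 0.7, 1e-4, 1 - 1e-4)

# Lowess of outcome on raw predicted probability = smooth calibration curve.
curve = lowess(y, raw_p, frac=0.3)  # sorted array of (raw_p, smoothed outcome)

def recalibrate(p):
    """Map raw predicted probabilities onto the lowess calibration curve."""
    return np.interp(p, curve[:, 0], curve[:, 1])

# Crude check: within deciles of the recalibrated probabilities, their mean
# should now sit close to the observed event rate (the line of identity).
cal_p = recalibrate(raw_p)
edges = np.quantile(cal_p, np.linspace(0.1, 0.9, 9))
bins = np.digitize(cal_p, edges)
for b in range(10):
    m = bins == b
    print(f"decile {b}: mean calibrated p = {cal_p[m].mean():.3f}, "
          f"observed rate = {y[m].mean():.3f}")
```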

These look like interesting papers, and I’ve only just glanced at Figures 6C and 6D, so apologies if there is an obvious misconception on my part.

I would not call the problem you cite in those figures confounding. If you were interested in the causal effect of prognostic variables on the outcome, perhaps. But we’re not. Rather, we’re interested in applying the whole predictive function to external populations. (That causes of the outcome may make the best predictors, while I agree, is somewhat beside the point.) Geographic area or baseline risk, as illustrated, would be potential modifiers of the effect of treatment on Y, which is why they must be accounted for when transporting predictions from one population to another.

Does that make sense? Perhaps once I read the papers I’ll have a better feel for this language in the context of prediction. Thanks for sharing.

3 Likes

No prob. Note, however, that the argument here becomes circular: you would only call a variable a confounder if you were interested in what you define as causal inference. Regardless of the definition used, though, the assumed causal relationship shown in these figures is of the type X ← C → Y, where X is the prognostic variables and Y is the outcome. We typically call C here a “confounder”.

Even if we do decide to call “C” something else, that does not change the fact that it is the presumed commonalities in causal mechanisms that license the transportation of knowledge across populations. Thus, we cannot avoid the use of causal considerations no matter how hard we try. They lie at the foundations of statistical science, including when predictive models are being developed.

2 Likes

Pertinent to this discussion thread, I wish to draw attention to a special issue of Perspectives in Biology and Medicine (Volume 61, Number 4, Autumn 2018) with the title The Precision Medicine Bubble. The issue takes up the topic of precision medicine in some depth and has a refreshingly cynical take on the contributions of precision medicine to date.

https://muse.jhu.edu/issue/39661

Particular attention is drawn to the paper by Sui Huang.

Huang S. The Tension Between Big Data and Theory in the “Omics” Era of Biomedical Research. Perspect Biol Med. 2018;61(4):472-488. doi: 10.1353/pbm.2018.0058. PMID: 30613031.

Contrasting the value of “big data” in commerce and biomedicine, Huang notes that:

The problem is that the spectacular success of the internet-based applications of Big Data has tempted biomedical researchers to think that using clever algorithms to mine and comb the vast amount of data produced by the omics revolution will recapitulate the success of Google, Amazon, or Netflix.

Going further, Huang says:

The fundamental differences between the natural sciences, on the one hand, which seek new understanding of organisms, and “data sciences” on the other hand, which serve consumer applications, give rise to formidable challenges for a quick adoption of the Big Data approach to biology and medicine. The human body and its (mal)functions are more complex than recognizing cats in photos or predicting client habits from purchase history. The latter tasks can, without denigration, be considered “superficial”: here data directly maps to utility without the need of a theory that formalizes our understanding of the mechanism of how data translate into useful knowledge. But such heuristics does not lead far in basic sciences, notably the life sciences.

In the same issue of this journal, epidemiologists Nigel Paneth and Sten Vermund highlight major advances in public health made over the last century. None used big data, precision medicine, or machine learning.

Considering precision medicine, Paneth and Vermund go so far as to posit that:

Precision medicine built on a foundation of host genetics can benefit some patients, but it has no realistic chance of linking human genetics to population-level health improvement. There are too few diseases where human genetic variation will make a substantial difference in approaches to screening, diagnosis, or therapy to justify the disproportionate investments into this approach as a principal priority for the NIH and for the private sector.

Paneth N, Vermund SH. Human Molecular Genetics Has Not Yet Contributed to Measurable Public Health Advances. Perspect Biol Med. 2018;61(4):537-549. doi: 10.1353/pbm.2018.0063. PMID: 30613036.

Also in this issue, epidemiologist Richard S. Cooper reviews the history of the “cardiovascular prevention movement” over the last 40 years, highlighting its spectacular success in reducing mortality due to cardiovascular disease. He mentions two pivotal observational studies that might be called “little data” (the Framingham Study, with 5,209 men and women, and the “Seven Countries Study”, with about 12,000 men) that nonetheless identified the modifiable risk factors for cardiovascular disease (serum cholesterol, hypertension, cigarette smoking) that are the cornerstone of interventions accounting for a substantial proportion of the decline in cardiovascular disease mortality.

Cooper notes that:

It will not escape the notice of the reader of this issue of the journal that genomics and “precision medicine” have, to date, made no contribution whatsoever to control of CVD as a mass disease.

Cooper RS. Control of Cardiovascular Disease in the 20th Century: Meeting the Challenge of Chronic Degenerative Disease. Perspect Biol Med. 2018;61(4):550-559. doi: 10.1353/pbm.2018.0064. PMID: 30613037.

3 Likes

Useful articles, but some of the claims are overstated. Big data and traditional machine learning are in many ways the opposite of, or at least orthogonal to, precision medicine, which bases its inferences on mechanistic considerations regarding the biology of the disease and other covariates of each patient seen in clinic. A fair number of my patients should have been dead from their cancer based on population-level guidelines and recommendations, but they are alive thanks to patient-specific interventions derived from biological knowledge. One such recent example is described here (from 1:23:00 onwards).

A good summary from a data science perspective of the distinction and trade-offs between patient relevance (a focus of precision medicine) and population-level robustness can be found here. Some inferential approaches will typically provide better balances between relevance and robustness. Different stakeholders will nevertheless have different opinions on the optimal trade-off between the two, and that is OK.

3 Likes

I really like the Huang article cited above. It focuses primarily on the unrealistic expectations of “Big Data” proponents and doesn’t really discuss precision medicine per se.

I didn’t understand much of your talk in the link you provided (not being an oncologist). But I think your general point is that there are situations in medicine where obtaining very specific knowledge about an individual patient can meaningfully alter therapeutic decisions and prognosis. This is particularly true in oncology, where, for example, a deep understanding of the genetics/biology of a given patient’s tumour can (sometimes) suggest more or less rational treatment choices. If we know that the tumour in a given patient is not being driven by a certain biologic pathway, then giving the patient a drug that inhibits that pathway will be futile.

Your assertion that the “Big Data” hype is in many ways “orthogonal” to precision medicine feels on point. It feels like the (relatively) small number of success stories in precision medicine (e.g., trastuzumab) might have been misconstrued by those with careers focused on data analysis (rather than biology/medicine) as evidence of the inherent value of simply gathering more and more biologic data in a hypothesis-free manner. Far from understanding the intentionality (painstaking triangulation of data sources) that likely underlies development of effective targeted therapies, Big Data proponents seem to be under the illusion that such therapies arose simply through “brute force” computerized analysis of reams of biologic data.

History is a great teacher; in science, we don’t spend enough time taking stock of how we got to where we are. Someone should write a paper, using examples of precision medicine “success” stories, highlighting the key role of intentionality in the development of targeted therapies and contrasting these stories with the hypothesis-free data-dredging exercises being proposed by many Big Data proponents.

5 Likes

Could not have said it better. The bigger the data, the more pertinent it becomes that they are anchored by contextual knowledge. Otherwise big data serve as an excellent way to fool ourselves in ways that can harm patient care.

4 Likes

It’s interesting that the recent call from the FDA inviting companies to suggest uses of RWD refers to ‘new study designs’ and innovative designs: https://www.raps.org/news-and-articles/news-articles/2022/10/fda-starts-pdufa-vii-programs-for-real-world-evide. This shows that statisticians in industry have the opportunity to advance methods, and that this is not done entirely in academia, although we will have to wait some time to see what new study designs they come up with. I hope people share in this thread any technical papers that begin to appear.

4 Likes

I just noted a paper on LinkedIn: Paul Brown on LinkedIn: Principles of Experimental Design for Big Data Analysis

In “Principles of Experimental Design for Big Data Analysis”, they describe a sequential design approach, i.e., an algorithm to subset the data using experimental design methodology following Savage: select the design that maximises the expected utility. “Our objective is to avoid the analysis of the big data of size N by selecting a subset of the data of size n using the principles of optimal experimental design where the goal of the analysis is predefined.”

I find it interesting and appealing: a pseudo-Bayes approach that discards the prior. A rough sketch of the subset-selection idea is below.
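To make the idea concrete, here is a minimal sketch of design-based subsampling for a linear model: greedily pick the n rows of a big design matrix that maximise the D-optimality criterion det(XᵀX) on the chosen subset. This is a generic illustration of the principle, not the authors’ sequential algorithm; the helper name, toy data, ridge term, and greedy search are assumptions for illustration.

```python
import numpy as np

def greedy_d_optimal_subset(X, n, ridge=1e-6):
    """Greedily select n rows of the N x p design matrix X to maximise the
    D-optimality criterion det(X_S' X_S) for the chosen subset S.
    Since det(M + x x') = det(M) * (1 + x' M^{-1} x), the best next candidate
    is the row with the largest leverage-like gain x' M^{-1} x."""
    N, p = X.shape
    M = ridge * np.eye(p)            # regularised running information matrix
    available = np.ones(N, dtype=bool)
    selected = []
    for _ in range(n):
        Minv = np.linalg.inv(M)
        gains = np.einsum('ij,jk,ik->i', X, Minv, X)  # x_i' Minv x_i for each row
        gains[~available] = -np.inf                    # never re-select a row
        best = int(np.argmax(gains))
        selected.append(best)
        available[best] = False
        M += np.outer(X[best], X[best])
    return np.array(selected)

# Toy use: keep n = 200 of N = 20,000 candidate rows for a linear model fit
# (intercept plus three hypothetical covariates).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20_000), rng.normal(size=(20_000, 3))])
subset_idx = greedy_d_optimal_subset(X, 200)
```

The chosen rows would then be the only ones carried forward into the (predefined) analysis, which is the point of avoiding the full-N computation.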

1 Like

open access: Contribution of Real-World Evidence in European Medicines Agency’s Regulatory Decision Making

quite a list of inherent and seemingly intractable problems: “The main issues discussed with respect to RWE were around methodological weaknesses, including missing data, lack of population representativeness, small sample size, lack of an adequate or prespecified analysis plan, and the risk of several types of confounding and bias (mostly selection bias), which was in line with previous studies.”

2 Likes