Discussion of Polygenic Prediction of Weight and Obesity Trajectories from Birth to Adulthood

This is a place to discuss the article by Amit Khera, Sekar Kathiresan, and others, following extensive discussions on Twitter.

What if I had a new way to measure current blood glucose levels, and the R^2 between the new measurement and the gold standard from a blood draw was 0.15, but the correlation between the decile of the new measurement and the mean gold-standard level within each decile was 1.0? Would I believe that the new measurement is ready for prime time?

The use of a polygenic approach to assessing associations between genetic variants and obesity, without wasting information in the data in an attempt to select “winning” and “losing” variants, has merit. The high-dimensional statistical analysis of 2,100,302 genetic variants doesn’t have any obvious shortcomings, although it would be very interesting to compare the R^2 it achieves in predicting log BMI to that achieved by a naive Bayes (“stupid regression”) algorithm that estimates 2,100,302 regression coefficients univariately, then just puts them together and recalibrates them against log BMI. The data source is the UK Biobank, with 119,951 participants aged 40-69. Five genome-wide polygenic scores (GPS) were created, because 5 tuning parameters were tried. There was a “statistically significant” association of each score with BMI (p < 0.0001), a test that is not relevant. What is important is the explained variation in BMI. [Note the analysis should probably have used log BMI throughout, but there are more important issues; see below.] The proportion of variance in BMI explained by the 5 scores ranged from 0.08 to 0.085. The top score was used in subsequent steps. [Note that the paper is a bit misleading in providing the correlation coefficients and not their squares, the latter being the proportion of explained outcome variation.]

The apparently best discriminating GPS was tested in 306,125 individuals from four independent testing datasets. First, 288,016 middle-aged individuals were analyzed, resulting in R^{2} = 0.084. The paper did not describe an important detail: exactly how the validation R^2 was computed. It is imperative that no re-calibration be done on the GPS, i.e., that this formula be used: R^{2} = 1 - \frac{SSE}{SST} where SSE is the sum of squared errors and SST is n-1 times the variance of BMI. Many researchers have made the mistake of using the usual correlation coefficient formula, which allows for an internal recalibration of the predictions, making the R^2 optimistic.
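To make the distinction concrete, here is a minimal sketch on simulated data (not the paper's data). The function names, the simulated BMI values, and the deliberately miscalibrated predictions are all assumptions made up for illustration. It compares 1 - SSE/SST computed on the predictions exactly as supplied with the squared correlation, which is unchanged by any linear rescaling of the predictions.

```python
import random

def r2_no_recalibration(y, pred):
    """1 - SSE/SST, using the predictions exactly as supplied."""
    ybar = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - ybar) ** 2 for yi in y)  # (n - 1) times the sample variance
    return 1 - sse / sst

def squared_correlation(y, pred):
    """Squared Pearson correlation; implicitly refits slope and intercept."""
    n = len(y)
    ybar, pbar = sum(y) / n, sum(pred) / n
    sxy = sum((yi - ybar) * (pi - pbar) for yi, pi in zip(y, pred))
    sxx = sum((pi - pbar) ** 2 for pi in pred)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

random.seed(1)
n = 5000
signal = [random.gauss(0, 1) for _ in range(n)]
# Simulated "observed" BMI: a weak signal plus a lot of noise (made-up numbers).
bmi = [27 + 1.4 * s + random.gauss(0, 4.5) for s in signal]
# Hypothetical miscalibrated predictions: intercept shifted, slope too steep.
pred = [29 + 3.0 * s for s in signal]

r2_raw = r2_no_recalibration(bmi, pred)
r2_corr = squared_correlation(bmi, pred)
print(f"1 - SSE/SST without recalibration: {r2_raw:.3f}")
print(f"squared correlation:               {r2_corr:.3f}")
```

The squared correlation can never be smaller than the no-recalibration value on the same data, because it corresponds to the best possible linear recalibration. In this simulation the miscalibration is severe enough that 1 - SSE/SST should come out negative while the squared correlation still looks respectable.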

Before describing some serious analytical problems with the research taking the study design and measurements for granted, it should be noted that BMI, being made up of height and weight, is a problematic variable for analysis. As described here it is better to analyze weight adjusted for height. Furthermore, both height and weight are response (dependent) variables and should be analyzed separately. It is possible, for example, for genetics to alter the relationship between height and weight in such a way that BMI is no longer a good summary of them. And it is often the case that using within-dataset normalization is better than using an external equation such as the BMI equation. That is what weight adjusted for height attempts to do. And what if the GPS is really predicting height and not weight?

Turning back to the authors’ analyses, the next step taken is where IMHO things started to go seriously wrong. The authors went to the trouble of developing a GPS, then in effect declared war on their own risk score by abandoning its absolute value and converting it into a highly nonlinearly transformed and information-losing variable: decile groups. The oddity of such a nonlinear transformation is exemplified by the property that it can be arbitrarily changed by changing subject inclusion criteria. Problems with quantile grouping are spelled out in BBR Section 10.6 and here.
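A small sketch of that last property, using hypothetical standardized GPS values rather than anything from the paper. The helper `tenth_of`, the grid of scores, and the inclusion criterion are assumptions for illustration only. The same person, with the same score, lands in a different tenth once the sample is restricted.

```python
from statistics import NormalDist

def tenth_of(value, scores):
    """Tenth (1 = lowest, 10 = highest) that `value` falls into within `scores`."""
    below = sum(1 for s in scores if s < value)
    return min(10, below * 10 // len(scores) + 1)

# Hypothetical standardized GPS values for 10,000 people, placed on an evenly
# spaced grid of normal quantiles so that the example is deterministic.
n = 10000
scores = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

person_gps = 0.45  # one person's score; it never changes below

# Hypothetical inclusion criterion: keep only people with an above-average GPS.
restricted = [s for s in scores if s > 0]

tenth_full = tenth_of(person_gps, scores)
tenth_restricted = tenth_of(person_gps, restricted)
print("tenth in full sample:      ", tenth_full)
print("tenth in restricted sample:", tenth_restricted)
```

The absolute score of 0.45 means the same thing in both samples; only the group label moves, which is one reason the raw GPS is the better quantity to carry forward.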

Using deciles would cause enough statistical problems even if this were where it ended. But unfortunately the authors chose to emphasize the comparison of the top decile with the bottom decile. [Note: correct terminology is top tenth vs. bottom tenth.] This seriously exaggerates the predictive discrimination of the GPS and completely hides the degree of unexplained variation. As detailed here, explained outcome variation (e.g., the fraction 0.084 listed above) is superior for the purpose at hand. Explained variation is also non-arbitrary. Why not 20-tiles? 15-tiles? But the real answer is that no grouping is needed at all.
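The following simulation sketch shows how much a tenths-based summary can hide. It is not a reanalysis of the paper: the BMI mean of 27, the SD of 5, and the normal distributions are assumptions chosen only so that the explained variation is near the 0.084 quoted above.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(2)
n = 20000
target_r2 = 0.084  # explained variation quoted in the discussion
sd_bmi = 5.0       # assumed overall SD of BMI, for illustration only
gps = [random.gauss(0, 1) for _ in range(n)]
bmi = [27 + (target_r2 ** 0.5) * sd_bmi * g
       + random.gauss(0, sd_bmi * (1 - target_r2) ** 0.5) for g in gps]

r2 = pearson(gps, bmi) ** 2

# Sort people by GPS and cut them into ten equal-sized groups (tenths).
order = sorted(range(n), key=lambda i: gps[i])
size = n // 10
tenths = [[bmi[i] for i in order[k * size:(k + 1) * size]] for k in range(10)]
tenth_means = [statistics.fmean(t) for t in tenths]

r_tenths = pearson(list(range(1, 11)), tenth_means)
top_vs_bottom = tenth_means[-1] - tenth_means[0]
within_sd = statistics.fmean([statistics.pstdev(t) for t in tenths])
overall_sd = statistics.pstdev(bmi)

print(f"R^2 using the continuous GPS:          {r2:.3f}")
print(f"correlation of tenth index with means: {r_tenths:.3f}")
print(f"top tenth minus bottom tenth (BMI):    {top_vs_bottom:.2f}")
print(f"mean SD of BMI within a tenth:         {within_sd:.2f}")
print(f"overall SD of BMI:                     {overall_sd:.2f}")
```

The tenth means line up almost perfectly and the top-vs-bottom contrast looks large, yet the spread of BMI inside each tenth is nearly as wide as the spread in the whole sample. The grouped display is silent about that within-group variation.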

All of this ignores the fact that birth weight and weight in the first few years of life would increase explained variation far beyond 0.084.

The authors went on to present various comparisons of extremes and to make arbitrary designations of things like “carriers.” The further comparisons of extremes exaggerate their findings even more.

The authors chose to ignore a basic principle of statistical analysis. When one has a continuous variable (here a summary variable, GPS, since there are too many individual predictors to depict), the gold standard way to present its relationship with a dependent variable is a scatterplot, followed by a smooth predicted mean (or median) line graph supplemented with a marginal histogram of the GPS (using, say, 200 bins for GPS; or show the empirical cumulative distribution of GPS, which requires no binning). Artificial distortions of the primary data only hide what is really going on. Ignoring other design considerations, the 0.084 fraction of explained BMI variation is the bottom line, and the paper could have been much shorter. For an example scatterplot suitable for extremely large sample sizes, see Figure 4.12 in BBR.

For a trajectory analysis I would like to see a longitudinal model where the GPS is allowed to nonlinearly interact with age, e.g., have a model with GPS, a regression spline in age, and the products of all the terms contained in those two factors. Then the estimated GPS effect over time can easily be plotted.
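A rough sketch of such a model on simulated data follows. It makes several simplifying assumptions that should be stated loudly: a linear spline with a single knot stands in for a proper regression spline, ordinary least squares ignores the within-person correlation that a real longitudinal model must handle, age is expressed in decades, and every coefficient is invented. All function names are hypothetical.

```python
import random

KNOT = 2.0  # assumed knot at age 20, with age measured in decades

def design_row(gps, age):
    """Columns: 1, GPS, age, (age - knot)+, GPS*age, GPS*(age - knot)+."""
    hinge = max(0.0, age - KNOT)
    return [1.0, gps, age, hinge, gps * age, gps * hinge]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least squares via the normal equations (adequate for six columns)."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def gps_effect(beta, age):
    """Estimated difference in BMI per 1 SD of GPS at a given age (decades)."""
    return beta[1] + beta[4] * age + beta[5] * max(0.0, age - KNOT)

def true_effect(age):
    """Invented truth: the GPS effect grows quickly early, then more slowly."""
    return 0.1 + 0.6 * age - 0.5 * max(0.0, age - KNOT)

random.seed(3)
rows, bmi = [], []
for _ in range(5000):
    gps, age = random.gauss(0, 1), random.uniform(0, 6)
    mean_bmi = 16 + 3 * age - 2 * max(0.0, age - KNOT) + true_effect(age) * gps
    rows.append(design_row(gps, age))
    bmi.append(mean_bmi + random.gauss(0, 2))

beta = fit_ols(rows, bmi)
for age in (0.5, 1.0, 2.0, 4.0, 6.0):
    print(f"age {age * 10:4.0f}: estimated GPS effect {gps_effect(beta, age):.2f}")
```

Evaluating `gps_effect` over a grid of ages gives the curve to plot. With real data one would use more knots, a smoother basis, and a method that accounts for repeated measurements on the same person.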

See also these articles:


I agree with Frank’s observations, but read the paper from a different perspective: its potential for use in the prevention of disease.

From that perspective, it’s not what I was hoping for. The paper is essentially an association study. All analyses shown are about association of the PRS with different BMI-related outcomes in different populations. But association is no surprise. The PRS is created from SNPs that are associated with BMI. LDpred, if I understand the method correctly, maximizes the explained variance for the 2M SNPs, given their prior weights, LD patterns and rho. The PRS is expected to be associated; it is created that way. As in their previous paper (Khera et al, Nat Genet 2018), the authors constructed several models and selected the one with the best performance. See our comments on that here.

The strength of association with the different outcomes could be a question of interest, but here the presentation of deciles hampers meaningful comparisons. All y-axes are tailored to the range of values observed, distributions look the same, but we cannot judge whether the associations are of similar strength. One single metric for all outcomes, like variance explained, could have shown that, say, the strength of association in children is the same as in adults. Now the correlation that is given is in adults, which is 0.29, translating to an R2 of roughly 9%. I assume that R2 was lower for all other associations.

I was also surprised to see that the correlation of the 141 SNP PRS was much lower than that of the 2M PRS, translating into R2 of roughly 2% versus 9%. In last year’s paper, they compared GWAS PRS with 6.6M SNP PRS, which showed minimal improvements in AUC. The authors do not discuss how and why the improvement is so much larger here. Could it be because the PRS was fitted here on a continuous outcome rather than on dichotomous disease status as in the previous paper?

In my opinion, the question of interest is not the strength of association, but the translation to practice: can the PRS be used to identify people at high risk, and if so, how? Most of the media coverage leans in that direction too, but the current results are not helpful for that discussion. I missed two discussions here.

First, high risk of what? A large part of the paper focuses on severe obesity, not obesity, but the authors do not justify this choice. I find it hard to think how PRS as a test would be used to identify people with severe obesity (BMI>40) and not obesity (BMI>30), which were approx 2% and 25% of the participants. Why not focus on identifying high risk of obesity?

Second, how does PRS prediction compare with alternative opportunities? DNA is present at birth, but that doesn’t mean it is the predictor of choice for everything that happens later. What is the predictive ability of weight at 8yrs for adult obesity? What is the predictive ability of weight of the parents for child obesity at 8yrs? What is the predictive ability of being overweight as a teen for adult obesity? There are many alternatives that can be considered, and these may have higher predictive ability. As these cannot all be investigated, the first question should be: what do we want to prevent in whom? Let’s start there.


I’m not clear on the BMI definition used across the cohorts. Is it the last reported BMI? If so, the PRS may very well just be associated with generational cohorts and have nothing to do with genetic association. The analysis should use BMI at a fixed age (e.g., BMI at age 50) or some appropriately age-adjusted BMI measure (ignoring limitations of BMI modeling).


Or height-adjusted weight. Or ratio of weight to ideal weight. Good questions. For a paper about trajectories more detailed serial data analyses are needed.


This is a great summary of problems with quantile grouping of risk factors (in this case the GPS).

Is there a similar analysis/reference for problems that arise with grouping outcomes (in this case BMI) into bins/dichotomies/quantiles?


For the tragedy of dichotomizing continuous Y see BBR Section 18.3.4 and the graph reproduced there from the excellent paper by Fedorov et al.


In this study evaluating PRS in schizophrenia, the authors found that the PRS was better at predicting ancestry than schizophrenia.


Here is another article from Cell that discusses huge issues/limitations with portability of PRS.