This is a place to discuss the article by Amit Khera, Sekar Kathiresan, and others, following extensive discussions on Twitter.
What if I had a new way to measure current blood glucose levels, and the R^2 between the new measurement and the gold standard from a blood draw was 0.15, but the correlation between the decile of the new measurement and the mean gold-standard level within each decile was 1.0? Would I believe that the new measurement is ready for prime time?
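To make the thought experiment concrete, here is a minimal simulation sketch. It is not taken from the paper: the normal distributions, the sample size, the seed, and the helper names `pearson` and `simulate` are all assumptions chosen for illustration. It generates a "new measurement" that explains about 15% of the variation in a "gold standard," then correlates the tenth number with the mean gold-standard value in each tenth.

```python
import random


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def simulate(n=20000, r2=0.15, seed=1):
    """Simulate a weak predictor and summarize it two ways.

    Returns the individual-level R^2, the correlation between tenth
    number (1..10) and the mean outcome per tenth, and those means.
    """
    rng = random.Random(seed)
    slope = r2 ** 0.5            # gives population R^2 = r2 for unit variances
    noise_sd = (1 - r2) ** 0.5
    new = [rng.gauss(0, 1) for _ in range(n)]
    gold = [slope * x + noise_sd * rng.gauss(0, 1) for x in new]

    individual_r2 = pearson(new, gold) ** 2

    # Sort subjects by the new measurement and cut into ten equal groups.
    order = sorted(range(n), key=lambda i: new[i])
    size = n // 10
    tenth_means = []
    for d in range(10):
        members = order[d * size:(d + 1) * size]
        tenth_means.append(sum(gold[i] for i in members) / len(members))

    tenth_corr = pearson(list(range(1, 11)), tenth_means)
    return individual_r2, tenth_corr, tenth_means


individual_r2, tenth_corr, tenth_means = simulate()
print(f"individual-level R^2: {individual_r2:.3f}")
print(f"correlation of tenth number with tenth means: {tenth_corr:.3f}")
```

Under these assumptions the group-level correlation comes out close to 1 even though 85% of the individual-level variation is unexplained, because averaging within each tenth removes almost all of the noise.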
The use of a polygenic approach to assessing associations between genetic variants and obesity, without wasting information in the data in an attempt to select “winning” and “losing” variants, has merit. The high-dimensional statistical analysis of 2,100,302 genetic variants doesn’t have any obvious shortcomings, although it would be very interesting to compare the R^2 it achieves in predicting log BMI to that achieved by a naive Bayes (“stupid regression”) algorithm that estimates 2,100,302 regression coefficients univariately, then just puts them together and recalibrates them against log BMI. The data source is the UK Biobank, with 119,951 participants aged 40-69. Five genome-wide polygenic scores (GPS) were created, because 5 tuning parameters were tried. There was a “statistically significant” association of each score with BMI (p < 0.0001), a test that is not relevant. What is important is the explained variation in BMI. [Note that the analysis should probably have used log BMI throughout, but there are more important issues; see below.] The proportion of variance in BMI explained by the 5 scores ranged from 0.08 to 0.085. The top score was used in subsequent steps. [Note that the paper is a bit misleading in providing the correlation coefficients and not their squares, the latter being the proportion of explained outcome variation.]
The apparently best discriminating GPS was tested in 306,125 individuals from four independent testing datasets. First, 288,016 middle-aged individuals were analyzed, resulting in R^2 = 0.084. The paper did not describe exactly how the validation R^2 was computed, which is an important issue. It is imperative that no re-calibration be done on the GPS, i.e., that this formula be used: R^2 = 1 - \frac{SSE}{SST}, where SSE is the sum of squared prediction errors and SST is n-1 times the variance of BMI. Many researchers have made the mistake of using the usual correlation coefficient formula, which allows for an internal recalibration of the predictions, making the R^2 optimistic.
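The difference between the two ways of computing a validation R^2 can be seen in a small sketch. The numbers below are made up for illustration, and the function names `r2_no_recalibration` and `squared_correlation` are hypothetical; the sketch assumes the predictions are already on the BMI scale and come from an equation frozen on the training data.

```python
def r2_no_recalibration(observed, predicted):
    """R^2 = 1 - SSE/SST, using the predictions exactly as supplied."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    sst = sum((y - mean_obs) ** 2 for y in observed)  # (n-1) * variance
    return 1 - sse / sst


def squared_correlation(observed, predicted):
    """Squared Pearson correlation; invariant to linear rescaling of predictions."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    sxy = sum((y - mean_obs) * (p - mean_pred) for y, p in zip(observed, predicted))
    syy = sum((y - mean_obs) ** 2 for y in observed)
    spp = sum((p - mean_pred) ** 2 for p in predicted)
    return sxy ** 2 / (syy * spp)


# Made-up validation data: the predictions are perfectly ordered but
# miscalibrated (too compressed, and shifted upward on average).
observed_bmi = [20.0, 25.0, 30.0, 35.0, 40.0]
predicted_bmi = [27.0, 29.5, 32.0, 34.5, 37.0]

print(r2_no_recalibration(observed_bmi, predicted_bmi))
print(squared_correlation(observed_bmi, predicted_bmi))
```

Here the squared correlation is 1 because the predictions are a perfect linear function of the observations, while 1 - SSE/SST is 0.67 because the predictions are penalized for being miscalibrated. The squared correlation silently refits an intercept and slope in the validation data, which is the internal recalibration described above.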
Before describing some serious analytical problems with the research taking the study design and measurements for granted, it should be noted that BMI, being made up of height and weight, is a problematic variable for analysis. As described here it is better to analyze weight adjusted for height. Furthermore, both height and weight are response (dependent) variables and should be analyzed separately. It is possible, for example, for genetics to alter the relationship between height and weight in such a way that BMI is no longer a good summary of them. And it is often the case that using within-dataset normalization is better than using an external equation such as the BMI equation. That is what weight adjusted for height attempts to do. And what if the GPS is really predicting height and not weight?
Turning back to the authors’ analyses, the next step taken is where IMHO things started to go seriously wrong. The authors went to the trouble of developing a GPS, then in effect declared war on their own risk score by abandoning its absolute value and converting it into a highly nonlinearly transformed and information-losing variable: decile groups. The oddity of such a nonlinear transformation is exemplified by the property that it can be arbitrarily changed by changing subject inclusion criteria. Problems with quantile grouping are spelled out in BBR Section 10.6 and here.
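The dependence of decile groups on inclusion criteria can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the standard normal scores, the cohort size, the seed, the hypothetical helper `tenth_of`, and the deliberately extreme inclusion rule that keeps only subjects with scores above zero.

```python
import random


def tenth_of(value, cohort_scores):
    """Return which tenth (1..10) of the cohort's score distribution `value` falls in."""
    below = sum(1 for s in cohort_scores if s < value)
    fraction_below = below / len(cohort_scores)
    return min(int(fraction_below * 10) + 1, 10)


rng = random.Random(7)
full_cohort = [rng.gauss(0, 1) for _ in range(50000)]

# Hypothetical, deliberately extreme inclusion rule: keep only scores above zero.
restricted_cohort = [s for s in full_cohort if s > 0]

subject_gps = 1.0  # one subject's score; its absolute value never changes
tenth_full = tenth_of(subject_gps, full_cohort)
tenth_restricted = tenth_of(subject_gps, restricted_cohort)

print("tenth in full cohort:", tenth_full)
print("tenth in restricted cohort:", tenth_restricted)
```

The subject's score is identical in both runs, yet the group label assigned to it differs, because the tenths are defined relative to whoever else happens to be in the sample.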
Using deciles would have caused enough statistical problems had it ended there. But unfortunately the authors chose to emphasize the comparison of the top decile with the bottom decile. [Note: correct terminology is top tenth vs. bottom tenth.] This seriously exaggerates the predictive discrimination of the GPS and completely hides the degree of unexplained variation. As detailed here, explained outcome variation (e.g., the fraction 0.084 listed above) is superior for the purpose at hand. Explained variation is also non-arbitrary, whereas the choice of grouping is not. Why not 20-tiles? 15-tiles? But the real answer is that no grouping is needed at all.
All of this ignores the fact that birth weight and weight in the first few years of life would increase explained variation far beyond 0.084.
The authors went on to present various comparisons of extremes and to make arbitrary designations of things like “carriers.” The further comparisons of extremes exaggerate their findings even more.
The authors chose to ignore a basic principle of statistical analysis. When one has a continuous variable (here a summary variable, GPS, since there are too many individual predictors to depict), the gold standard way to present its relationship with a dependent variable is a scatterplot, followed by a smooth predicted mean (or median) line graph supplemented with a marginal histogram of the GPS (using, say, 200 bins for GPS; or show the empirical cumulative distribution of GPS, which requires no binning). Artificial distortions of the primary data only hide what is really going on. Ignoring other design considerations, the 0.084 fraction of explained BMI variation is the bottom line, and the paper could have been much shorter. For an example scatterplot suitable for extremely large sample sizes, see Figure 4.12 in BBR.
For a trajectory analysis I would like to see a longitudinal model where the GPS is allowed to nonlinearly interact with age, e.g., have a model with GPS, a regression spline in age, and the products of all the terms contained in those two factors. Then the estimated GPS effect over time can easily be plotted.
See also these articles: