What does it mean when a test/predictor is 'predictive on the aggregate level but not on the individual level'

I recently listened to the episode ‘What’s Wrong with the IAT? (with Jesse Singal)’ of the podcast ‘Two Psychologists Four Beers’ (https://fourbeers.fireside.fm/13). The hosts, together with their guest Jesse Singal, discuss the IAT (the Implicit Association Test, a psychometric test). At around 38:35, Jesse explains that the test explains only about 1% of the variance in behavioral outcomes and concludes, together with the hosts, that the test does not predict behavior on the individual level but only on the aggregate level (i.e. on the population level).

I wonder what that means, ‘not predictive on the individual level but on the aggregate level’ (or on average?). Is this a property of a test? Is there a cutoff value for this? Is this simply another way of saying the effect size is extremely small?

No cutoff exists for this, which is why I don’t find the concept very helpful. One can say something like “this feature or this predictive model is not discriminating enough to be useful for individual decision making, but is more useful for describing groups.” But that raises the question: compared to what? If the best available predictor explains only 1% of the variation in Y and that is all we have, we still use it to predict tendencies in Y for individual subjects. You would only set the instrument aside if someone offered a better one that explained much more of the outcome variation.
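To make this concrete, here is a minimal simulation (not from the episode; the effect size and sample size are made up for illustration) of a predictor built to explain about 1% of the variance in a continuous outcome. The aggregate contrast between high and low scorers is clearly nonzero, yet knowing an individual’s score barely narrows the spread of plausible outcomes for that person:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical test score X explaining ~1% of the variance in outcome Y:
# Y = beta * X + noise, with beta chosen so that R^2 = beta^2 = 0.01.
x = rng.normal(size=n)
beta = 0.1
y = beta * x + rng.normal(scale=np.sqrt(1 - beta**2), size=n)

r2 = np.corrcoef(x, y)[0, 1] ** 2
print(f"R^2 ~ {r2:.3f}")

# Aggregate level: mean outcome differs between high and low scorers.
high, low = y[x > 0].mean(), y[x <= 0].mean()
print(f"group mean difference ~ {high - low:.3f}")

# Individual level: conditioning on X barely shrinks the residual spread.
resid_sd = np.std(y - beta * x)
print(f"sd(Y) = {np.std(y):.3f}, sd(Y | X) ~ {resid_sd:.3f}")
```

The group contrast is real and reproducible at this sample size, but sd(Y | X) is almost as large as sd(Y), which is the usual intuition behind “aggregate but not individual” prediction.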

If the outcome variable Y is binary and the predictor explains 1% of the variation, the predicted probabilities may hover in a narrow range, say 0.001 to 0.1. For some hard-to-predict outcomes, such as being struck by lightning, that would still be seen as very helpful, and we wouldn’t automatically restrict the model to describing groups of people.
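A quick sketch of that point, with invented numbers (a rare outcome at roughly a 1% base rate and a weak logistic predictor — nothing here comes from the IAT literature): the R² on the 0/1 outcome is tiny, yet the predicted risk at the top of the score distribution is several times the risk at the bottom, which can be decision-relevant for a rare event.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical rare binary outcome with a weak predictor X.
x = rng.normal(size=n)
logit = -4.6 + 0.5 * x          # intercept puts the base rate near 1%
p = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, p)

# R^2 computed on the 0/1 outcome is tiny...
r2 = np.corrcoef(x, y)[0, 1] ** 2
print(f"R^2 ~ {r2:.4f}")

# ...yet observed risk in the top decile of X is several times the
# risk in the bottom decile.
lo, hi = np.quantile(x, [0.1, 0.9])
print(f"risk below 10th pct: {y[x < lo].mean():.4f}")
print(f"risk above 90th pct: {y[x > hi].mean():.4f}")
```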

So yes, this is related to small effect sizes.

I think a hierarchical model could answer some further questions about small effect sizes, though it depends on the model in question. In a hierarchical model, estimates for groups with small sample sizes shrink toward the population estimate, and we can pool more similar observations together. Small effects can still be estimated; the estimates will just be less certain (or closer to the prior, which here plays the role of the population distribution). A good reference for this is Gelman and Hill (2006), ‘Data Analysis Using Regression and Multilevel/Hierarchical Models’.
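The shrinkage idea can be sketched with a textbook empirical-Bayes formula (a stand-in for a full hierarchical model; all group counts and variance parameters below are invented). Each observed group mean is pulled toward the grand mean by a weight that depends on its sample size, and the pooled estimates end up closer to the true means on average:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical groups whose true means come from a population distribution
# with small between-group spread tau (a small effect).
n_groups, tau, sigma = 50, 0.2, 1.0
true_means = rng.normal(0, tau, n_groups)
sizes = rng.integers(3, 30, n_groups)        # unequal, often small samples

obs_means = np.array([rng.normal(m, sigma / np.sqrt(k))
                      for m, k in zip(true_means, sizes)])

# Partial pooling: weight on each group's own data grows with sample size.
grand = obs_means.mean()
shrink = tau**2 / (tau**2 + sigma**2 / sizes)
pooled = grand + shrink * (obs_means - grand)

# Small groups shrink the most; pooled estimates beat raw means on average.
raw_err = np.mean((obs_means - true_means) ** 2)
pooled_err = np.mean((pooled - true_means) ** 2)
print(f"MSE raw: {raw_err:.4f}, MSE pooled: {pooled_err:.4f}")
```

This is the sense in which a small effect is still “measurable”: the pooled estimates remain informative, they are just pulled toward the population distribution when a group’s own data are thin.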

I suppose it also has to do with ergodicity: whether averages taken over a population tell you anything about any single individual in it.