Very small effect size with significant p-value

I did a statistical analysis using linear mixed models to see the effect of lifestyle factors (physical activity, dietary intake and other covariates) on longitudinal weight gain among women. The data were collected at seven surveys from the same individuals (n=12,000), followed up for 19 years.
Most covariates gave sensible results, but for some diet variables I found very small effect sizes with significant p-values (e.g. carbohydrates: coef.=0.0000664, 95% CI=-0.000132, 0.000264, P-value<0.0001; fat intake: coef.=0.00005, 95% CI=0.00003, 0.00007, P-value<0.0001; total energy is similar).
I am not sure how to interpret this. Can anyone help? Is it because the sample size is large enough that even small differences are statistically significant?
Is it important to report a very small beta just because it is significant?
Or is it pointless to report a very small effect size even if it is significant?

Many thanks in advance


The coefficients aren't readily understood unless we know the units, e.g. for carbohydrates. Regarding small but statistically significant effects on weight, I'd just note that drug companies report small differences and yet get approval, e.g. 5.6 kg over 1 year will get you in the NEJM.

I agree with @pmbrown, the values of the coefficients don't tell us much about the relevance of an effect, since we don't know their units. Even when knowing the units, one needs subject matter knowledge to judge the relevance of an effect. I dislike judging the relevance by the ability to get it published (in high-ranked journals), although this is likely the common behaviour of scientists, who usually depend on their publication output.

Getting tiny p-values is not unexpected in large studies (you said n=12,000). It only shows you that the data you have are clearly sufficient to recognize the negative effects of fixing the respective coefficient at exactly zero on the model performance (to "explain" the entire body of data according to your chosen statistical model). In practice this means that the information from the data is sufficient to interpret the sign of the coefficient (not yet its actual value; only its relation to zero: is it above or below?).
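To illustrate the point about sample size, here is a minimal sketch with made-up numbers (the slope, the noise SD and the sample sizes are all assumptions, not values from the study). It uses a simple normal (Wald) approximation in which the standard error shrinks like 1/sqrt(n); a mixed model on repeated measures will not follow this exactly, but the qualitative behaviour is the same: the estimate stays tiny while the p-value collapses.

```python
import math

def two_sided_p(estimate, se):
    """Two-sided p-value from a normal (Wald) approximation."""
    z = abs(estimate) / se
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

# Hypothetical values: a fixed, tiny slope and residual SD of 1,
# with a predictor assumed to have unit variance.
slope = 0.02
sigma = 1.0

p_by_n = {}
for n in (100, 1_000, 12_000, 100_000):
    se = sigma / math.sqrt(n)  # SE shrinks like 1/sqrt(n)
    p_by_n[n] = two_sided_p(slope, se)
    print(f"n={n:>7}  SE={se:.4f}  p={p_by_n[n]:.3g}")
```

The estimate is identical in every row; only the precision changes, so the p-value says nothing about whether 0.02 is a relevant amount.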

It may be more relevant to have a look at the entire CI (or "compatibility interval", as Frank would call it, supposedly). If this does not extend into regions that are of practical relevance, your data are sufficient to "rule out" that this variable has a relevant effect on the weight gain, no matter what the p-value is. Note that this also makes quite clear why a "non-significant" p-value does not demonstrate that there is no effect: the CI may well extend into relevant regions, meaning that your data are too noisy to rule out such relevant effects (they are not even sufficient to rule out that the sign is different from the sign of the estimated coefficient).
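The comparison of the interval with a region of practical relevance can be sketched as below. The function name `judge_interval` and the thresholds are hypothetical; what counts as "relevant" has to come from subject matter knowledge and must be in the same units as the coefficient, which are not known here.

```python
def judge_interval(lower, upper, relevant):
    """Compare a CI with the smallest absolute effect considered relevant.

    `relevant` is a positive threshold chosen on subject matter grounds.
    """
    if -relevant < lower and upper < relevant:
        return "ruled out: all compatible values are below practical relevance"
    if lower >= relevant or upper <= -relevant:
        return "relevant: all compatible values are of practical relevance"
    return "inconclusive: interval spans relevant and irrelevant values"

# Fat-intake interval from the question, with a purely hypothetical threshold.
narrow = judge_interval(0.00003, 0.00007, relevant=0.001)

# A "non-significant" but wide interval (made-up numbers).
wide = judge_interval(-0.5, 2.0, relevant=1.0)

print(narrow)
print(wide)
```

The first case is "significant" yet irrelevant under the assumed threshold; the second is "non-significant" yet cannot exclude a relevant effect.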


My guess is that your coefficients are estimated per 1 g increment of carbohydrate or fat, which is not a practically relevant increment. You could probably at least multiply the observed effect by 50 to get a relevant contrast.
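As a small worked example of that rescaling: in a linear model the coefficient and both interval bounds scale by the same factor, so a per-50 g contrast is just the per-1 g numbers times 50. The sketch assumes the reported fat coefficient really is per 1 g (a guess, as said above); the helper name `rescale` is made up for illustration.

```python
def rescale(coef, ci_low, ci_high, increment):
    """Re-express a linear coefficient and its CI per `increment` units."""
    scaled = coef * increment
    low, high = sorted((ci_low * increment, ci_high * increment))
    return scaled, low, high

# Fat intake as reported in the question, assumed to be per 1 g.
fat_per_50g = rescale(0.00005, 0.00003, 0.00007, increment=50)
print("per 50 g: coef=%.4f, 95%% CI=%.4f to %.4f" % fat_per_50g)
```

The p-value is unchanged by this, since the estimate and its standard error are multiplied by the same constant.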

Also, when working with self-reported dietary intake data, it is important to consider all the potential sources of measurement error. The recommended approach is to apply some form of energy adjustment, and for macro-nutrients the density method is the most commonly used method. Also, including total energy intake as a covariate is most often recommended, and whether this is done or not is crucial when interpreting the results because you estimate different effects.
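For readers unfamiliar with the density method, here is a minimal sketch: each macronutrient is expressed as a percentage of total energy intake rather than in grams. The energy factors are the general 4/9/4 kcal per gram values; the intake figures are invented, and the function name is only illustrative.

```python
# General energy conversion factors in kcal per gram.
KCAL_PER_GRAM = {"carbohydrate": 4.0, "fat": 9.0, "protein": 4.0}

def energy_density_pct(grams, total_kcal, nutrient):
    """Percentage of total energy intake supplied by a macronutrient."""
    if total_kcal <= 0:
        raise ValueError("total energy intake must be positive")
    return 100.0 * grams * KCAL_PER_GRAM[nutrient] / total_kcal

# Hypothetical daily intake: 250 g carbohydrate, 70 g fat, 2000 kcal in total.
carb_pct = energy_density_pct(250, 2000, "carbohydrate")
fat_pct = energy_density_pct(70, 2000, "fat")
print(f"carbohydrate: {carb_pct:.1f}% of energy, fat: {fat_pct:.1f}% of energy")
```

A coefficient for such a variable is then read per percentage point of energy, which is a different estimand from a per-gram coefficient.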

As noted in other replies, you'll want to rescale your predictors to some meaningful interval. Some people like to scale to +/- 1 SD above and below the mean, but if the predictor is skewed, using the SD may result in predictor values that don't actually exist. For example, suppose the minimum value of the predictor is 0 and the distribution is right skewed with, say, a mean of 2 but an SD of 4. If you used +/- 1 SD, you'd be basing the prediction on comparing a score of -2 to a score of 6.

A better alternative is to scale continuous predictors to the interquartile range. This yields a regression coefficient that is the expected difference in Y comparing a case at the 75th percentile of the predictor to a case at the 25th percentile. In other words, it compares a "high" case right in the middle of the upper half of the distribution to a "low" case right in the middle of the lower half of the distribution. The nice thing about this approach is that those values of the predictor will always be values that are possible.
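A rough sketch of the difference between the two scalings, using simulated data (the log-normal predictor and the per-unit coefficient are assumptions made up for the illustration):

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed, strictly positive predictor.
x = [random.lognormvariate(0.0, 1.5) for _ in range(10_000)]

mean = statistics.fmean(x)
sd = statistics.stdev(x)
q1, median, q3 = statistics.quantiles(x, n=4)  # quartile cut points

low_sd, high_sd = mean - sd, mean + sd  # +/- 1 SD contrast
iqr = q3 - q1                           # 75th minus 25th percentile

# Hypothetical per-unit coefficient, re-expressed per IQR.
coef_per_unit = 0.05
coef_per_iqr = coef_per_unit * iqr

print(f"observed range : {min(x):.3f} to {max(x):.1f}")
print(f"mean -/+ 1 SD  : {low_sd:.2f} to {high_sd:.2f}")
print(f"25th to 75th   : {q1:.2f} to {q3:.2f}")
print(f"effect per IQR : {coef_per_iqr:.4f}")
```

With a predictor this skewed, the lower end of the SD contrast falls below zero, a value that cannot occur, while both quartiles sit inside the observed data.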

Apart from interpretation, when numbers start getting out to a lot of significant digits, some software may run into issues with precision.

Regarding small but statistically significant effects on weight, I'd just note that drug companies report small differences and yet get approval, e.g. 5.6 kg over 1 year will get you in the NEJM

A little bit off-topic, but a difference of 5.6 kg (8.4 vs 2.8 kg) weight loss after a year isn’t a small difference. On a starting weight of 100 kg (~starting weight in the trial), this is more than 5% of the initial body weight. Clinically, a weight loss of >5% is considered a success, and by this standard, the difference observed in this study is clinically meaningful.

Further, the effect size shouldn’t be a criterion for publication in my opinion. Even if the effect was negligible, if the design was good, why shouldn’t it be published in NEJM?

In the NEJM that 'result' becomes marketing. The magnitude of the effect will be a marker for publication. It's not so much a worry about unpublished studies, because a drug company won't run multiple large, multi-million dollar trials until it snags a result. Re 5 kg, I'm sure the self-isolating encouraged by corona is leading many people to gain/lose 5 kg in weeks. But I realise 5 kg is considered a 'success'; I have seen them boast about a 0.5 kg difference.

Since nobody mentioned it: the confidence interval for the first coefficient includes zero.

Thanks so much, all, for your explanations and discussion! I realized the issue was due to a unit conversion mistake; it is now fixed and the effect sizes are reasonable.

Apart from the answers already given, isn’t this an example of the Jeffreys-Lindley effect? With high power you need a very small p value to attain a reasonable false positive risk.
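One way to see this effect numerically is the textbook normal-mean setup below. It is only a sketch under assumptions that are not from the thread: a point null, a N(0, tau^2) prior on the effect under the alternative, and known noise SD sigma. Holding the z-statistic at 1.96 (two-sided p of about 0.05), the Bayes factor in favour of the null, sqrt(1 + r) * exp(-z^2/2 * r/(1 + r)) with r = n * tau^2 / sigma^2, grows with n.

```python
import math

def bf01(z, n, tau=1.0, sigma=1.0):
    """Bayes factor for H0: theta = 0 against H1: theta ~ N(0, tau^2).

    Assumes a sample mean of n observations with known SD `sigma`;
    `tau` is an assumed prior SD, and the result depends on it.
    """
    r = n * tau**2 / sigma**2
    return math.sqrt(1 + r) * math.exp(-0.5 * z**2 * r / (1 + r))

z = 1.96  # the same "just significant" result at every sample size
for n in (10, 100, 12_000, 1_000_000):
    print(f"n={n:>9}  BF01={bf01(z, n):.2f}")
```

Values above 1 favour the null, so under these assumptions the same p-value that mildly favours the alternative at n=10 favours the null at n=12,000.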