Possible incorrect linreg aggregator results in 0.2.29 - 0.2.37

patrick-schultz · April 17, 2020, 8:26pm

We just fixed a bug in the linreg aggregator that has existed since release 0.2.29, and which will be fixed in 0.2.38. (It is fixed by this PR, which will merge into master shortly.)
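As a rough way to check whether an installed release falls in the affected range, the sketch below compares a version string against 0.2.29 through 0.2.37. The helper names are hypothetical, and it assumes the version string looks like "0.2.37" with an optional "-<hash>" suffix; adjust the parsing if yours differs.

```python
def parse_version(version):
    """Turn '0.2.37' or '0.2.37-abcdef' into a tuple of ints (hypothetical helper)."""
    core = version.split("-", 1)[0]  # assumption: any build hash follows a '-'
    return tuple(int(part) for part in core.split("."))


def is_affected_version(version):
    """True if the version lies in 0.2.29 through 0.2.37, the range named in this post."""
    return (0, 2, 29) <= parse_version(version) <= (0, 2, 37)


print(is_affected_version("0.2.31"))
print(is_affected_version("0.2.38"))
```

Pass in whatever version string your installation reports.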

First off, any regressions containing at most 8 covariates (including the intercept), and where the intercept is first in the list of covariates, are not affected by this bug. E.g. hl.agg.linreg(mt.y, [1, mt.x]) is safe.
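The safe case above can be written as a small check. This is only a sketch of the rule as stated here; the function name and arguments are made up for illustration and are not part of Hail.

```python
def in_safe_category(num_covariates, intercept_first):
    """True if a linreg call matches the unaffected case described above.

    num_covariates counts the intercept; intercept_first says whether the
    literal intercept (e.g. 1) is the first entry in the covariate list.
    """
    return num_covariates <= 8 and intercept_first


# hl.agg.linreg(mt.y, [1, mt.x]) has 2 covariates with the intercept first.
print(in_safe_category(2, True))
```

A False result does not mean your results are wrong, only that the call is outside the case known to be safe.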

If the list of covariates is [x0, x1, ...], then the bug causes rows to be included in the regression even if x0, or any of x8, x9, ..., are missing. In that case, uninitialized memory is read in place of the missing covariates, so they could take any value.
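To make the affected positions concrete, here is a minimal sketch that lists the covariate indices whose missingness is ignored according to the description above. It only covers lists of up to 32 covariates, and the function name is hypothetical.

```python
def unchecked_positions(k):
    """Indices in [x0, ..., x(k-1)] whose missingness is ignored, per the post.

    Assumption: only valid for 1 <= k <= 32; longer lists follow a
    different rule and are rejected here.
    """
    if not 1 <= k <= 32:
        raise ValueError("this sketch only covers 1 to 32 covariates")
    # x0 is always in the list; x8 onward only exist when k > 8.
    return [0] + list(range(8, k))


print(unchecked_positions(10))  # [0, 8, 9]
```

If none of the covariates at those positions can ever be missing in your data, no row would have been wrongly included.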

In regressions with more than 32 covariates, this bug only affects the tail of the list. Specifically, if there are k covariates, the last k % 32 (the remainder after dividing by 32) are affected. If any covariates outside that tail are missing, the row will be correctly skipped.
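For longer covariate lists, the tail can be computed directly. The sketch below takes the description literally (the last k % 32 covariates) and uses a hypothetical function name.

```python
def affected_tail(k):
    """Indices of the last k % 32 covariates in a list of k covariates."""
    tail_len = k % 32
    return list(range(k - tail_len, k))


print(affected_tail(40))  # [32, 33, 34, 35, 36, 37, 38, 39]
```

Note that when k is an exact multiple of 32 the remainder is 0, so this rule gives an empty tail.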

This is a subtle and concerning bug. If you have any questions about whether any of your pipelines have been affected by this, please don’t hesitate to ask.

tpoterba · April 17, 2020, 8:37pm

Note that linear_regression_rows is not affected by this bug.

Also note that this is likely to result in type II error (p-values closer to 1 than the truth) rather than type I error (spurious small p-values).
