We just fixed a bug in the linreg aggregator that has been present since release 0.2.29 and will be fixed in 0.2.38. (It is fixed by this PR, which will merge into master shortly.)
First off, any regression with at most 8 covariates (including the intercept), where the intercept is first in the list of covariates, is not affected by this bug. E.g. hl.agg.linreg(mt.y, [1, mt.x]) is safe.
If the list of covariates is [x0, x1, ...], then the bug causes rows to be included in the regression even if x0, or any of x8, x9, ..., are missing. In that case uninitialized memory, which could hold any value, is used in place of the missing covariates.
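For reference, here is a minimal plain-Python sketch of the behavior the aggregator is supposed to have: a row contributes to the regression only if every covariate is present. This is not Hail code or Hail's implementation; the helper name complete_rows and the use of None to stand for a missing value are assumptions made purely for illustration.

```python
from typing import List, Optional, Sequence

# One row of covariate values; None stands in for a missing value
# (an illustrative convention, not how Hail represents missingness).
Row = Sequence[Optional[float]]


def complete_rows(rows: Sequence[Row]) -> List[Row]:
    """Keep only the rows in which every covariate is present."""
    return [row for row in rows if all(value is not None for value in row)]


rows = [
    (1.0, 0.5, 2.0),   # complete: kept
    (None, 0.5, 2.0),  # x0 missing: must be skipped
    (1.0, None, 2.0),  # x1 missing: must be skipped
]
kept = complete_rows(rows)
```

With the bug, a row like the second one above could be kept anyway, with an arbitrary value standing in for x0.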
In regressions with more than 32 covariates, this bug only affects the tail of the list. Specifically, if there are k covariates, the last k % 32 (the remainder after dividing by 32) are affected. If any covariates outside that tail are missing, the row will be correctly skipped.
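To make the tail rule concrete, the small sketch below computes which positions in the covariate list fall in that tail. The function name affected_tail and the 0-based indexing are assumptions for illustration, and the function only covers the case described here (more than 32 covariates).

```python
def affected_tail(k: int) -> range:
    """0-based positions of the last k % 32 covariates in a list of k."""
    if k <= 32:
        # The tail rule above is stated only for more than 32 covariates.
        raise ValueError("tail rule applies only when k > 32")
    return range(k - k % 32, k)


tail_40 = affected_tail(40)  # 40 % 32 == 8, so positions 32 through 39
tail_64 = affected_tail(64)  # 64 % 32 == 0, so the tail is empty
```

So with 40 covariates, only a missing value among the last 8 would fail to cause the row to be skipped, and with exactly 64 covariates the rule gives an empty tail.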
This is a subtle and concerning bug. If you have any questions about whether any of your pipelines have been affected by this, please don’t hesitate to ask.