Hi all,

I was wondering: how is linear regression implemented in Hail? Could someone point me to where the actual computation of the weights happens? Most importantly, *how* are covariates treated? Are they simply treated as just another component in the linear equation, or do they get special treatment (e.g. is a separate model fitted on just the covariates first)? This is not entirely clear from the documentation.

Thanks

Andreas

Hi Andreas,

I assume you’re referring to `linear_regression_rows`? Unfortunately the implementation is a bit hard to read. Mathematically, the covariates don’t get any special treatment: they’re just another variable in the model alongside the genotype. However, instead of independently performing a multivariate regression on every row, we take advantage of the fact that the covariates are constant across rows, and do something like fitting a separate model on the covariates once before performing the per-row regressions. But again, this is just an optimization, and it is mathematically equivalent to fitting a standard linear model independently per row.
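A minimal NumPy sketch of this kind of optimization (illustrative only, not Hail's actual code; the function name and structure are made up): the covariate work is hoisted out of the per-row loop by building an orthonormal basis for the covariate space once, residualizing the phenotype and all genotype rows against it, and then reducing each per-row fit to a cheap univariate regression.

```python
import numpy as np

def fit_per_row_with_covariates(G, y, C):
    """Per-row genotype slope, adjusting for covariates C.

    G: (n_variants, n_samples) genotype matrix
    y: (n_samples,) phenotype vector
    C: (n_samples, n_covariates) covariate matrix (should include an intercept column)
    """
    # One-time work: orthonormal basis Q for the covariate column space.
    Q, _ = np.linalg.qr(C)
    # Residualize the phenotype once, and all genotype rows in one shot.
    y_res = y - Q @ (Q.T @ y)
    G_res = G - (G @ Q) @ Q.T
    # Per-row work is now just a univariate regression on the residuals.
    return (G_res * y_res).sum(axis=1) / (G_res ** 2).sum(axis=1)
```

By the Frisch–Waugh–Lovell theorem, each slope returned here equals the genotype coefficient you would get from a joint least-squares fit of `y` on the genotype together with all covariates, which is why the shortcut is a pure optimization rather than a change of model.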

Does that answer your question?

Best,

Patrick

Could you clarify what “row” refers to here? If we assume a data matrix X where each row is a patient and each column is a variant/feature, why are covariates constant? If, for example, I perform PCA and use the scores as covariates, wouldn’t the first, second, third, etc. PC scores differ from one patient to another?

I was referring to the `linear_regression_rows` method, which performs a linear regression per row of a matrix table. It is most often used where each row is a variant and each column is a sample/patient. This has historically been the standard representation of genetic data in Hail, because a matrix table is partitioned/distributed across rows, and there have typically been many more variants than samples (though that is becoming less true). In this case, the covariates (which in `linear_regression_rows` must be column fields) are constant across rows/variants/features.

Thank you! What if there is correlation between the covariates and some of the remaining variables? Wouldn’t fitting a separate model with only the covariates “break” this collinearity? Is it mathematically guaranteed that the model is fitted in such a way that, if you were to fit a single model to all of the variables plus the covariates together, you would end up with the same weights/betas for all variables and all covariates?

Yes, it is mathematically guaranteed that what we compute is the same thing you would get by fitting a single model to all variables and covariates together.
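This equivalence is the Frisch–Waugh–Lovell theorem. A quick numerical check (illustrative, not Hail code): the genotype coefficient from a joint fit of the phenotype on the genotype plus covariates equals the slope from regressing the covariate-residualized phenotype on the covariate-residualized genotype, even when the genotype is correlated with the covariates.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
C = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
g = C[:, 1] + rng.normal(size=n)   # genotype deliberately correlated with a covariate
y = 0.5 * g + C @ np.array([1.0, -0.3, 0.7]) + rng.normal(size=n)

# Joint fit: y ~ genotype + covariates, all in one model.
beta_joint = np.linalg.lstsq(np.column_stack([g, C]), y, rcond=None)[0][0]

# Residualize both g and y against the covariates, then fit univariately.
P = C @ np.linalg.pinv(C)          # projection onto the covariate column space
g_res, y_res = g - P @ g, y - P @ y
beta_fwl = (g_res @ y_res) / (g_res @ g_res)

assert np.isclose(beta_joint, beta_fwl)
```

Intuitively, the residualization does not “break” the collinearity; it accounts for it exactly, so the per-row shortcut and the joint model agree.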

Jon Bloom described some of this in sections 1–3 of this arXiv pre-print.

Thank you both! That answers my question.