Linear regression explanation

ag14774 · June 26, 2023, 10:33am

Hi all,

I was wondering how linear regression is implemented in Hail. Could someone point me to where the actual computation of the weights happens? Most importantly, how are covariates treated? Are they simply treated as just another component in the linear equation, or do they get any special treatment (e.g. is a separate model fitted on just the covariates first)? This is not entirely clear from the documentation.

Thanks
Andreas

patrick-schultz · June 26, 2023, 1:45pm

Hi Andreas,

I assume you’re referring to linear_regression_rows? Unfortunately the implementation is a bit hard to read. Mathematically, the covariates don’t get any special treatment; they’re just additional variables in the model alongside the genotype. However, instead of independently performing a multiple regression on every row, we take advantage of the fact that the covariates are constant across rows, and do something like fitting a separate model once before performing the per-row regressions. But again, this is just an optimization, and it is mathematically equivalent to fitting a standard linear model independently per row.
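To make the optimization described above concrete, here is a minimal NumPy sketch. It is not Hail's actual code: the function names `per_row_betas_projected` and `per_row_betas_full`, the simulated data, and the use of a QR decomposition are all assumptions made for illustration. The covariates are handled once (by building an orthonormal basis for their column space), every row and the phenotype are residualized against that basis, and the per-row coefficient is then a simple ratio. The result is compared with a full least-squares fit done independently for each row.

```python
import numpy as np


def per_row_betas_projected(X, y, C):
    """Per-row coefficient using a one-time projection of the covariates.

    X: (n_rows, n_samples), one row per variant/feature (hypothetical layout)
    y: (n_samples,) phenotype
    C: (n_samples, k) covariates, including an intercept column
    """
    # Orthonormal basis for the covariate column space, computed once.
    Q, _ = np.linalg.qr(C)
    # Remove the covariate component from y and from every row of X.
    y_res = y - Q @ (Q.T @ y)
    X_res = X - (X @ Q) @ Q.T
    # One-dimensional regression of residualized y on each residualized row.
    return (X_res @ y_res) / np.einsum("ij,ij->i", X_res, X_res)


def per_row_betas_full(X, y, C):
    """Reference: fit the full model [x, C] independently for every row."""
    betas = []
    for x in X:
        design = np.column_stack([x, C])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        betas.append(coef[0])  # coefficient on x
    return np.array(betas)


rng = np.random.default_rng(0)
n_samples, n_rows = 200, 50
pc = rng.normal(size=n_samples)  # stand-in for a PC score
C = np.column_stack([np.ones(n_samples), pc])
# Rows are deliberately correlated with the covariate.
X = rng.binomial(2, 0.3, size=(n_rows, n_samples)).astype(float) + 0.5 * pc
y = 0.8 * X[0] + 1.5 * pc + rng.normal(size=n_samples)

fast = per_row_betas_projected(X, y, C)
slow = per_row_betas_full(X, y, C)
print(np.allclose(fast, slow))
```

Note that each row is residualized as well as the phenotype. That is why the two approaches agree even when a row is correlated with the covariates.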

Does that answer your question?

Best,
Patrick

ag14774 · June 26, 2023, 1:56pm

Could you clarify what “row” refers to here? If we assume a data matrix X where each row is a patient and each column is a variant/feature, why are covariates constant? If, for example, I perform PCA and get scores to use as covariates, wouldn’t the first, second, third, etc. PC scores differ from one patient to the next?

patrick-schultz · June 26, 2023, 2:05pm

I was referring to the linear_regression_rows method, which performs a linear regression per row of a matrix table. It is most often used where each row is a variant and each column is a sample/patient. This has historically been the standard representation of genetic data in Hail, because a matrix table is partitioned/distributed across rows, and there have typically been many more variants than samples (though that is becoming less true). In this case, covariates (which in linear_regression_rows must be column fields) are constant across rows/variants/features: each sample has its own covariate values, but those values are the same for every variant being tested.

ag14774 · June 26, 2023, 3:37pm

Thank you! What if there is correlation between the covariates and some of the remaining variables? Wouldn’t fitting a separate model with only the covariates “break” this collinearity? Is it mathematically guaranteed that the model is fitted in such a way that, if you were to fit a single model to all of the variables + covariates together, you would end up with the same weights/betas for all variables and all covariates?

patrick-schultz · June 26, 2023, 4:12pm

Yes, it is mathematically guaranteed that what we compute is the same thing you would get by fitting a single model to all variables and covariates together.
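For reference, a guarantee of this kind is usually stated as the Frisch-Waugh-Lovell theorem. The notation below is an assumption for illustration (the thread does not spell out the formulas Hail uses): x is one row's genotype vector across the n samples, C is the n × k covariate matrix including the intercept, and [x C] is assumed to have full column rank.

```latex
% Joint model for a single row:
y = x\beta + C\gamma + \varepsilon

% Projection onto the orthogonal complement of the covariate column space:
M_C = I - C\,(C^\top C)^{-1} C^\top

% The coefficient on x from the joint least-squares fit equals the
% coefficient from regressing the residualized y on the residualized x:
\hat\beta = \frac{x^\top M_C\, y}{x^\top M_C\, x}

% The covariate coefficients of the joint fit are then recovered as:
\hat\gamma = (C^\top C)^{-1} C^\top \left( y - x\hat\beta \right)
```

Because M_C is applied to x as well as to y, any correlation between x and the covariates is accounted for. Only the part of x that the covariates cannot explain contributes to the estimate, exactly as in the joint fit.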

danking · June 26, 2023, 8:23pm

Jon Bloom described some of this in sections 1-3 of this arXiv pre-print.

ag14774 · June 27, 2023, 7:47am

Thank you both! That answers my question
