I was wondering, how is linear regression in Hail implemented? Could someone point me to where the actual computation of the weights happens? Most importantly, how are covariates treated? Are they simply treated as just another component in the linear equation, or do they get any special treatment (e.g. is a separate model fitted on just the covariates first)? This is not entirely clear in the documentation.
I assume you’re referring to linear_regression_rows? Unfortunately, the implementation is a bit hard to read. Mathematically, the covariates don’t get any special treatment; they’re just another variable in the model alongside the genotype. However, instead of independently performing a multivariate regression on every row, we take advantage of the fact that the covariates are constant across rows, and do something like fitting a separate model once before performing the per-row regressions. But again, this is just an optimization, and is mathematically equivalent to fitting a standard linear model independently per row.
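To make the equivalence concrete, here is a minimal NumPy sketch for a single row. It is not Hail's code: the variable names, the simulated data, and the use of a QR decomposition to project out the covariates are assumptions chosen for illustration. It compares the genotype coefficient from a joint fit on [x, C] with the one obtained by first removing the covariates from both x and y.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_cov = 200, 3

# Covariate matrix C (intercept + 3 covariates); x is made to correlate with C.
C = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, n_cov))])
x = 0.5 * C[:, 1] + rng.normal(size=n_samples)
y = 0.3 * x + C @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=n_samples)

# (a) Standard fit: regress y on x and the covariates jointly.
#     The first coefficient is the one on x.
beta_full = np.linalg.lstsq(np.column_stack([x, C]), y, rcond=None)[0][0]

# (b) Two-step fit: project the covariates out of x and y,
#     then run a one-variable regression on the residuals.
Q, _ = np.linalg.qr(C)          # orthonormal basis for the covariate space
x_res = x - Q @ (Q.T @ x)
y_res = y - Q @ (Q.T @ y)
beta_proj = (x_res @ y_res) / (x_res @ x_res)

print(beta_full, beta_proj)     # the two estimates agree up to rounding
```

Note that x is residualized as well as y; that is what keeps the result correct when the genotype is correlated with the covariates.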
Does that answer your question?
Could you clarify what “row” refers to here? If we assume a data matrix X where each row is a patient and each column is a variant/feature, why are the covariates constant? If, for example, I perform PCA and use the scores as covariates, wouldn’t the first, second, third, etc. PC scores differ from one patient to the next?
I was referring to the linear_regression_rows method, which performs a linear regression per row of a matrix table. This is most often used where each row is a variant, and each column is a sample/patient. This has historically been the standard representation of genetic data in Hail, because a matrix table is partitioned/distributed across rows, and there have typically been many more variants than samples (though that is becoming less true). In this case, covariates (which in linear_regression_rows must be column fields) are constant across rows/variants/features: each covariate has one value per sample, and that same vector of values is reused in the regression for every variant.
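The layout can be sketched in plain NumPy as follows. This only illustrates the idea and is not how Hail is written: the matrix G, the stand-in PC scores, and the vectorized projection are all assumptions. The covariate matrix has one row per sample, so the covariate-only work is done once and then shared by every variant row.

```python
import numpy as np

rng = np.random.default_rng(1)
n_variants, n_samples = 50, 120

# Rows are variants, columns are samples.
G = rng.integers(0, 3, size=(n_variants, n_samples)).astype(float)

# Covariates are per-sample values: an intercept plus two stand-in PC scores.
# They differ between samples but are identical for every variant row.
C = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, 2))])
y = 0.4 * G[0] + rng.normal(size=n_samples)

# Covariate-only work, done once for all rows.
Q, _ = np.linalg.qr(C)
y_res = y - Q @ (Q.T @ y)

# Per-row work, vectorized: residualize every row of G against the covariates,
# then compute a one-variable regression coefficient per row.
G_res = G - (G @ Q) @ Q.T
betas = (G_res @ y_res) / np.einsum("ij,ij->i", G_res, G_res)

# Reference: an independent joint fit of y on [g, C] for each row g.
betas_ref = np.array([
    np.linalg.lstsq(np.column_stack([g, C]), y, rcond=None)[0][0] for g in G
])

print(np.allclose(betas, betas_ref))
```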
Thank you! What if there is correlation between the covariates and some of the remaining variables? Wouldn’t fitting a separate model with only the covariates “break” this collinearity? Is it mathematically guaranteed that the model is fitted in such a way that, if you were to fit a single model to all of the variables + covariates together, you would end up with the same weights/betas for all variables and all covariates?
Yes, it is mathematically guaranteed that what we compute is the same thing you would get by fitting a single model to all variables and covariates together.
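The standard result behind a guarantee of this kind is the Frisch-Waugh-Lovell theorem. A short sketch, in notation assumed here rather than taken from the Hail source (y is the phenotype vector, x one variant row, C the covariate matrix with full column rank, and M the projection onto the orthogonal complement of the column space of C):

```latex
\begin{aligned}
y &= x\beta + C\gamma + \varepsilon, \qquad M = I - C\,(C^\top C)^{-1} C^\top, \\
\hat\beta &= \frac{(Mx)^\top (My)}{(Mx)^\top (Mx)} = \frac{x^\top M y}{x^\top M x}, \\
\hat\gamma &= (C^\top C)^{-1} C^\top \,(y - x\hat\beta).
\end{aligned}
```

Here the estimates are exactly the least-squares solution of the joint model. Correlation between x and the covariates is accounted for because M removes from x the part explained by C, and the covariate coefficients still depend on x through the last line, so they are not frozen by the covariate-only step.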
Jon Bloom described some of this in sections 1-3 of this arXiv preprint.
Thank you both! That answers my question.