[Feature] Chained linear regression



We’ve added a new capability to linear_regression_rows: the ability to run multiple regressions with different missingness patterns in a single call to the function.

The old (and preserved!) behavior is that passing a list of phenotypes to linear_regression_rows will fit the phenotypes in parallel, but with the caveat that the samples used are the ones for which all phenotypes and covariates are non-missing, i.e. the “intersection” of samples.

For example, if pheno1 and pheno2 are such that no sample has both pheno1 and pheno2 defined, then the following code will drop all samples and fail.

result = hl.linear_regression_rows(
    y=[mt.pheno1, mt.pheno2],
    x=mt.GT.n_alt_alleles(),
    covariates=[1, mt.cov1])

The new behavior is that you can now pass a list of lists (i.e., groups) as the y parameter. Each group of phenotypes is run on the intersection of samples as above, but distinct groups are treated independently with respect to sample missingness. For example, the following code will instead regress each phenotype on the subset of samples for which that phenotype (alone) and all covariates are defined:

result = hl.linear_regression_rows(
    y=[[mt.pheno1], [mt.pheno2]],
    x=mt.GT.n_alt_alleles(),
    covariates=[1, mt.cov1])
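
To make the difference concrete, here is a toy sketch in plain Python of the sample-selection rule described above. It is not Hail code and does not reflect how Hail implements this internally; the `samples` data and the `samples_used` helper are made up for illustration, with None standing in for a missing value.

```python
# Toy model of the sample-selection rule only; this is not Hail code.
# None stands in for a missing value. All data here is hypothetical.
samples = {
    "s1": {"pheno1": 1.0, "pheno2": None, "cov1": 0.3},
    "s2": {"pheno1": None, "pheno2": 2.0, "cov1": 0.1},
    "s3": {"pheno1": 0.5, "pheno2": None, "cov1": None},
    "s4": {"pheno1": None, "pheno2": 1.5, "cov1": 0.7},
}


def samples_used(group, covariates):
    """Return IDs of samples where every phenotype in the group
    and every covariate is defined (the "intersection")."""
    needed = list(group) + list(covariates)
    return [
        sample_id
        for sample_id, row in samples.items()
        if all(row[name] is not None for name in needed)
    ]


# y=[pheno1, pheno2]: one group, so one intersection over both phenotypes.
flat = samples_used(["pheno1", "pheno2"], ["cov1"])

# y=[[pheno1], [pheno2]]: each group gets its own intersection.
chained = [samples_used(group, ["cov1"]) for group in [["pheno1"], ["pheno2"]]]

print(flat)
print(chained)
```

In this toy data no sample has both phenotypes, so the single-group call keeps nothing, while each one-phenotype group keeps the samples where that phenotype and cov1 are defined.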

Here’s a more interesting example that computes, for a group of phenotypes, the result of linear regression for each phenotype on the intersection of samples overall, as well as results stratified by sex:

phenos = [mt.pheno1, mt.pheno2, mt.pheno3, ...]
# Set each phenotype to missing outside the stratum, so those samples drop out of that group.
male_only = [hl.case().when(~mt.is_female, pheno).or_missing() for pheno in phenos]
female_only = [hl.case().when(mt.is_female, pheno).or_missing() for pheno in phenos]

result = hl.linear_regression_rows(
    y=[phenos, male_only, female_only],
    x=mt.GT.n_alt_alleles(),
    covariates=[1, mt.cov1])
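
The stratification works purely through missingness: masking a phenotype to missing outside a stratum removes those samples from that group's regression. The plain-Python sketch below mimics that masking on made-up data. The `rows`, `mask`, and `defined_ids` names are hypothetical, and None again stands in for a missing value; it is an illustration of the idea, not of Hail's internals.

```python
# Toy illustration of stratifying by masking values to missing (None).
# This is not Hail code; the rows below are hypothetical.
rows = [
    {"id": "s1", "is_female": True, "pheno1": 1.2},
    {"id": "s2", "is_female": False, "pheno1": 0.4},
    {"id": "s3", "is_female": True, "pheno1": None},
    {"id": "s4", "is_female": False, "pheno1": 2.0},
]


def mask(keep):
    """Keep pheno1 where keep(row) is true; otherwise mark it missing."""
    return [row["pheno1"] if keep(row) else None for row in rows]


def defined_ids(values):
    """IDs of the samples whose (possibly masked) phenotype is defined."""
    return [row["id"] for row, value in zip(rows, values) if value is not None]


overall = [row["pheno1"] for row in rows]
male_only = mask(lambda row: not row["is_female"])
female_only = mask(lambda row: row["is_female"])

# Each list plays the role of one group in y=[phenos, male_only, female_only].
print(defined_ids(overall))
print(defined_ids(male_only))
print(defined_ids(female_only))
```

Each masked copy is defined only within its stratum, so passing the copies as separate groups yields one regression per stratum alongside the overall one.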

Chained logistic regression