Modifying variables within hl.agg.linreg

I am trying to perform linear regression in hail, i.e.:

   mt = mt.annotate_cols(
        lm = hl.agg.linreg(mt.y, [mt.x1, 1]))

but would like to make a small modification. I have two column fields, a and b, that I would like to incorporate in the following way:

  • add one value,a, to the end of y
  • add a second predictor variable, x2, that has 0 in all rows except the last one, where it equals b (corresponding to a at the end of y)
  • add a 0 to the end of x1

Naively, I would think to achieve this with something like:
add a row field called “dummy” that is all zeros, and run:

   mt = mt.annotate_cols(
        lm = hl.agg.linreg(mt.y.append(a), [mt.x1.append(0),mt.dummy.append(b), 1]))

However, it appears that I cannot simply append a value within hl.agg.linreg. Another solution would be to add an entry containing a and b in the appropriate fields, but I also cannot find a simple way to add an entry to the MatrixTable.

Is there a simple way to achieve my goal? In case it helps clarify what I’m asking, the rough idea is that x1 corresponds to gene-level statistics that I am regressing on y, and x2 corresponds to a genome-wide statistic that I am trying to fit simultaneously.

I realize the above may be vague–apologies for any unclear parts or omitted key info.

Hey @ajaynadig ,

I’m sorry hear you’re having trouble with Hail. I need to understand your model first before I can help with the Hail code.

It sounds like you have:

  • a matrix of gene-level statistics for each sample, x_{ij}
  • a phenotype for each sample, y_i
  • a “genome-wide statistic” for each sample, z_i

If I understand correctly, then z_i (your x2, a genome-wide statistic), is just another sample covariate.

i \in [1,N_{samples}] \\ j \in [1,N_{genes}] \\ \\ y_{i} = z_i \alpha + \sum_j x_{ij} \beta_j

You can implement this model in Hail as:

mt = mt.annotate_rows(
   regression_results = hl.agg.linreg(mt.y, [mt.z, mt.x])

This fits z_i for each gene separately. If z_i is orthogonal to every genotype row, then I think the beta for z_i should be, roughly, the same for every gene.

You could also regress away z_i first and then fit the gene-level statistics:

z_results = mt.aggregate_cols(
    hl.agg.linreg(mt.y, [mt.z])
mt = mt.annotate_cols(
   y_without_z = mt.y - z_results.beta * mt.z
mt = mt.annotate_rows(
   regression_results = hl.agg.linreg(mt.y_without_z, [mt.x])

In this setting, you’ll get just one beta for z_i. If z_i is collinear with any of your gene-statistic vectors, then the gene-statistic’s regression will show ~no relationship (because you’ve already accounted for all the variance with z_i).

If you don’t care about the beta on z_i, then you can use hl.linear_regression_rows, which is much faster:

regression_results_table = hl.linear_regression_rows(
    mt.y, mt.x, covariates=[mt.z]

Finally, in any of these settings, remember that Hail, by default, does not include an intercept covariate. If you want your model to include an intercept add 1 to the covariates list in linear_regression_rows or to the x list in hl.agg.linreg.