Annotate row with sample information based on entry criteria

Hi folks—

Hope you’re well. I am trying to annotate the rows of a matrixtable with information on which samples meet a certain entry criterion. At a high level, I want to find samples with an alternate allele, look up information about that sample, and store that information in the row so that I can eventually export the row fields into a .csv file.

As a toy example, let’s say I currently have a matrixtable with the following rows, columns, and entries:

Rows:

  • locus
  • alleles

Columns:

  • sample ID (“s”)
  • blood pressure (“bp”)

Entries:

  • genotype (“GT”)

What I’d like to do is annotate each row (variant) with which samples satisfy GT.is_non_ref() == True and also their bp value. The output could be, for example, an array of structs that is something like s1:bp1, s2:bp2, s3:bp3, etc.

I think the answer may be something like what’s in this post, but I just can’t figure out how to adapt the code. Closest I got is this, which for many reasons doesn’t exactly make sense even to me:

mt.annotate_rows(samples_with_variant = 
   hl.agg.filter(mt.GT.is_non_ref == True, 
      hl.tstruct(s = hl.agg.collect(mt.s), bp = hl.agg.collect(mt.bp)
   )
)

Thank you,
Jeremy

Super close.

The two issues here are:

  1. that is_non_ref is a class method, so needs parens: is_non_ref(). Also, the == True is redundant since is_non_ref() is already a boolean value.

  2. hl.tstruct is a type (schema), not a constructor for a value. hl.struct constructs a value:

mt.annotate_rows(samples_with_variant = 
   hl.agg.filter(mt.GT.is_non_ref(), 
      hl.struct(s = hl.agg.collect(mt.s), bp = hl.agg.collect(mt.bp)
   )
)

It’s also perfectly valid to do:

mt.annotate_rows(samples_with_variant = 
   hl.agg.filter(mt.GT.is_non_ref(), 
      hl.agg.collect(hl.struct(s = mt.s, bp = mt.bp))
   )
)

The difference is that in the top, you’ll get a struct with two fields that contain the array of results for s and bp, and that in the second you’ll get an array with result that contain the struct of both s and bp. If you want to explode afterward to make each result its own row, the second representation will be what you want.

Thank you Tim!! This worked for me (although I have only been able to examine the first 10 rows—for some reason DNAnexus is really sluggish today).

Although the code works, I think I don’t understand exactly how the code is enough to tell Hail what to do. It’s a little magical, which is good in that it seems really powerful, but bad for me because I don’t know if I can reproduce it reliably. In my mind, the pseudocode for a more procedural approach to this would be

  • For each row,
    • For each sample, make an empty array for sample IDs and corresponding blood pressure
      • If the genotype contains a non-reference allele, then add the corresponding sample ID and blood pressure to the array

I don’t understand how the Hail code is specific enough to do all this correctly. For example, I know the argument for annotate_rows is an expression, but in the code snippet mt.GT.is_non_ref(), I believe mt.GT is a table of calls (I think? Sorry, I would have checked but DNAnexus is not working), and is_non_ref() is a method on those calls… so how does annotate_rows() know to only look at the calls corresponding to each individual row?

I think the hl.agg.collect() part makes a bit more sense to me, since I know each GT call knows which matrixtable it came from… although now that I think of it more, then I’m not sure why we need to specify mt.s instead of just 's`.

I wish I could phrase my question better…

Best,
Jeremy