Annotate row with sample information based on entry criteria

jbchang · November 12, 2022, 6:25am

Hi folks—

Hope you’re well. I am trying to annotate the rows of a matrixtable with information on which samples meet a certain entry criterion. At a high level, I want to find samples with an alternate allele, look up information about that sample, and store that information in the row so that I can eventually export the row fields into a .csv file.

As a toy example, let’s say I currently have a matrixtable with the following rows, columns, and entries:

Rows:

locus
alleles

Columns:

sample ID (“s”)
blood pressure (“bp”)

Entries:

genotype (“GT”)

What I’d like to do is annotate each row (variant) with which samples satisfy GT.is_non_ref() == True and also their bp value. The output could be, for example, an array of structs that is something like s1:bp1, s2:bp2, s3:bp3, etc.

I think the answer may be something like what’s in this post, but I just can’t figure out how to adapt the code. Closest I got is this, which for many reasons doesn’t exactly make sense even to me:

mt.annotate_rows(samples_with_variant = 
   hl.agg.filter(mt.GT.is_non_ref == True, 
      hl.tstruct(s = hl.agg.collect(mt.s), bp = hl.agg.collect(mt.bp)
   )
)

Thank you,
Jeremy

tpoterba · November 12, 2022, 12:14pm

Super close.

The two issues here are:

that is_non_ref is a class method, so needs parens: is_non_ref(). Also, the == True is redundant since is_non_ref() is already a boolean value.
hl.tstruct is a type (schema), not a constructor for a value. hl.struct constructs a value:

mt.annotate_rows(samples_with_variant = 
   hl.agg.filter(mt.GT.is_non_ref(), 
      hl.struct(s = hl.agg.collect(mt.s), bp = hl.agg.collect(mt.bp)
   )
)

It’s also perfectly valid to do:

mt.annotate_rows(samples_with_variant = 
   hl.agg.filter(mt.GT.is_non_ref(), 
      hl.agg.collect(hl.struct(s = mt.s, bp = mt.bp))
   )
)

The difference is that in the top, you’ll get a struct with two fields that contain the array of results for s and bp, and that in the second you’ll get an array with result that contain the struct of both s and bp. If you want to explode afterward to make each result its own row, the second representation will be what you want.

jbchang · November 13, 2022, 12:46am

Thank you Tim!! This worked for me (although I have only been able to examine the first 10 rows—for some reason DNAnexus is really sluggish today).

Although the code works, I think I don’t understand exactly how the code is enough to tell Hail what to do. It’s a little magical, which is good in that it seems really powerful, but bad for me because I don’t know if I can reproduce it reliably. In my mind, the pseudocode for a more procedural approach to this would be

For each row,
- For each sample, make an empty array for sample IDs and corresponding blood pressure
  - If the genotype contains a non-reference allele, then add the corresponding sample ID and blood pressure to the array

I don’t understand how the Hail code is specific enough to do all this correctly. For example, I know the argument for annotate_rows is an expression, but in the code snippet mt.GT.is_non_ref(), I believe mt.GT is a table of calls (I think? Sorry, I would have checked but DNAnexus is not working), and is_non_ref() is a method on those calls… so how does annotate_rows() know to only look at the calls corresponding to each individual row?

I think the hl.agg.collect() part makes a bit more sense to me, since I know each GT call knows which matrixtable it came from… although now that I think of it more, then I’m not sure why we need to specify mt.s instead of just 's`.

I wish I could phrase my question better…

Best,
Jeremy

Topic		Replies	Views
Add row annotation with label based on entry field of one sample Hail Query & hailctl	2	490	February 11, 2022
Error when trying to annotate a new row with a genotypes of the sample Hail Query & hailctl	2	331	July 13, 2023
Filtering MatrixTable for genotype in specific sample Hail Query & hailctl	7	1694	January 8, 2019
Annotate variants with hom var samples Hail Query & hailctl	0	342	January 19, 2023
Variant annotation in MatrixTable Hail Query & hailctl	13	795	July 3, 2020

Annotate row with sample information based on entry criteria

Related topics