Unable to flatten sample/gene counts for table export

Hi there!
I’m new to Hail, and I was hoping you could help me the export of a MatrixTable annotation.
I believe I’m doing something relatively common, i.e. counting the number of variants of a certain category in each gene of each sample.

I’d like to export these counts in a flattened way that’s readable in R, but I can’t find a way to flatten the structure to something like:
sampleName gene nMissense nSynon

sampleName gene nMissense nSynon
s1         ABC  1         0
s1         BCE  0         2
s2         ABC  1         2
s2         BCE  0         1

Alternatively, I’d be happy to export a valid JSON file which I could read with rjson or jsonlite in R.

I’ve tried the following:

  • aggregate and count each variant category separately, i.e. each annotation in a sample contain a key:value structure where the key is the gene
    in this case, each annotation would have the same keys (i.e. same genes), which I don’t know how to combine
  • aggregate and count by gene under a single hl.struct, i.e. the key is the gene, and the values are variant_category:count
    in this way I have everything I need

I’m giving an example of this second, to keep my post as simple as possible.

Annotation step:

annotations = mt.annotate_cols(
                  geneCounts = hl.agg.group_by(
                      mt.info.SYMBOL,
                      hl.struct(
                          nTotalSynonymousGene = hl.agg.count_where(
                                               mt.info.Consequence.contains('synonymous_variant')
                                               & mt.GT.is_non_ref()
                                               & hl.is_defined(mt.GT)),
                          nTotalMissenseGene = hl.agg.count_where(
                                             mt.info.Consequence.contains('missense_variant')
                                             & mt.GT.is_non_ref()
                                             & hl.is_defined(mt.GT)
                          )
                         )
                        )
)

I have then saved the MatrixTable cols() into a separate object, in order to handle a Table I can export in a TSV file (or JSON):

annotations.cols().select('geneCounts').flatten().export('path/test_by_gene.json(or tsv)')

The Table has this structure:

s             geneCounts
sample1       [{"key":["RRN3P1"],"value":{"nTotalSynonymousGene":0,"nTotalMissenseGene":0}}...]
sample2       [{"key":["RRN3P1"],"value":{"nTotalSynonymousGene":0,"nTotalMissenseGene":0}}...]
sample3       [{"key":["RRN3P1"],"value":{"nTotalSynonymousGene":0,"nTotalMissenseGene":0}}...]

But when I try to flatten, nothing happens and a JSON-like file is exported.
I haven’t succeeded exporting a TSV file.
If I read the produced JSON-like file into R, with jsonlite I get an error referring to the formatting indicating trailing garbage, while rjson indicates there’s unquoted strings.

The main tools to use here are the MatrixTable methods group_rows_by().aggregate – as opposed to the hl.agg.group_by aggregator – and then ht.entries().export().

annotations = mt.group_rows_by(gene = mt.info.SYMBOL).aggregate(
                  geneCounts =
                          nTotalSynonymousGene = hl.agg.count_where(
                                               mt.info.Consequence.contains('synonymous_variant')
                                               & mt.GT.is_non_ref()
                                               & hl.is_defined(mt.GT)),
                          nTotalMissenseGene = hl.agg.count_where(
                                             mt.info.Consequence.contains('missense_variant')
                                             & mt.GT.is_non_ref()
                                             & hl.is_defined(mt.GT)
                          )
                         )
)
annotations.select_cols().entries().export(...)

might have extra/missing parens/commas above.

Thank you so much! It worked!