Unable to flatten sample/gene counts for table export

Hi there!
I’m new to Hail, and I was hoping you could help me the export of a MatrixTable annotation.
I believe I’m doing something relatively common, i.e. counting the number of variants of a certain category in each gene of each sample.

I’d like to export these counts in a flattened way that’s readable in R, but I can’t find a way to flatten the structure to something like:
sampleName gene nMissense nSynon

sampleName gene nMissense nSynon
s1         ABC  1         0
s1         BCE  0         2
s2         ABC  1         2
s2         BCE  0         1

Alternatively, I’d be happy to export a valid JSON file which I could read with rjson or jsonlite in R.

I’ve tried the following:

  • aggregate and count each variant category separately, i.e. each annotation in a sample contain a key:value structure where the key is the gene
    in this case, each annotation would have the same keys (i.e. same genes), which I don’t know how to combine
  • aggregate and count by gene under a single hl.struct, i.e. the key is the gene, and the values are variant_category:count
    in this way I have everything I need

I’m giving an example of this second, to keep my post as simple as possible.

Annotation step:

annotations = mt.annotate_cols(
                  geneCounts = hl.agg.group_by(
                      mt.info.SYMBOL,
                      hl.struct(
                          nTotalSynonymousGene = hl.agg.count_where(
                                               mt.info.Consequence.contains('synonymous_variant')
                                               & mt.GT.is_non_ref()
                                               & hl.is_defined(mt.GT)),
                          nTotalMissenseGene = hl.agg.count_where(
                                             mt.info.Consequence.contains('missense_variant')
                                             & mt.GT.is_non_ref()
                                             & hl.is_defined(mt.GT)
                          )
                         )
                        )
)

I have then saved the MatrixTable cols() into a separate object, in order to handle a Table I can export in a TSV file (or JSON):

annotations.cols().select('geneCounts').flatten().export('path/test_by_gene.json(or tsv)')

The Table has this structure:

s             geneCounts
sample1       [{"key":["RRN3P1"],"value":{"nTotalSynonymousGene":0,"nTotalMissenseGene":0}}...]
sample2       [{"key":["RRN3P1"],"value":{"nTotalSynonymousGene":0,"nTotalMissenseGene":0}}...]
sample3       [{"key":["RRN3P1"],"value":{"nTotalSynonymousGene":0,"nTotalMissenseGene":0}}...]

But when I try to flatten, nothing happens and a JSON-like file is exported.
I haven’t succeeded exporting a TSV file.
If I read the produced JSON-like file into R, with jsonlite I get an error referring to the formatting indicating trailing garbage, while rjson indicates there’s unquoted strings.

The main tools to use here are the MatrixTable methods group_rows_by().aggregate – as opposed to the hl.agg.group_by aggregator – and then ht.entries().export().

annotations = mt.group_rows_by(gene = mt.info.SYMBOL).aggregate(
                  geneCounts =
                          nTotalSynonymousGene = hl.agg.count_where(
                                               mt.info.Consequence.contains('synonymous_variant')
                                               & mt.GT.is_non_ref()
                                               & hl.is_defined(mt.GT)),
                          nTotalMissenseGene = hl.agg.count_where(
                                             mt.info.Consequence.contains('missense_variant')
                                             & mt.GT.is_non_ref()
                                             & hl.is_defined(mt.GT)
                          )
                         )
)
annotations.select_cols().entries().export(...)

might have extra/missing parens/commas above.

Thank you so much! It worked!

Hey, I have a similar issue which I unfortunately can’t resolve after reading this and similar threads regarding grouping and exporting.

I am doing the grouping like this:

mt = mt.group_rows_by(gene=mt.info.SYMBOL).aggregate(
                  ac = hl.agg.sum(mt.variant_qc.AC[1]),
                  an = hl.agg.mean(mt.variant_qc.AN),
                  hom = hl.agg.sum(mt.variant_qc.homozygote_count[1]))

and exporting is done like this:

mt.select_cols().entries().export('output.txt')

then I get this:

gene	s	ac	an	hom
5S_rRNA	sample1	12	5.8000e+01	2
5S_rRNA	sample2	12	5.8000e+01	2
5S_rRNA	sample3	12	5.8000e+01	2
5S_rRNA	sample4	12	5.8000e+01	2
5S_rRNA	sample5	12	5.8000e+01	2
5S_rRNA	sample6	12	5.8000e+01	2
5S_rRNA	sample7	12	5.8000e+01	2
5S_rRNA	sample8	12	5.8000e+01	2
5S_rRNA	sample9	12	5.8000e+01	2

I understand the concept of having samples multiplied by entries, so that’s not surprising. But how can I get the information only once per gene ? I can do a workaround like so:

mt=mt.filter_cols(mtfge.s == "sample1")
mt.select_cols().entries().export('output.txt')

which results in:

gene			s	ac	an	hom
5S_rRNA		sample1	12	5.8000e+01	2
5_8S_rRNA	sample1	544	5.8000e+01	65
7SK			sample1	12	5.8000e+01	6
A1BG		sample1	26	4.2386e+02	0
A1CF		sample1	27	1.5231e+02	0
A2M			sample1	131	4.2590e+02	1
A2ML1		sample1	52	2.6562e+02	1
A2ML1-AS1	sample1	8	5.8000e+01	0
A2MP1		sample1	2	5.8000e+01	0

but this is of course not the best way to do it. Can you cut my gordian-knot in my head to solve this and help me better understand the concept behind? Thanks a lot !