Count all gene strings in an array field with a counter aggregator for a whole table

Hi,

I have an ANN field within info, and I want to count how often each gene symbol or name occurs in it, using an aggregator.

My code is as follows, where anngen is the MatrixTable:

anngen.aggregate_entries(hl.agg.explode(lambda element: hl.agg.counter(element), anngen.info.ANN))

this returns:

2021-11-12 12:53:32 Hail: INFO: Coerced prefix-sorted dataset

frozendict({'|3_prime_UTR_variant|MODIFIER|LACE1|ENSG00000135537|transcript|ENST00000368977|protein_coding|13/13|c.|||||1556|INFO_REALIGN_3_PRIME': 2504, '|3_prime_UTR_variant|MODIFIER|MIER1|ENSG00000198160|transcript|ENST00000355356|protein_coding|13/13|c.|||||904|INFO_REALIGN_3_PRIME': 2504,

I believe the gene names are in the 4th pipe-delimited position (element 3 after splitting). The above works well for exploding and counting entire ANN strings, but I specifically want to count the gene names.

I think you can do:

anngen.aggregate_entries(
  hl.agg.explode(lambda element: hl.agg.counter(element), 
                 anngen.info.ANN.split('|')[3]))



Hmm,

I got this, in response:

AttributeError: 'ArrayExpression' object has no attribute 'split'

I am reviewing the ArrayExpression type now.
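
For what it's worth, the error makes sense if info.ANN is an array of strings: split is a string method, so it has to be applied to each element rather than to the array itself. Below is a minimal plain-Python sketch of the intended logic (split each annotation on the pipe, take field 3, count). The sample rows and the gene_counts helper are made up for illustration, and the Hail expression in the comment is an untested guess at the equivalent.

```python
from collections import Counter

# Made-up rows for illustration: each row's ANN is a list of
# pipe-delimited annotation strings, shortened from the output above.
ann_rows = [
    ["|3_prime_UTR_variant|MODIFIER|LACE1|ENSG00000135537",
     "|3_prime_UTR_variant|MODIFIER|MIER1|ENSG00000198160"],
    ["|3_prime_UTR_variant|MODIFIER|MIER1|ENSG00000198160"],
]

def gene_counts(rows, gene_index=3):
    """Count the gene-name field across every ANN element of every row."""
    counts = Counter()
    for ann in rows:                      # one ANN array per row
        for element in ann:               # "explode" the array
            fields = element.split("|")   # split the string, not the array
            counts[fields[gene_index]] += 1
    return counts

# A possible Hail equivalent (untested assumption): move the split inside
# the lambda so it runs on each string element, e.g.
#   hl.agg.explode(lambda element: hl.agg.counter(element.split('\\|')[3]),
#                  anngen.info.ANN)
# The pipe is escaped there because Hail's split takes a regex delimiter.

print(gene_counts(ann_rows))
```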

I am able to split with:

smallgen.info.ANN[0].split('|')[3].show(4)

but oddly, the ANN string seems to get broken up further into individual characters, so a single character comes back for each row instead of the gene name:

locus       alleles     
locus       array<str>  str
chr1:16103  ["T","G"]   "o"
chr1:51479  ["T","A"]   "n"
chr1:51898  ["C","A"]   "n"
chr1:51928  ["G","A"]   "n"
chr1:51954  ["G","C"]   "n"
chr1:54490  ["G","A"]   "n"
chr1:54669  ["C","T"]   "n"
chr1:54708  ["G","C"]   "n"
chr1:54716  ["C","T"]   "n"
chr1:54725  ["T","G"]   "n"
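
One likely explanation for the single letters: Hail's split interprets the delimiter as a regular expression, and an unescaped | is regex alternation between two empty patterns, which matches between every character. Escaping it (for example split('\\|') in Hail) should split on the literal pipe instead. The sketch below uses Python's re module as a stand-in for Hail's Java regex engine, so details such as the handling of leading empty pieces may differ, and the sample string is invented.

```python
import re

# Invented annotation string in the same pipe-delimited layout.
ann = "G|intron_variant|MODIFIER|LACE1|ENSG00000135537"

# Unescaped "|" is an empty alternation: it matches at every position,
# so the string falls apart into single characters (Python 3.7+).
# Empty pieces at the ends are dropped here for readability.
pieces = [p for p in re.split("|", ann) if p]

# Escaped pipe matches only the literal "|" character.
fields = re.split(r"\|", ann)
gene = fields[3]

print(pieces[:6])
print(gene)
```

With the unescaped delimiter, indexing [3] picks out a single character of the annotation, which would account for values like "o" and "n" in the table above.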