Count all gene strings in an array field with a counter aggregator for a whole table

rjwswenson · November 12, 2021, 1:01pm

Hi,

I have an ANN field within info, and I want to display a count of each gene symbol or name using an aggregator.

My code is as follows, where anngen is the MatrixTable (MT):

anngen.aggregate_entries(hl.agg.explode(lambda element: hl.agg.counter(element), anngen.info.ANN))

this returns:

2021-11-12 12:53:32 Hail: INFO: Coerced prefix-sorted dataset

frozendict({'|3_prime_UTR_variant|MODIFIER|LACE1|ENSG00000135537|transcript|ENST00000368977|protein_coding|13/13|c.|||||1556|INFO_REALIGN_3_PRIME': 2504, '|3_prime_UTR_variant|MODIFIER|MIER1|ENSG00000198160|transcript|ENST00000355356|protein_coding|13/13|c.|||||904|INFO_REALIGN_3_PRIME': 2504,

I believe the gene names are in the 4th position (array element 3 after splitting on '|'). The above works well for exploding and aggregating counts of the entire ANN strings, but I specifically want to count the gene names.

tpoterba · November 12, 2021, 8:25pm

I think you can do:

anngen.aggregate_entries(
  hl.agg.explode(lambda element: hl.agg.counter(element), 
                 anngen.info.ANN.split('|')[3]))

rjwswenson · November 13, 2021, 3:06pm

Hmm,

I got this in response:

AttributeError: 'ArrayExpression' object has no attribute 'split'

I am reviewing the ArrayExpression type now.
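The error above is consistent with info.ANN being an array of strings: split is a string method, so it has to be applied to each element rather than to the array. A minimal sketch of that idea follows. The plain-Python function only models what the aggregation is meant to compute. The Hail function is an untested sketch that assumes info.ANN has type array<str>; it escapes the pipe on the assumption that Hail's split treats its delimiter as a regular expression. The names count_gene_symbols, hail_gene_symbol_counts, and GENE_NAME_INDEX, and the sample strings, are made up for illustration.

```python
from collections import Counter

# Assumption taken from the post: the gene symbol is the 4th '|'-separated field.
GENE_NAME_INDEX = 3


def count_gene_symbols(ann_arrays):
    """Plain-Python model: ann_arrays is an iterable of lists of ANN strings."""
    counts = Counter()
    for ann_array in ann_arrays:      # one array per variant
        for ann in ann_array:         # "explode": visit every array element
            counts[ann.split('|')[GENE_NAME_INDEX]] += 1
    return dict(counts)


def hail_gene_symbol_counts(mt):
    """Hail sketch of the same idea; assumes mt.info.ANN is array<str>."""
    import hail as hl  # imported lazily so the model above runs without Hail

    return mt.aggregate_entries(
        hl.agg.explode(
            # split each string element, not the array; the pipe is escaped
            # on the assumption that the delimiter is interpreted as a regex
            lambda ann: hl.agg.counter(ann.split(r'\|')[GENE_NAME_INDEX]),
            mt.info.ANN,
        )
    )


# Made-up annotations, for illustration only.
example = [
    ["A|intron_variant|MODIFIER|GENE1|ID1", "A|upstream_gene_variant|MODIFIER|GENE2|ID2"],
    ["T|missense_variant|MODERATE|GENE1|ID1"],
]
print(count_gene_symbols(example))
```

Since ANN is a row field, aggregate_entries visits it once per sample, which may be why every count in the output above is 2504. If one count per variant is wanted, aggregate_rows with the same aggregator expression is probably the better fit.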

I am able to split with:

smallgen.info.ANN[0].split('|')[3].show(4)

but oddly, this appears to split the ANN string into individual characters, so a single character is returned instead of the gene name:

locus        alleles      
locus        array        str
chr1:16103   ["T","G"]    "o"
chr1:51479   ["T","A"]    "n"
chr1:51898   ["C","A"]    "n"
chr1:51928   ["G","A"]    "n"
chr1:51954   ["G","C"]    "n"
chr1:54490   ["G","A"]    "n"
chr1:54669   ["C","T"]    "n"
chr1:54708   ["G","C"]    "n"
chr1:54716   ["C","T"]    "n"
chr1:54725   ["T","G"]    "n"
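One possible explanation for the single characters, offered as an assumption rather than a confirmed diagnosis: if Hail's split interprets its delimiter as a regular expression, an unescaped '|' is an alternation of two empty patterns, which matches the empty string between every pair of characters. Index 3 is then simply the fourth character of the string. That would fit the table if the ANN strings start with a one-letter allele, as in "A|intron_variant..." or "G|downstream_gene_variant...". The sketch below uses Python's re module and a made-up ANN string to illustrate the regex behavior; it does not call Hail.

```python
import re

# Made-up ANN string in the assumed layout: allele|annotation|impact|gene|...
ann = "A|intron_variant|MODIFIER|GENE1|ID1"

# As a regex, a bare pipe matches the empty string, not a literal '|'.
empty_match = re.match("|", ann).group()

# Splitting between every character makes index 3 the fourth character.
fourth_char = list(ann)[3]

# Escaping the pipe splits on the literal delimiter instead.
gene = re.split(r"\|", ann)[3]

print(repr(empty_match), fourth_char, gene)
```

Under that assumption, the Hail expression would need an escaped delimiter, for example smallgen.info.ANN[0].split('\\|')[3].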
