Hi,
I have an ANN field within info, and I want to count how often each gene symbol (or name) appears, using an aggregator.
My code is as follows, where anngen is the MatrixTable:
anngen.aggregate_entries(hl.agg.explode(lambda element: hl.agg.counter(element), anngen.info.ANN))
This returns:
2021-11-12 12:53:32 Hail: INFO: Coerced prefix-sorted dataset
frozendict({'|3_prime_UTR_variant|MODIFIER|LACE1|ENSG00000135537|transcript|ENST00000368977|protein_coding|13/13|c.|||||1556|INFO_REALIGN_3_PRIME': 2504, '|3_prime_UTR_variant|MODIFIER|MIER1|ENSG00000198160|transcript|ENST00000355356|protein_coding|13/13|c.|||||904|INFO_REALIGN_3_PRIME': 2504,
I believe the gene names are in the 4th position (array element 3) of each pipe-delimited string. The above works well for exploding and counting entire ANN strings, but I specifically want counts per gene name.
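To make the target concrete, here is a minimal plain-Python sketch (not Hail) of the count being asked for, using the two ANN strings from the output above. The helper name `count_gene_names` is made up for illustration, and it assumes the gene name is always field 3 after splitting on a literal pipe.

```python
from collections import Counter

# The two ANN strings shown in the aggregation output above.
ann_values = [
    "|3_prime_UTR_variant|MODIFIER|LACE1|ENSG00000135537|transcript|ENST00000368977|protein_coding|13/13|c.|||||1556|INFO_REALIGN_3_PRIME",
    "|3_prime_UTR_variant|MODIFIER|MIER1|ENSG00000198160|transcript|ENST00000355356|protein_coding|13/13|c.|||||904|INFO_REALIGN_3_PRIME",
]

def count_gene_names(annotations):
    """Count gene names, assuming the name is field 3 of each pipe-delimited string."""
    return Counter(a.split("|")[3] for a in annotations)

gene_counts = count_gene_names(ann_values)
print(gene_counts)
```

The desired aggregation result would have this shape: one key per gene name rather than one key per full annotation string.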
I think you can do:
anngen.aggregate_entries(
hl.agg.explode(lambda element: hl.agg.counter(element),
anngen.info.ANN.split('|')[3]))
Hmm,
I got this, in response:
AttributeError: 'ArrayExpression' object has no attribute 'split'
I am reviewing the ArrayExpression type now.
I am able to split with:
smallgen.info.ANN[0].split('|')[3].show(4)
but this oddly appears to split the ANN string into individual characters, so the result is a single character per row:
locus        alleles
locus        array        str
chr1:16103   ["T","G"]    "o"
chr1:51479   ["T","A"]    "n"
chr1:51898   ["C","A"]    "n"
chr1:51928   ["G","A"]    "n"
chr1:51954   ["G","C"]    "n"
chr1:54490   ["G","A"]    "n"
chr1:54669   ["C","T"]    "n"
chr1:54708   ["G","C"]    "n"
chr1:54716   ["C","T"]    "n"
chr1:54725   ["T","G"]    "n"
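One possible explanation for the single characters, offered as an assumption rather than a confirmed diagnosis: as far as I know, Hail's string `split` treats its delimiter as a regular expression, and an unescaped `|` is the alternation operator, which can match the empty string between every character. The plain-Python sketch below only emulates that effect with `list()`. The ANN string and gene name in it are hypothetical, and the Hail lines in the comments are untested suggestions.

```python
import re

# Hypothetical ANN value; the real strings in the dataset may differ.
ann = "G|intron_variant|MODIFIER|GENE1|ENSG00000000001|transcript"

# Emulates splitting between every character, which is what an
# empty-matching regex delimiter would do.
per_character = list(ann)
print(per_character[3])  # a single character, not a gene name

# Escaping the pipe makes it a literal delimiter.
fields = re.split(r"\|", ann)
print(fields[3])

# Untested Hail equivalent (assumes ANN is an array of strings):
#   genes = anngen.info.ANN.map(lambda a: a.split('\\|')[3])
#   anngen.aggregate_entries(
#       hl.agg.explode(lambda g: hl.agg.counter(g), genes))
```

If that reading is right, mapping over the array would also avoid the AttributeError, since `split` is then called on each string element rather than on the array itself.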