Hi,
I asked a while ago about on the zulip chat how to analyse compound hets in a hail mt. I’ve been plugging away in the background and I’m coming close to something at least a little informative but I’ve hit a couple of specific walls it would be great to get some help with.
I have made a gene annotation after annotating with vep and using the gnomad vep utils
mt=mt.annotate_rows(gene=mt.vep.worst_csq_by_gene_canonical.gene_symbol)
mt=mt.annotate_rows(hgvsc=mt.vep.worst_csq_by_gene_canonical.hgvsc)
I then annotate a column with the variants present in an individual gene:
r=mt.annotate_cols(hets=hl.agg.group_by(mt.gene,hl.agg.filter(mt.GT.is_het(),hl.agg.collect(mt.hgvsc))))`
Which looks like this:
I hope that image uploads but if not, it creates an annotation of
e.g.
dict<str, array<str>>
{"MCM8":["ENST00000378896.3:c.421A>G","ENST00000378896.3:c.464G>A"]}}
What I am aiming to do is convert the array for each gene in to pairwise combinations of hets.
If this was a list in python I would just do itertools.permutations(listname,2)
I’ve tried but failed to figure out the best way of doing thing in hail - hopefully just convert in to a list and run itertools.
I tried
z=mt.annotate_cols(pairwise=hl.array(mt.hets).map(lambda x: itertools.permutations(x[1].collect(),2)))
Thinking that .collect() turns an array in to a list but that fails.
Any ideas?
After this I will need to somehow aggregate the pairwise data across case and control status in order to filter those pairwise combinations present in > n controls. That’s the next problem though…
Thanks a lot,
Dan