Compound hets and array<str> to list help

danchubb · May 12, 2020, 4:02pm

Hi,

I asked a while ago about on the zulip chat how to analyse compound hets in a hail mt. I’ve been plugging away in the background and I’m coming close to something at least a little informative but I’ve hit a couple of specific walls it would be great to get some help with.

I have made a gene annotation after annotating with vep and using the gnomad vep utils

mt=mt.annotate_rows(gene=mt.vep.worst_csq_by_gene_canonical.gene_symbol)
mt=mt.annotate_rows(hgvsc=mt.vep.worst_csq_by_gene_canonical.hgvsc)

I then annotate a column with the variants present in an individual gene:

 r=mt.annotate_cols(hets=hl.agg.group_by(mt.gene,hl.agg.filter(mt.GT.is_het(),hl.agg.collect(mt.hgvsc))))`

Which looks like this:

I hope that image uploads but if not, it creates an annotation of
e.g.

dict<str, array<str>>
 {"MCM8":["ENST00000378896.3:c.421A>G","ENST00000378896.3:c.464G>A"]}}

What I am aiming to do is convert the array for each gene in to pairwise combinations of hets.

If this was a list in python I would just do itertools.permutations(listname,2)

I’ve tried but failed to figure out the best way of doing thing in hail - hopefully just convert in to a list and run itertools.

I tried

z=mt.annotate_cols(pairwise=hl.array(mt.hets).map(lambda x: itertools.permutations(x[1].collect(),2)))

Thinking that .collect() turns an array in to a list but that fails.

Any ideas?

After this I will need to somehow aggregate the pairwise data across case and control status in order to filter those pairwise combinations present in > n controls. That’s the next problem though…

Thanks a lot,

Dan

danking · May 12, 2020, 4:17pm

Hey @danchubb, I’m sorry Hail hasn’t made this easy for you!

Hail doesn’t support “user-defined functions,” so calling python libraries like itertools.permutations directly won’t work. Collect converts something from a distributed object to a local object. Using collect inside a annotate_... doesn’t make sense because annotate_... is modifying the rows or columns of a distributed object in parallel.

Luckily, column metadata is typically small and you can just convert the column data into a python object with collect, do the operations in python, then bring the data back into the column metadata:

hets_as_python = mt.select_cols('hets').col.collect()
new_hets_as_python = [
    x.annotate(hets = itertools.permutations(x.hets, 2))
    for x in cols_as_python]
new_hets_as_hail = hl.Table.parallelize(new_hets_as_python).key_by('s')
mt = mt.annotate_cols(new_hets = new_hets_as_hail[mt.col_key].hets)

danchubb · May 12, 2020, 5:10pm

That’s really helpful Dan. Looks good so far, I’ll carry on and see if I can get it to do what I need.

Thanks so much,

Dan

Topic		Replies	Views
Compound heterozygote analysis? Hail Query & hailctl	3	452	May 17, 2022
Gene-based sample statistics Hail Query & hailctl	6	736	January 27, 2020
Converting array<array<str>> expression to array<str> Hail Query & hailctl	2	555	October 20, 2021
Is there an easy way to make a struct from list of columns in hail table? Hail Query & hailctl	3	468	November 16, 2021
Error when trying to annotate a new row with a genotypes of the sample Hail Query & hailctl	2	330	July 13, 2023

Compound hets and array<str> to list help

Related topics