Greetings, I am trying to do a compound heterozygote analysis on a reasonably large dataset (~600k variants, 1k samples). Following an earlier post on the forum, I am trying to use this code:
mt = mt.annotate_cols(hets=hl.agg.group_by(mt.gene_symbol, hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.hgvs))))
However, I am unable to even compute the hets field, presumably because I have a large number of genes and samples. Is there a way to keep only genes with more than one heterozygous call, or is there a built-in function in Hail for this purpose?
I think you’ll have a better time in a two step process:
mt = mt.group_rows_by(mt.gene_symbol).aggregate(
    compound_hets = hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.hgvs))
)
Note: this uses group_rows_by, which is a shuffling operation. That means you’ll want to use non-preemptible / non-spot VMs if you’re running on a cloud cluster. More details on shuffling here.
Thank you very much, danking. As this annotates entries, is there any way I can annotate the columns with genes that have >1 heterozygous calls? I am still new to this and would appreciate any help!
Thank you Danking! Is there a way to aggregate by column (samples) for genes that have >1 heterozygous calls?
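To make the two-step idea concrete, here is a minimal plain-Python sketch of the logic on toy data: first collect het variants per (gene, sample), then keep, per sample, only genes with more than one het call. This is not Hail code; the function names and the toy calls are made up for illustration. In Hail, the second step would presumably be a column aggregation over the gene-keyed matrix table with a condition along the lines of hl.len(mt.compound_hets) > 1, but that is an assumption and has not been confirmed in this thread.

```python
from collections import defaultdict

# Toy genotype calls: (sample, gene_symbol, hgvs, is_het). All values are made up.
calls = [
    ("s1", "GENE_A", "c.1A>G", True),
    ("s1", "GENE_A", "c.5C>T", True),
    ("s1", "GENE_B", "c.9G>A", True),
    ("s2", "GENE_A", "c.1A>G", True),
    ("s2", "GENE_B", "c.9G>A", False),
    ("s2", "GENE_B", "c.12T>C", True),
]


def hets_by_gene(calls):
    """Step 1: for each (gene, sample) pair, collect the heterozygous variants."""
    grouped = defaultdict(list)
    for sample, gene, hgvs, is_het in calls:
        if is_het:
            grouped[(gene, sample)].append(hgvs)
    return grouped


def candidate_genes_per_sample(grouped, min_hets=2):
    """Step 2: for each sample, keep genes with at least min_hets het calls."""
    per_sample = defaultdict(dict)
    for (gene, sample), variants in grouped.items():
        if len(variants) >= min_hets:
            per_sample[sample][gene] = variants
    return dict(per_sample)


result = candidate_genes_per_sample(hets_by_gene(calls))
print(result)
```

Note that two or more het calls in a gene only make it a candidate: without phase information the variants could sit on the same haplotype.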