Compound heterozygote analysis?

rna · May 16, 2022, 2:36pm

Greetings, I am trying to do a compound heterozygote analysis on a reasonably large dataset (~600k variants, 1k samples). Following an earlier post on the forum, I am trying to use this code:
mt = mt.annotate_cols(hets=hl.agg.group_by(mt.gene_symbol, hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.hgvs))))
However, I am unable to even compute the hets field, presumably because I have a large number of genes and samples. Is there a way to keep only the genes with more than one heterozygous call, or is there a built-in function in Hail for this purpose?

danking · May 16, 2022, 2:46pm

I think you’ll have a better time with a two-step process:

mt = mt.group_rows_by(mt.gene_symbol).aggregate(
    compound_hets = hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.hgvs))
)

Note: this uses group_rows_by, which is a shuffling operation. That means you’ll want to use non-preemptible / non-spot VMs if you’re running on a cloud cluster. More details on shuffling here.
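To make the shape of that result concrete, here is a minimal pure-Python sketch of what the grouped aggregation computes: for every (gene, sample) pair, the list of variants at which that sample is heterozygous. It does not use Hail; the toy data, `VARIANTS`, `is_het` and `het_calls_by_gene_and_sample` are hypothetical names for illustration only. Unlike the grouped MatrixTable, which has an entry for every gene and sample, this sketch simply omits pairs with no het calls.

```python
from collections import defaultdict

# Hypothetical toy rows: (gene_symbol, hgvs, {sample_id: genotype}).
# A genotype is a pair of allele indices; None would mean a missing call.
VARIANTS = [
    ("GENE_A", "c.10A>G", {"s1": (0, 1), "s2": (0, 0)}),
    ("GENE_A", "c.55C>T", {"s1": (0, 1), "s2": (1, 1)}),
    ("GENE_B", "c.7G>A", {"s1": (0, 0), "s2": (0, 1)}),
]


def is_het(gt):
    """A call is heterozygous when it is present and its two alleles differ."""
    return gt is not None and gt[0] != gt[1]


def het_calls_by_gene_and_sample(variants):
    """Collect, per (gene, sample), the hgvs strings of heterozygous calls."""
    grouped = defaultdict(list)
    for gene, hgvs, calls in variants:
        for sample, gt in calls.items():
            if is_het(gt):
                grouped[(gene, sample)].append(hgvs)
    return dict(grouped)


het_lists = het_calls_by_gene_and_sample(VARIANTS)
```

Here s1 ends up with two het variants in GENE_A, while s2 has a single het variant in GENE_B. Two het calls in the same gene only mark a candidate: without phasing, the sketch cannot tell whether the variants sit on different haplotypes.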

rna · May 16, 2022, 7:23pm

Thank you very much, danking. Since this annotates entries, is there any way I can annotate the cols with the genes that have >1 heterozygous call? I am still new to this and appreciate any help!

rna · May 17, 2022, 12:50am

Thank you, danking! Is there a way to aggregate by column (sample) to get the genes that have >1 heterozygous call?
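For reference, the per-sample step being asked about can be sketched in plain Python: starting from per-(gene, sample) lists of het variants, keep for each sample only the genes with more than one het call. This is an illustration of the logic, not Hail code; in Hail it would presumably be a column-level aggregation over the grouped entries. `HET_LISTS`, `candidate_genes_per_sample` and `min_hets` are hypothetical names, and the data is made up.

```python
from collections import defaultdict

# Hypothetical per-(gene, sample) het lists, one entry per gene and sample,
# including empty lists for pairs with no heterozygous calls.
HET_LISTS = {
    ("GENE_A", "s1"): ["c.10A>G", "c.55C>T"],
    ("GENE_A", "s2"): [],
    ("GENE_B", "s1"): ["c.7G>A"],
    ("GENE_B", "s2"): ["c.7G>A", "c.91del"],
}


def candidate_genes_per_sample(het_lists, min_hets=2):
    """Map each sample to the genes in which it has at least min_hets het calls."""
    per_sample = defaultdict(dict)
    for (gene, sample), variants in het_lists.items():
        if len(variants) >= min_hets:
            per_sample[sample][gene] = list(variants)
    return dict(per_sample)


candidates = candidate_genes_per_sample(HET_LISTS)
```

Samples with no qualifying gene are absent from `candidates` rather than mapped to an empty dict, which keeps the result small when most samples have nothing to report.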
