More efficient way to extract calls?

Hi all!

I have a MatrixTable of around 1k individuals, and I am interested in analyzing three individuals:

mt = mt.filter_cols(hl.set(samples_list).contains(mt.s))

Afterward, I am interested in extracting the HGVS notations of damaging variants in a list of genes of interest:

mt_ = mt.filter_rows(hl.set(genes_of_interest_list).contains(mt.gene_symbol))
mt_lof = mt.filter_rows(hl.set(['D']).contains(mt.meta_svm))
mt_lof = mt_lof.filter_rows(hl.len(mt_lof.filters) == 0)
mt_lof = mt_lof.filter_rows(mt_lof.gnomad_af <= 0.001)

mt_lof = mt_lof.group_rows_by(mt_lof.gene_symbol).aggregate(
    damaging = hl.agg.filter(mt_lof.GT.is_non_ref(), hl.agg.collect(mt_lof.hgvs))
)
mt_lof.damaging.export("damaging.tsv")

However, Hail is always stuck in this chunk of code perpetually, and I am unable to find out why from the code. Is there a more efficient way to extract the desired information? Thank you very much.

Hey @MsUTR !

I’m sorry you’re having trouble :/. Sample subsetting is not an efficient operation in Hail’s genotype representation. In particular, Hail’s representation is similar to the VCF representation: each variant is stored as a vector. To read three samples, Hail must read all samples.


OK, but, I think we can make your query a lot faster by using a more efficient filtering query. Hail can avoid reading irrelevant data when your filter_rows uses equality or interval containment on the locus. Do you have your genes as an interval list? If yes, try this:

gene_to_interval = hl.read_table(...)
gene_to_interval = gene_to_interval.filter(genes_of_interest_list.contains(gene_to_interval.gene_name))
intervals_of_interest_list = gene_to_interval.interval.collect()
# intervals_of_interest_list is a Python list of hl.Interval objects
mt = mt.filter_rows(hl.literal(intervals_of_interest_list).contains(mt.locus))

The above code will only read genotypes lying within the genes of interest. That should substantially improve runtime if you’re only using a small fraction of the variants.


I think you should also try using a compressed output format. That will be a lot faster:

mt_log.damaging.export("damaging.tsv.bgz")

My last question: what analysis are you doing with hgvs (are those genotypes?)? If you can perform your analysis directly in Hail, you avoid export/import cost.

Thank you @danking , I will try that out. I am just interested to extract it out in an accessible format (i.e. spreadsheet) to send to others, no downstream analyses are planned yet.

1 Like