However, Hail hangs indefinitely in this chunk of code, and I cannot tell why from the code itself. Is there a more efficient way to extract the desired information? Thank you very much.
I’m sorry you’re having trouble :/. Sample subsetting is not an efficient operation in Hail’s genotype representation. In particular, Hail’s representation is similar to the VCF representation: each variant is stored as a vector. To read three samples, Hail must read all samples.
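To illustrate the point (a toy pure-Python sketch, not Hail's actual on-disk format): with variant-major storage, pulling out even one sample's column still touches every row.

```python
# Toy sketch of a variant-major (VCF-like) layout: one row per variant,
# one genotype entry per sample. None of these names come from Hail.
matrix = [
    ["0/0", "0/1", "1/1", "0/0"],  # variant 1, samples s1..s4
    ["0/1", "0/0", "0/0", "0/1"],  # variant 2
    ["1/1", "0/1", "0/0", "0/0"],  # variant 3
]

def subset_samples(rows, sample_indices):
    """Extract a few samples: every row must still be read in full."""
    rows_read = 0
    result = []
    for row in rows:          # there is no way to skip a row: the
        rows_read += 1        # requested samples live inside every row
        result.append([row[i] for i in sample_indices])
    return result, rows_read

subset, rows_read = subset_samples(matrix, [1])  # just sample s2
# rows_read is 3: all rows were read to recover one sample's column
```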
OK, but I think we can make your query a lot faster with a more efficient filtering query. Hail can avoid reading irrelevant data when your filter_rows uses equality or interval containment on the locus. Do you have your genes as an interval list? If so, try this:
import hail as hl

gene_to_interval = hl.read_table(...)
gene_to_interval = gene_to_interval.filter(hl.literal(genes_of_interest_list).contains(gene_to_interval.gene_name))
intervals_of_interest_list = gene_to_interval.interval.collect()
# intervals_of_interest_list is a Python list of hl.Interval objects
mt = mt.filter_rows(hl.literal(intervals_of_interest_list).contains(mt.locus))
The above code will only read genotypes lying within the genes of interest. That should substantially improve runtime if you’re only using a small fraction of the variants.
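Here is a rough sketch of why this helps (pure Python with hypothetical names; Hail's actual partition pruning is internal): when rows are stored sorted by locus in partitions with known bounds, any partition whose bounds overlap no interval of interest can be skipped without reading it at all.

```python
# Hypothetical sketch of interval-based partition pruning. Positions are
# plain ints standing in for loci; partitions carry (start, end) bounds.
partitions = [
    {"bounds": (1, 100),   "rows": [10, 50, 90]},
    {"bounds": (101, 200), "rows": [110, 150]},
    {"bounds": (201, 300), "rows": [250, 290]},
]

intervals_of_interest = [(40, 60), (240, 260)]  # e.g. two genes

def overlaps(bounds, interval):
    return bounds[0] <= interval[1] and interval[0] <= bounds[1]

def filter_with_pruning(parts, intervals):
    kept, partitions_read = [], 0
    for part in parts:
        # skip whole partitions that cannot contain a relevant locus
        if not any(overlaps(part["bounds"], iv) for iv in intervals):
            continue
        partitions_read += 1
        for pos in part["rows"]:
            if any(lo <= pos <= hi for lo, hi in intervals):
                kept.append(pos)
    return kept, partitions_read

kept, partitions_read = filter_with_pruning(partitions, intervals_of_interest)
# only 2 of the 3 partitions are ever read
```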
I think you should also try using a compressed output format. That will be a lot faster:
mt_log.damaging.export("damaging.tsv.bgz")
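For a sense of the size difference (a standalone sketch using Python's standard gzip module; `.bgz` is block gzip, which ordinary gzip tools can also decompress):

```python
import gzip

# Build a toy TSV with the kind of repetitive content genotype exports have.
header = "sample\tgene\thgvs\tgenotype\n"
rows = "".join(f"sample_{i}\tBRCA1\tc.68_69delAG\t0/1\n" for i in range(10_000))
tsv = (header + rows).encode()

compressed = gzip.compress(tsv)
# Repetitive TSVs compress very well; the compressed copy is far smaller,
# so it is also far faster to write out and to send to collaborators.
ratio = len(tsv) / len(compressed)
```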
One last question: what analysis are you doing with hgvs (are those genotypes)? If you can perform your analysis directly in Hail, you avoid the export/import cost.
Thank you @danking , I will try that out. I just want to extract it in an accessible format (i.e., a spreadsheet) to send to others; no downstream analyses are planned yet.