However, Hail hangs indefinitely in this chunk of code, and I cannot tell why from the code itself. Is there a more efficient way to extract the desired information? Thank you very much.
I’m sorry you’re having trouble :/. Sample subsetting is not an efficient operation in Hail’s genotype representation. In particular, Hail’s representation is similar to the VCF representation: each variant is stored as a vector. To read three samples, Hail must read all samples.
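To illustrate the point (a toy pure-Python sketch, not Hail's actual on-disk format): with variant-major storage, pulling out even one sample's column still touches every row.

```python
# Toy sketch of a variant-major (VCF-like) layout: one row per variant,
# one genotype entry per sample. None of these names come from Hail.
matrix = [
    ["0/0", "0/1", "1/1", "0/0"],  # variant 1, samples s1..s4
    ["0/1", "0/0", "0/0", "0/1"],  # variant 2
    ["1/1", "0/1", "0/0", "0/0"],  # variant 3
]

def subset_samples(rows, sample_indices):
    """Extract a few samples: every row must still be read in full."""
    rows_read = 0
    result = []
    for row in rows:          # there is no way to skip a row: the
        rows_read += 1        # requested samples live inside every row
        result.append([row[i] for i in sample_indices])
    return result, rows_read

subset, rows_read = subset_samples(matrix, [1])  # just sample s2
# rows_read is 3: all rows were read to recover one sample's column
```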
OK, but I think we can make your query a lot faster with a more efficient filtering query. Hail can avoid reading irrelevant data when your filter_rows uses equality or interval containment on the locus. Do you have your genes as an interval list? If so, try this:
import hail as hl

gene_to_interval = hl.read_table(...)
gene_to_interval = gene_to_interval.filter(hl.literal(genes_of_interest_list).contains(gene_to_interval.gene_name))
intervals_of_interest_list = gene_to_interval.interval.collect()
# intervals_of_interest_list is a Python list of hl.Interval objects
mt = mt.filter_rows(hl.literal(intervals_of_interest_list).contains(mt.locus))
The above code will only read genotypes lying within the genes of interest. That should substantially improve runtime if you’re only using a small fraction of the variants.
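Here is a rough sketch of why this helps (pure Python with hypothetical names; Hail's actual partition pruning is internal): when rows are stored sorted by locus in partitions with known bounds, any partition whose bounds overlap no interval of interest can be skipped without reading it at all.

```python
# Hypothetical sketch of interval-based partition pruning. Positions are
# plain ints standing in for loci; partitions carry (start, end) bounds.
partitions = [
    {"bounds": (1, 100),   "rows": [10, 50, 90]},
    {"bounds": (101, 200), "rows": [110, 150]},
    {"bounds": (201, 300), "rows": [250, 290]},
]

intervals_of_interest = [(40, 60), (240, 260)]  # e.g. two genes

def overlaps(bounds, interval):
    return bounds[0] <= interval[1] and interval[0] <= bounds[1]

def filter_with_pruning(parts, intervals):
    kept, partitions_read = [], 0
    for part in parts:
        # skip whole partitions that cannot contain a relevant locus
        if not any(overlaps(part["bounds"], iv) for iv in intervals):
            continue
        partitions_read += 1
        for pos in part["rows"]:
            if any(lo <= pos <= hi for lo, hi in intervals):
                kept.append(pos)
    return kept, partitions_read

kept, partitions_read = filter_with_pruning(partitions, intervals_of_interest)
# only 2 of the 3 partitions are ever read
```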
I think you should also try using a compressed output format. That will be a lot faster:
mt_log.damaging.export("damaging.tsv.bgz")
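For a sense of the size difference (a standalone sketch using Python's standard gzip module; `.bgz` is block gzip, which ordinary gzip tools can also decompress):

```python
import gzip

# Build a toy TSV with the kind of repetitive content genotype exports have.
header = "sample\tgene\thgvs\tgenotype\n"
rows = "".join(f"sample_{i}\tBRCA1\tc.68_69delAG\t0/1\n" for i in range(10_000))
tsv = (header + rows).encode()

compressed = gzip.compress(tsv)
# Repetitive TSVs compress very well; the compressed copy is far smaller,
# so it is also far faster to write out and to send to collaborators.
ratio = len(tsv) / len(compressed)
```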
One last question: what analysis are you doing with hgvs (are those genotypes)? If you can perform your analysis directly in Hail, you avoid the export/import cost.
Thank you @danking , I will try that out. I just want to extract it in an accessible format (i.e., a spreadsheet) to send to others; no downstream analyses are planned yet.