Good approach to subset a Hail MatrixTable

Hi Hail team,

Thanks so much for developing the tool. I am currently putting together an All of Us tutorial on using Hail (although I am a new user myself). I have a large dataset, and I want to randomly select a small subset of variants across the genome, plus a small region of one chromosome that I know has hits, and export this subset as a MatrixTable so that folks can run a GWAS much faster.

My current strategy is as follows:

Assign a random boolean to each variant and keep the variants where it is true

mt = mt.annotate_rows(rand=hl.rand_bool(0.001))
mt1 = mt.filter_rows(mt.rand)

Select a small section of the genome

test_intervals = ['chr10:111M-113M']
mt2 = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x)
     for x in test_intervals])

Combine the two tables and write the result as a MatrixTable

mt_tutorial = mt1.union_rows(mt2)
mt_tutorial = mt_tutorial.drop('rand')

mt_tutorial.write(f'{bucket}/data/mt_tutorial.mt')

However, this process takes a very long time to run (after 14 hours it still had not produced anything). Is there a more efficient way to do this?

Best,
Taotao


This approach requires reading the entire input MatrixTable in order to pick the random variants across the genome for mt1. What configuration are you using to run this pipeline? How long does it take, for instance, to read the table and run mt.summarize()?
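One way to get that baseline number is to wrap a single full pass over the data in a timer. The sketch below is only an illustration: `time_call` is a hypothetical helper, not part of Hail, and the commented-out lines assume `mt` is the MatrixTable from the original post.

```python
import time


def time_call(label, fn):
    """Run fn() once, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f} s")
    return result, elapsed


# With Hail (not executed here; assumes `mt` is already loaded):
#   n_rows, seconds = time_call("count_rows", mt.count_rows)
#   _, seconds = time_call("summarize", mt.summarize)
```

If a plain row count already takes hours, the bottleneck is the full read itself rather than the random filter or the union.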

Thanks for your reply. I think I have found a better way to handle this:

I used a basic Python function to randomly generate thousands of small intervals across the whole genome, and then had Hail parse those intervals and filter on them. That way I can produce a complete Manhattan plot in less than 5 minutes.
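The post does not include the code, so the following is only a sketch of what such an approach could look like. The function names, the interval width, and the seed are all assumptions; the three contig lengths are GRCh38 values included for illustration (with Hail available, `hl.get_reference('GRCh38').lengths` gives the full dictionary). The Hail import sits inside the second function so the interval generator can run on its own.

```python
import random


def random_interval_strings(contig_lengths, n_intervals, width, seed=None):
    """Return n_intervals strings such as 'chr1:1000-2000', drawn at random.

    Contigs are sampled in proportion to their length, so the windows are
    spread roughly uniformly over the genome. Windows may overlap.
    """
    rng = random.Random(seed)
    contigs = list(contig_lengths)
    weights = [contig_lengths[c] for c in contigs]
    intervals = []
    for _ in range(n_intervals):
        contig = rng.choices(contigs, weights=weights, k=1)[0]
        length = contig_lengths[contig]
        # Positions are 1-based; keep the whole window inside the contig.
        start = rng.randint(1, max(1, length - width))
        end = min(start + width, length)
        intervals.append(f"{contig}:{start}-{end}")
    return intervals


def subset_by_intervals(mt, interval_strings, reference_genome="GRCh38"):
    """Keep only the rows of mt that fall inside the given intervals."""
    import hail as hl  # imported lazily so the generator above runs without Hail

    parsed = [
        hl.parse_locus_interval(s, reference_genome=reference_genome)
        for s in interval_strings
    ]
    return hl.filter_intervals(mt, parsed)


# Illustrative subset of GRCh38 contig lengths (assumption: demo only).
demo_lengths = {"chr1": 248_956_422, "chr10": 133_797_422, "chr22": 50_818_468}
tutorial_intervals = random_interval_strings(demo_lengths, 1000, 10_000, seed=0)
tutorial_intervals.append("chr10:111M-113M")  # the known-hit region from the first post
# mt_tutorial = subset_by_intervals(mt, tutorial_intervals)
```

Because hl.filter_intervals works on the row key, Hail can skip the partitions that contain none of the requested intervals, which would explain the drop from many hours to a few minutes. The trade-off is that variants are sampled in small clusters rather than independently, which is usually fine for a tutorial Manhattan plot.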

Thanks again,
Taotao