Hi Hail team,
Thanks so much for developing this tool. I am currently developing a tutorial on using Hail with All of Us data (although I am a new user myself). I have a large dataset, and I want to randomly select a small subset of variants across the genome, together with a small region of one chromosome that I know contains hits, and export this subset as a matrix table so that people can run a GWAS much faster.
My current strategy is as follows:
Assign a random Boolean to each variant and keep the selected ones:
mt = mt.annotate_rows(rand=hl.rand_bool(0.001))
mt1 = mt.filter_rows(mt.rand == True)
Select a small section of the genome:
test_intervals = ['chr10:111M-113M']
mt2 = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x,)
     for x in test_intervals])
Combine the two matrix tables and write the result:
mt_tutorial = mt1.union_rows(mt2)
mt_tutorial = mt_tutorial.drop('rand')
mt_tutorial.write('{bucket}/data/mt_tutorial.mt')
However, this process takes a very long time to run (after 14 hours it has still produced no output). Is there a more efficient way to do this?
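To make the intended result concrete, here is a toy sketch of the selection logic in plain Python. This is not Hail code: the names `parse_interval`, `select_variants`, and `toy_variants` are made up for illustration, the half-open interval bounds are an assumption of the toy, and the variants are just (contig, position) pairs.

```python
import random


def to_base_pairs(text):
    """Convert '111M' to 111000000; plain digits are returned as an int."""
    if text.endswith("M"):
        return int(float(text[:-1]) * 1_000_000)
    return int(text)


def parse_interval(text):
    """Parse a 'chr10:111M-113M' style string into (contig, start, end)."""
    contig, span = text.split(":")
    start, end = span.split("-")
    return contig, to_base_pairs(start), to_base_pairs(end)


def select_variants(variants, intervals, fraction, seed=0):
    """Keep a variant if it lies in any interval OR wins a random draw.

    One pass with an OR condition keeps each variant at most once, so a
    variant that is both in the region and randomly drawn is not duplicated.
    """
    rng = random.Random(seed)
    parsed = [parse_interval(i) for i in intervals]
    kept = []
    for contig, pos in variants:
        # Toy assumption: intervals are half-open, [start, end).
        in_region = any(c == contig and s <= pos < e for c, s, e in parsed)
        drawn = rng.random() < fraction
        if in_region or drawn:
            kept.append((contig, pos))
    return kept


# Hypothetical toy data: 2000 variants on chr1 plus four on chr10.
toy_variants = [("chr1", 1_000 * i) for i in range(1, 2001)] + [
    ("chr10", 110_500_000),
    ("chr10", 111_200_000),
    ("chr10", 112_900_000),
    ("chr10", 113_500_000),
]
subset = select_variants(toy_variants, ["chr10:111M-113M"], fraction=0.001)
```

In other words, I would like every variant inside the region, plus roughly 0.1% of all other variants, with each variant appearing only once in the output.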
Best,
Taotao