Good approach to subset a hail matrix

TaotaoTan · April 22, 2022, 4:35pm

Hi Hail team,

Thanks so much for developing the tool. I am currently trying to develop a tutorial for AllofUs on using Hail (although I am a new user myself). I have a large dataset, and I want to randomly select a small subset of variants across the genome, plus a small region of one chromosome that I know has hits, and export this subset as a matrix table so that folks can run a GWAS much faster.

My current strategy is as follows:

assign random boolean to variant and select

mt = mt.annotate_rows(rand=hl.rand_bool(0.001))
mt1 = mt.filter_rows(mt.rand == True)

select a small section of the genome

test_intervals = ['chr10:111M-113M']
mt2 = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x)
     for x in test_intervals])

combine two tables, and write matrix table

mt_tutorial = mt1.union_rows(mt2)
mt_tutorial = mt_tutorial.drop('rand')

mt_tutorial.write(f'{bucket}/data/mt_tutorial.mt')

However, I found that this process takes a very long time to run (after 14 hours it still had not produced anything). Is there a more efficient way to do this?

Best,
Taotao

tpoterba · April 23, 2022, 2:59pm

This approach requires reading the entire input MatrixTable in order to generate the random variants over the genome for mt1. What configuration are you using to run this pipeline? How long does it take, for instance, to read the data and run mt.summarize()?
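One way to answer the timing question is to wrap the call in a small stopwatch helper. The sketch below is only an illustration: `timed` is a hypothetical helper name, and the Hail usage is left in comments because it needs a running Hail session and a real table path.

```python
import time


def timed(label, fn):
    """Run fn(), print the wall-clock time it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    seconds = time.perf_counter() - start
    print(f"{label}: {seconds:.1f} s")
    return result, seconds


# Hypothetical usage with Hail (path_to_mt is a placeholder, not a real path):
#   import hail as hl
#   mt = hl.read_matrix_table(path_to_mt)
#   timed("summarize", lambda: mt.summarize())
```

Because summarizing touches every row, the measured time gives a rough lower bound for any pipeline that has to scan the full table, such as the random-boolean filter.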

TaotaoTan · April 23, 2022, 6:47pm

Thanks for your reply. I think I have found a better way to handle this:

I use a basic Python function to randomly create thousands of small intervals across the whole genome, and then ask Hail to parse these intervals. I found that this way I can create a complete Manhattan plot in less than 5 minutes.
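The post does not include the function itself, so the following is only a sketch of what such an approach might look like, not the author's code. The helper names, the number of intervals, the interval width, and the choice of `GRCh38` (guessed from the `chr`-prefixed contig name) are all assumptions. Only three contigs are listed to keep the example short.

```python
import random

# GRCh38 lengths for a few contigs, for illustration only. With Hail available,
# hl.get_reference('GRCh38').lengths gives the full contig -> length mapping.
EXAMPLE_CONTIG_LENGTHS = {
    "chr1": 248_956_422,
    "chr10": 133_797_422,
    "chr22": 50_818_468,
}


def random_interval_strings(contig_lengths, n_intervals, width, seed=None):
    """Return n_intervals strings like 'chr1:12345-13345', each `width` bases long.

    Contigs are drawn with probability proportional to their length, so the
    intervals are spread roughly uniformly over the listed contigs.
    """
    rng = random.Random(seed)
    contigs = list(contig_lengths)
    weights = [contig_lengths[c] for c in contigs]
    intervals = []
    for _ in range(n_intervals):
        contig = rng.choices(contigs, weights=weights, k=1)[0]
        start = rng.randint(1, contig_lengths[contig] - width)
        intervals.append(f"{contig}:{start}-{start + width}")
    return intervals


def subset_by_intervals(mt, interval_strings, reference_genome="GRCh38"):
    """Keep only the rows of mt that fall inside the given intervals."""
    import hail as hl  # imported lazily so the helper above works without Hail

    intervals = [
        hl.parse_locus_interval(s, reference_genome=reference_genome)
        for s in interval_strings
    ]
    return hl.filter_intervals(mt, intervals)


# Hypothetical sizes: 2000 random intervals of 1 kb each.
interval_strings = random_interval_strings(
    EXAMPLE_CONTIG_LENGTHS, n_intervals=2000, width=1000, seed=0
)
# The known region from the original post can go into the same list,
# which avoids the separate union_rows step.
interval_strings.append("chr10:111M-113M")
```

Calling `subset_by_intervals(mt, interval_strings)` would then return the reduced matrix table. Filtering by interval lets Hail skip the parts of the table that lie outside the requested regions, which is presumably why this runs so much faster than annotating every row with a random boolean.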

Thanks again,
Taotao
