Cluster Size for Subsetting in Hail

We’re trying to figure out the size of a cluster to use for subsetting a few VCFs, details below.

There are 22 *.vcf.bgz files, totaling roughly 350 GiB.
Each needs to be subset by 48 sample lists, so 22 × 48 = 1,056 output VCF files.

We’re planning to loop over each VCF and create the subsets. What size (CPU/memory/disk) of Google Dataproc cluster should we set up to achieve maximum resource utilization?
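The loop described above might look something like the following Hail sketch. The paths, filenames, and output layout are assumptions for illustration; `hl.import_vcf`, `filter_cols`, and `hl.export_vcf` are real Hail calls.

```python
# Hypothetical sketch of the subsetting loop; all paths are illustrative.

def subset_one(vcf_path, sample_list_path, out_path):
    """Subset a single VCF to the samples listed one-per-line in sample_list_path."""
    import hail as hl  # deferred import so the sketch stands alone

    with open(sample_list_path) as f:
        keep = {line.strip() for line in f if line.strip()}

    mt = hl.import_vcf(vcf_path)                          # .vcf.bgz is read natively
    mt = mt.filter_cols(hl.literal(keep).contains(mt.s))  # keep only listed samples
    hl.export_vcf(mt, out_path)


def subset_all(vcf_paths, sample_lists, out_dir):
    """22 VCFs x 48 sample lists -> 1,056 output VCFs."""
    import os
    for vcf in vcf_paths:
        stem = os.path.basename(vcf).replace(".vcf.bgz", "")
        for i, sample_list in enumerate(sample_lists):
            subset_one(vcf, sample_list, f"{out_dir}/{stem}.subset{i}.vcf.bgz")
```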


Hail probably isn’t the best tool for VCF-to-VCF subsetting. Running 1,056 Hail jobs here will see poor cluster utilization, because the last step of each export, concatenating the shards written in parallel into a single VCF, is single-threaded.

If you wanted to create partitioned MatrixTable files, that would be parallelizable, but not VCFs.
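For comparison, a sketch of the MatrixTable route (paths hypothetical): the write is distributed across partitions, unlike the single-threaded final concatenation of a VCF export.

```python
def subset_to_matrixtable(vcf_path, sample_list_path, out_path):
    """Write one subset as a partitioned MatrixTable instead of a VCF."""
    import hail as hl  # deferred import so the sketch stands alone

    with open(sample_list_path) as f:
        keep = {line.strip() for line in f if line.strip()}

    mt = hl.import_vcf(vcf_path)
    mt = mt.filter_cols(hl.literal(keep).contains(mt.s))
    mt.write(out_path, overwrite=True)  # parallel write, one file per partition
```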

Thanks for that explanation, Tim! I think out of expediency this is our choice, and it's probably worth keeping in mind for future planning. For something like splitting the VCFs, what kind of machine sizes would be useful?

Okay, sure, I totally understand that using one tool makes things easier even if it isn't the best fit for each individual job. I think machine size won't matter much; a ~100-core cluster would probably be fine for this. Using an autoscaling feature (like Google Dataproc's autoscaling policies) will help a lot, so that you don't pay as much for idle workers.