Cluster Size for Subsetting in Hail

We’re trying to figure out the size of a cluster to use for subsetting a few VCFs, details below.

There are 22 *.vcf.bgz files, totaling roughly 350 GiB.
Each needs to be subset by 48 sample lists, so 22 × 48 = 1,056 output VCF files.

We’re planning to loop over each VCF and create the subsets. What size (CPU/memory/disk) of Google Dataproc cluster should we set up to achieve maximum resource utilization?
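The loop described above might look something like the following Hail sketch. The paths, filenames, and output layout are assumptions for illustration; `hl.import_vcf`, `filter_cols`, and `hl.export_vcf` are real Hail calls.

```python
# Hypothetical sketch of the subsetting loop; all paths are illustrative.

def subset_one(vcf_path, sample_list_path, out_path):
    """Subset a single VCF to the samples listed one-per-line in sample_list_path."""
    import hail as hl  # deferred import so the sketch stands alone

    with open(sample_list_path) as f:
        keep = {line.strip() for line in f if line.strip()}

    mt = hl.import_vcf(vcf_path)                          # .vcf.bgz is read natively
    mt = mt.filter_cols(hl.literal(keep).contains(mt.s))  # keep only listed samples
    hl.export_vcf(mt, out_path)


def subset_all(vcf_paths, sample_lists, out_dir):
    """22 VCFs x 48 sample lists -> 1,056 output VCFs."""
    import os
    for vcf in vcf_paths:
        stem = os.path.basename(vcf).replace(".vcf.bgz", "")
        for i, sample_list in enumerate(sample_lists):
            subset_one(vcf, sample_list, f"{out_dir}/{stem}.subset{i}.vcf.bgz")
```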


Hail probably isn’t the best tool for VCF-to-VCF subsetting. Running 1,056 Hail jobs here will see poor cluster utilization, because the last step of each export, concatenating the shards written in parallel into a single VCF, is single-threaded.

If you wanted to create partitioned MatrixTable files, that would be parallelizable, but not VCFs.
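For comparison, a sketch of the MatrixTable route (paths hypothetical): the write is distributed across partitions, unlike the single-threaded final concatenation of a VCF export.

```python
def subset_to_matrixtable(vcf_path, sample_list_path, out_path):
    """Write one subset as a partitioned MatrixTable instead of a VCF."""
    import hail as hl  # deferred import so the sketch stands alone

    with open(sample_list_path) as f:
        keep = {line.strip() for line in f if line.strip()}

    mt = hl.import_vcf(vcf_path)
    mt = mt.filter_cols(hl.literal(keep).contains(mt.s))
    mt.write(out_path, overwrite=True)  # parallel write, one file per partition
```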

Thanks for that explanation, Tim! I think out of expediency this is our choice, and it's probably worth keeping in mind for future planning. For something like splitting the VCFs, what kind of machine sizes would be useful?

Okay, sure, I totally understand that using one tool makes things easier even if it isn't the best fit for each individual job. I think machine size won't matter much; a ~100-core cluster would probably be fine for this. Using an autoscaling feature (like Google Dataproc's autoscaling policies) will help a lot, so that you don't pay as much for idle workers.