How to select appropriate cluster specs for Hail

igorm · October 24, 2022, 2:33pm

Hi,

Can you please provide some guidance on how to think about / design the appropriate cluster for Hail?

Example: 10k single-sample VCF files (~100M variants) imported into a MatrixTable.

In order to efficiently run aggregation queries (speed, avoiding running out of memory, etc.) on such a dataset, what kind of cluster would be a good choice in terms of:

specs for master node
specs for slave node
number of slave nodes

Thanks!

danking · October 24, 2022, 8:02pm

We generally recommend using whatever autoscaling is provided by your cloud of choice. Import your data into a matrix table and save it in that format (never run `import_vcf` and then immediately do analysis on the result). Use spot or preemptible workers unless your pipeline has a “shuffle” (basically: `key_by` and `key_rows_by`). Use a leader/master node with ~16 cores and ~60 GB of RAM. Worker nodes can generally be whatever the standard instance type is. Some operations take a `block_size` parameter, which you can set to a smaller value if you run into RAM problems.
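The "import once, save as a matrix table, then analyze" advice above could look roughly like the sketch below. It is only an illustration under some assumptions: the function names, the `GRCh38` default, and the bucket paths in the trailing comment are hypothetical, and it assumes a VCF that already contains all samples (combining 10k single-sample VCFs is a separate step that this does not cover). Hail is imported inside the functions so the definitions load even where Hail is not installed.

```python
def convert_vcf_once(vcf_path, mt_path, reference_genome="GRCh38"):
    """One-time conversion: parse the VCF and persist it in Hail's native format."""
    import hail as hl  # lazy import; requires a working Hail installation

    mt = hl.import_vcf(vcf_path, reference_genome=reference_genome)
    # Writing forces the (expensive) VCF parse exactly once.
    mt.write(mt_path, overwrite=True)


def summarize(mt_path):
    """Run aggregation queries against the saved matrix table, not the VCF."""
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    n_variants = mt.count_rows()
    # Fraction of entries with a defined genotype call (assumes a GT entry field).
    call_rate = mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
    return {"n_variants": n_variants, "call_rate": call_rate}


# Hypothetical usage (paths are placeholders, not real locations):
# convert_vcf_once("gs://my-bucket/cohort.vcf.bgz", "gs://my-bucket/cohort.mt")
# print(summarize("gs://my-bucket/cohort.mt"))
```

Because Hail evaluates lazily, querying the result of `import_vcf` directly would re-parse the VCF text for each query, whereas reading the written matrix table avoids that cost.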
