How to select appropriate cluster specs for Hail

Hi,

Can you please provide some guidance on how to think about and design an appropriate cluster for Hail?

Example: 10k single-sample VCF files (~100M variants) imported into a MatrixTable.

In order to efficiently run aggregation queries on such a dataset (for speed, to avoid running out of memory, etc.), what kind of cluster would be a good choice in terms of:

  • specs for the master node
  • specs for the worker nodes
  • number of worker nodes

Thanks!

We generally recommend the following:

  • Use whatever autoscaling is provided by your cloud of choice.
  • Import your data into a MatrixTable and save it in that native format; never import_vcf and then immediately run analysis on the result (see the sketch below).
  • Use spot or preemptible workers unless your pipeline has a “shuffle” (basically: key_by and key_rows_by).
  • Use a leader/master node with ~16 cores and ~60 GB of RAM.
  • Worker nodes can generally be whatever the standard instance type is.
  • Some operations take a block_size parameter, which you can lower if you run into RAM problems.
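To make the second point concrete, here is a minimal sketch of the import-once, read-many pattern. The bucket paths, the GRCh38 build, and the GT entry field are assumptions for illustration; adjust them to your data:

    import hail as hl

    hl.init()

    # Hypothetical paths -- replace with your own.
    vcf_path = 'gs://my-bucket/my-dataset.vcf.bgz'
    mt_path = 'gs://my-bucket/my-dataset.mt'

    # One-time import: parse the VCF and write it out as a native MatrixTable.
    mt = hl.import_vcf(vcf_path, reference_genome='GRCh38')
    mt.write(mt_path, overwrite=True)

    # All later analysis reads the native format instead of re-parsing the VCF.
    mt = hl.read_matrix_table(mt_path)

    # Example aggregation: overall call rate across all genotypes.
    call_rate = mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
    print(f'call rate: {call_rate:.4f}')

If memory becomes a problem, operations that expose a block_size argument (for example, hl.linalg.BlockMatrix.from_entry_expr) can be given a smaller value than the default.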