Can you please provide some guidance on how to think about / design the appropriate cluster for Hail?
Example: 10k single-sample VCF files (~100M variants) imported into a MatrixTable.
In order to run aggregation queries efficiently on such a dataset (speed, avoiding running out of memory, etc.), what kind of cluster would be a good choice in terms of:
- specs for master node
- specs for slave node
- number of slave nodes
We generally recommend using whatever autoscaling your cloud of choice provides. Import your data into a matrix table once and save it in that format; never run `import_vcf` and then immediately do analysis on the result. Use spot or preemptible workers unless your pipeline has a “shuffle” (basically: `key_rows_by`). Use a leader/master node with ~16 cores and ~60 GB of RAM. Worker nodes can generally be whatever the standard instance type is. Some operations take a `block_size` parameter, which you can set to a smaller value if you run into RAM problems.
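
To make the "import once, then analyze the saved matrix table" advice concrete, here is a minimal sketch. The paths, the `mt_path_for` naming scheme, the helper function names, and the `GRCh38` default are assumptions for illustration and are not from this thread; only `hl.import_vcf`, `MatrixTable.write`, `hl.read_matrix_table`, and `MatrixTable.count` are Hail calls. It shows the pattern for a single VCF and does not cover merging many single-sample files.

```python
def mt_path_for(vcf_path: str) -> str:
    """Derive a MatrixTable path next to the VCF (hypothetical naming scheme)."""
    for suffix in (".vcf.bgz", ".vcf"):
        if vcf_path.endswith(suffix):
            return vcf_path[: -len(suffix)] + ".mt"
    return vcf_path + ".mt"


def import_once(vcf_path: str, reference_genome: str = "GRCh38") -> str:
    """One-time step: convert the VCF to Hail's native format, return its path."""
    import hail as hl  # imported lazily so mt_path_for works without Hail

    mt_path = mt_path_for(vcf_path)
    mt = hl.import_vcf(vcf_path, reference_genome=reference_genome)
    mt.write(mt_path, overwrite=True)  # persist; do not analyze `mt` directly
    return mt_path


def count_rows_and_cols(mt_path: str):
    """Analysis step: always start from the saved matrix table."""
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    return mt.count()  # (number of rows, number of columns)


if __name__ == "__main__":
    # Hypothetical bucket path; replace with your own data.
    saved = import_once("gs://my-bucket/cohort.vcf.bgz")
    print(count_rows_and_cols(saved))
```

Every later query then starts from `hl.read_matrix_table`, so the cost of parsing the VCF text is paid once rather than on each analysis run.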