Recommended data node hardware for Hail

Are there recommended hardware configurations or guidelines for Hadoop data nodes for Hail 0.2?

  1. How many drives per data node
  2. Drive type (SATA
  3. Number of cores per data node
  4. Memory per core or memory per node

Given a budget, are we better off with the fastest CPUs or more data nodes with slower CPUs?

This Hadoop cluster will be used to analyze VCFs ranging from 5K to 500K.

The Spark cluster model includes one driver and many executors. Each executor is a single JVM that may have multiple cores and can execute several tasks concurrently, but Spark performs badly when executors are larger than ~4 cores. This means that a 16-core machine is going to be broken into 4 totally independent, share-nothing chunks. As a result, stuffing more cores or memory into a single box won’t make things go any faster.
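To make the arithmetic concrete, here is a rough sketch of how a node gets carved into executors. The function name and the default of 4 cores per executor are illustrative assumptions taken from the rule of thumb above, not anything Hail or Spark provides.

```python
def executors_per_node(cores_per_node: int, cores_per_executor: int = 4) -> int:
    """Return how many independent executors fit on one node.

    cores_per_executor defaults to 4, the rough upper bound suggested
    above (an assumption, not a Spark-enforced limit).
    """
    if cores_per_node < 1 or cores_per_executor < 1:
        raise ValueError("core counts must be positive")
    # Any cores left over after integer division go unused by executors.
    return cores_per_node // cores_per_executor


# One 16-core box is split into 4 share-nothing executors ...
one_big_node = executors_per_node(16)
# ... which is the same executor count as two 8-core boxes.
two_small_nodes = 2 * executors_per_node(8)

print(one_big_node, two_small_nodes)
```

Under this model the total executor count depends only on the total number of cores, which is why a bigger box does not buy anything over several smaller ones.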

Right now SSDs probably aren’t worth it. Hail will probably be able to take better advantage of SSD performance in 6 months to a year.

~4 GB of RAM per CPU seems to work on Google Dataproc.
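As a sketch, the two rules of thumb in this answer (executors of about 4 cores, about 4 GB per CPU) could be turned into Spark settings like this. The helper name is hypothetical. `spark.executor.cores` and `spark.executor.memory` are standard Spark properties, but giving the full per-core allowance to the executor heap is a simplifying assumption: Spark also reserves overhead memory outside the heap, so a real cluster may need a somewhat smaller value.

```python
def executor_settings(cores_per_executor: int = 4, gb_per_core: int = 4) -> dict:
    """Build Spark executor properties from the rules of thumb above.

    Both defaults are assumptions taken from this answer, not values
    required by Hail or Spark.
    """
    heap_gb = cores_per_executor * gb_per_core
    return {
        "spark.executor.cores": str(cores_per_executor),
        # Simplification: treats the whole per-core allowance as JVM heap.
        "spark.executor.memory": f"{heap_gb}g",
    }


for key, value in executor_settings().items():
    print(f"{key}={value}")
```

The resulting key/value pairs can be passed as `--conf key=value` arguments to `spark-submit` or placed in `spark-defaults.conf`.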