Questions about optimizing Hail and Spark configs and estimating resources and runtimes

Wow, thank you @danking! This is extremely helpful.

I have a few more follow-up questions based on your reply (not numbered in the same order as above):

  1. Interesting! So if I have 100K samples with 1 million variants, and a cluster with 100 vCPUs that is able to handle it, then if I move up to 2 million variants, I simply need 200 vCPUs for the analysis to finish just as fast? And if I bump this to 400 vCPUs, it won’t necessarily run 4x faster, right?

  2. I am still confused about increasing the total number of nodes vs. increasing the number of CPUs in each node! For example, is there a difference between the following two setups:

- 10 workers, each with 32 GB RAM and 8 CPUs
- 5 workers, each with 64 GB RAM and 16 CPUs

Are there scenarios where one is preferred over the other? In other words, when should I double the number of nodes versus doubling the number of cores in each node? (I’ve written out my mental model for questions 1 and 2 as a small back-of-envelope sketch right after this question.)
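
To check whether the mental model behind questions 1 and 2 is right, here is a tiny back-of-envelope sketch (plain Python, no Hail; the per-cell cost constant is completely made up, and the model assumes perfectly linear scaling with enough partitions to keep every core busy):

```python
# Idealized model: wall time ~= (samples * variants * cost per cell) / total vCPUs,
# assuming there are enough partitions to keep every core busy the whole time.
# The cost constant is invented purely for illustration.
COST_PER_CELL_CPU_SECONDS = 1e-6

def est_wall_time_hours(n_samples, n_variants, total_vcpus):
    total_cpu_seconds = n_samples * n_variants * COST_PER_CELL_CPU_SECONDS
    return total_cpu_seconds / total_vcpus / 3600

# Question 1: doubling the variants and the vCPUs keeps wall time the same (in this model).
print(est_wall_time_hours(100_000, 1_000_000, 100))  # baseline
print(est_wall_time_hours(100_000, 2_000_000, 200))  # same as the baseline
print(est_wall_time_hours(100_000, 2_000_000, 400))  # half the baseline, *if* scaling stays linear

# Question 2: both cluster shapes give the same totals in this model.
print(10 * 8, 10 * 32)  # 10 workers x 8 CPUs / 32 GB -> 80 vCPUs, 320 GB
print(5 * 16, 5 * 64)   # 5 workers x 16 CPUs / 64 GB -> 80 vCPUs, 320 GB
```

Is this roughly the right way to think about it, or does the per-node shape change the picture in practice (e.g. because of per-executor memory)?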

  3. If I have a cluster of 100 vCPUs and my analysis has X variants, will the same dataset with 3X variants take 3 times as long on the same 100 vCPUs?

  4. (Regarding Q7 above) I didn’t mean re-partitioning; I meant that when doing import_vcf I can set the number of partitions, right? Is it worth trying to tune this parameter? (I’ve put a rough sketch of where I think this and the Spark settings from the questions below get set, right after this list.)

  5. I read here that “Spark really performs badly when executors are larger than ~4 cores” – is that still the case? Should I touch the spark.executor.cores parameter at all, or is the default of 1 still fine?

  6. Should workers always have 4 GB of memory per core? Is 2 GB per core a bad setup? How about 8 GB per core?

  7. Should I always use SSD workers, or is HDD just as good for Hail?

  8. I’m not really sure how I can tune the block_size parameter. Could you please help me understand it, or point me to the right documentation for understanding what affects this parameter?

  9. (Regarding Q5 above) How about the driver and executor memory parameters? Can I really leave these at their defaults and let Hail do its thing even on very large datasets? I’m not sure at what point I need to increase executor/driver memory in Spark – when I move beyond 10K samples, or 100K, or 500K?

  10. Are there any known ‘optimal’ memory amounts for 10K, 100K, or 1M samples? And likewise, ‘optimal’ numbers of CPUs/nodes for 10K, 100K, or 1M variants?
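
Also, so that I’m asking about the right knobs in questions 4, 5, 8 and 9, here is a rough sketch of where I think those parameters actually get set. The names (spark_conf in hl.init, min_partitions in import_vcf, block_size on BlockMatrix) are just my reading of the Hail docs, the values are invented, and the file path is a placeholder – please correct me if any of this is off:

```python
import hail as hl

# Questions 5 and 9: my understanding is that executor cores and driver/executor
# memory are ordinary Spark properties that can be passed at init time.
# The values below are invented for illustration, not recommendations.
hl.init(spark_conf={
    'spark.driver.memory': '8g',
    'spark.executor.memory': '4g',
    'spark.executor.cores': '1',
})

# Question 4: this is what I meant by setting the number of partitions at import
# time rather than repartitioning afterwards. The path and count are placeholders.
mt = hl.import_vcf(
    'gs://my-bucket/my.vcf.bgz',
    reference_genome='GRCh38',
    min_partitions=2000,
)

# Question 8: as far as I can tell, block_size comes into play when entries are
# turned into a BlockMatrix (e.g. for PCA/relatedness-style linear algebra).
bm = hl.linalg.BlockMatrix.from_entry_expr(
    hl.or_else(mt.GT.n_alt_alleles(), 0),  # fill missing genotypes with 0, just for illustration
    block_size=4096,                       # the default, as I understand it
)
```

Is this the right place for these settings, or do some of them (e.g. spark.driver.memory) have to be set when the cluster or driver is launched rather than from inside the script?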

Thank you very much again, Dan – this has been extremely helpful for a beginner like me.