Questions about optimizing Hail and Spark configs and estimating resources and runtimes

Wow, thank you @danking! This is extremely helpful.

I have a few more follow-up questions based on your reply (not numbered in the same order as above):

  1. Interesting! So if I have 100K samples with 1 million variants, and a cluster with 100 vCPUs that is able to handle it, then if I move up to 2 million variants, I simply need 200 vCPUs for the analysis to finish just as fast? And if I bump this to 400 vCPUs, it won’t necessarily run 4x faster, right?

  2. I am still confused about increasing the total number of nodes vs. increasing the number of CPUs in each node! For example, is there a difference between the following two setups:

- 10 workers, each with 32 GB RAM and 8 CPUs
- 5 workers, each with 64 GB RAM and 16 CPUs

Are there scenarios where one is preferred over the other? In other words, when should I double the number of nodes versus doubling the number of cores in each node? (I’ve written out my mental model for questions 1 and 2 as a small back-of-envelope sketch right after this question.)
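
To check whether the mental model behind questions 1 and 2 is right, here is a tiny back-of-envelope sketch (plain Python, no Hail; the per-cell cost constant is completely made up, and the model assumes perfectly linear scaling with enough partitions to keep every core busy):

```python
# Idealized model: wall time ~= (samples * variants * cost per cell) / total vCPUs,
# assuming there are enough partitions to keep every core busy the whole time.
# The cost constant is invented purely for illustration.
COST_PER_CELL_CPU_SECONDS = 1e-6

def est_wall_time_hours(n_samples, n_variants, total_vcpus):
    total_cpu_seconds = n_samples * n_variants * COST_PER_CELL_CPU_SECONDS
    return total_cpu_seconds / total_vcpus / 3600

# Question 1: doubling the variants and the vCPUs keeps wall time the same (in this model).
print(est_wall_time_hours(100_000, 1_000_000, 100))  # baseline
print(est_wall_time_hours(100_000, 2_000_000, 200))  # same as the baseline
print(est_wall_time_hours(100_000, 2_000_000, 400))  # half the baseline, *if* scaling stays linear

# Question 2: both cluster shapes give the same totals in this model.
print(10 * 8, 10 * 32)  # 10 workers x 8 CPUs / 32 GB -> 80 vCPUs, 320 GB
print(5 * 16, 5 * 64)   # 5 workers x 16 CPUs / 64 GB -> 80 vCPUs, 320 GB
```

Is this roughly the right way to think about it, or does the per-node shape change the picture in practice (e.g. because of per-executor memory)?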

  3. If I have a cluster of 100 vCPUs and my analysis has X variants, will the same dataset with 3X variants take 3 times as long on the same 100 vCPUs?

  4. (Regarding Q7 above) I didn’t mean re-partitioning; I meant that when doing import_vcf I can set the number of partitions, right? Is it worth trying to tune this parameter? (I’ve put a rough sketch of where I think this and the Spark settings from the questions below get set, right after this list.)

  5. I read here that “Spark really performs badly when executors are larger than ~4 cores” – is that still the case? Should I touch the spark.executor.cores parameter at all, or is the default of 1 still fine?

  6. Should workers always have 4 GB of memory per core? Is 2 GB per core a bad setup? How about 8 GB per core?

  7. Should I always use SSD workers, or is HDD just as good for Hail?

  8. I’m not really sure how I can tune the block_size parameter. Could you please help me understand it, or point me to the right documentation for understanding what affects this parameter?

  9. (Regarding Q5 above) How about the driver and executor memory parameters? Can I really leave these at their defaults and let Hail do its thing even on very large datasets? I’m not sure at what point I need to increase executor/driver memory in Spark – when I move beyond 10K samples, or 100K, or 500K?

  10. Are there any known ‘optimal’ memory amounts for 10K, 100K, or 1M samples? And likewise, ‘optimal’ numbers of CPUs/nodes for 10K, 100K, or 1M variants?
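
Also, so that I’m asking about the right knobs in questions 4, 5, 8 and 9, here is a rough sketch of where I think those parameters actually get set. The names (spark_conf in hl.init, min_partitions in import_vcf, block_size on BlockMatrix) are just my reading of the Hail docs, the values are invented, and the file path is a placeholder – please correct me if any of this is off:

```python
import hail as hl

# Questions 5 and 9: my understanding is that executor cores and driver/executor
# memory are ordinary Spark properties that can be passed at init time.
# The values below are invented for illustration, not recommendations.
hl.init(spark_conf={
    'spark.driver.memory': '8g',
    'spark.executor.memory': '4g',
    'spark.executor.cores': '1',
})

# Question 4: this is what I meant by setting the number of partitions at import
# time rather than repartitioning afterwards. The path and count are placeholders.
mt = hl.import_vcf(
    'gs://my-bucket/my.vcf.bgz',
    reference_genome='GRCh38',
    min_partitions=2000,
)

# Question 8: as far as I can tell, block_size comes into play when entries are
# turned into a BlockMatrix (e.g. for PCA/relatedness-style linear algebra).
bm = hl.linalg.BlockMatrix.from_entry_expr(
    hl.or_else(mt.GT.n_alt_alleles(), 0),  # fill missing genotypes with 0, just for illustration
    block_size=4096,                       # the default, as I understand it
)
```

Is this the right place for these settings, or do some of them (e.g. spark.driver.memory) have to be set when the cluster or driver is launched rather than from inside the script?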

Thank you very much again, Dan – this has been extremely helpful for a beginner like me.