Wow, thank you @danking! This is extremely helpful.
I have a couple more follow-up questions based on your reply (not numbered in the same order as above):
- Interesting! So if I have 100K samples with 1 million variants, and a cluster with 100 vCPUs that can handle it, then if I move up to 2 million variants I simply need 200 vCPUs for the analysis to finish just as fast? And if I bump this to 400 vCPUs, it won't necessarily run 4x faster, right?
- I am still confused about increasing the total number of nodes vs. increasing the number of CPUs in each node! For example, is there a difference between the following two setups (both total 80 vCPUs and 320 GB RAM)?
  - 10 workers, each with 32 GB RAM and 8 CPUs
  - 5 workers, each with 64 GB RAM and 16 CPUs

  Are there scenarios where one is preferred over the other? In other words, when should I double the number of nodes vs. double the number of cores per node?
- If I have a cluster of 100 vCPUs and my analysis has X variants, will the same dataset with 3X variants take roughly three times as long on the same 100 vCPUs?
- (Regarding Q7 above) I didn't mean repartition; I meant that when calling import_vcf I can set n_partitions, right? Is it worth trying to tune this parameter?
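  To make the question concrete, I mean something like this (the bucket path and partition count are made up, and I'm going from memory on the parameter name):

  ```python
  import hail as hl

  hl.init()

  # Made-up path and value; the question is whether explicitly
  # setting n_partitions here beats the default partitioning.
  mt = hl.import_vcf(
      'gs://my-bucket/cohort.vcf.bgz',
      n_partitions=5000,
  )
  ```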
- I read here that "Spark really performs badly when executors are larger than ~4 cores" – is that still the case? Should I touch the `spark.executor.cores` parameter at all, or is the default of 1 still fine?
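  For reference, I mean setting it at submission time, e.g. (the values are just placeholders, not what I actually use):

  ```shell
  # Hypothetical spark-submit invocation; flag values are placeholders.
  spark-submit \
    --conf spark.executor.cores=4 \
    --conf spark.executor.memory=16g \
    my_hail_pipeline.py
  ```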
- Should workers always have 4 GB of memory per core? Is 2 GB per core a bad setup? How about 8 GB per core?
- Should I always use SSD workers, or is HDD just as good for Hail?
- I'm not really sure how to tune the block_size parameter. Could you please help me understand it, or point me to the right documentation on what affects this parameter?
- (Regarding Q5 above) How about the driver and executor memory parameters? Can I really leave these at their defaults and let Hail do its thing even on very large datasets? I'm not sure when I need to increase executor/driver memory in Spark – when I move beyond 10K samples, or 100K, or 500K?
- Are there any known 'optimal' memory amounts for 10K, 100K, or 1M samples? And likewise, 'optimal' numbers of CPUs/nodes for 10K, 100K, or 1M variants?
Thank you very much again, Dan – this has been extremely helpful for a beginner like me.