I am new to Hail and Spark and I’m struggling with some of the optimization concepts and configs.
I expect some of the runs on the cloud to be extremely expensive (with hundreds of thousands of samples and millions of variants), so I want to reach out for some help to make sure I don’t unnecessarily choose very powerful workers and end up paying thousands of dollars or choose very underpowered workers and end up with failed runs after tens of hours and still accruing costs.
I’m working with pVCFs (human data). Could you please help me understand the following? I’m really having a hard time understanding the relation between all these parameters and files.
- Is there a relation between the number of samples/variants and memory/CPU usage and number of required nodes? For example if I go from 100,000 samples to 500,000 samples, do I necessarily need 5 times more memory and/or CPU cores?
- Does a higher number of nodes necessarily mean faster analyses?
- Do you recommend NOT going beyond a certain node number? For example, is there a top limit for the number of nodes which is known that going beyond that number wouldn’t really help in terms of speed?
- Is there any way at all to determine the optimal specs for the worker nodes? Does this depend on the file sizes or number of samples/variants?
- Is there a need or any benefit to optimizing Spark parameters such as openCostInBytes and maxPartitionBytes? How about maxFailures? If so, are there any formulas to determine how these should be exactly tuned?
- Are there any Spark parameters that you would suggest we pay special attention to?
- Do you suggest changing the number of partitions? Or would the defaults usually work well?
- Is there a way to determine the optimal number of nodes?
Hey @01011 !
- Yes, holding the pipeline / analysis constant, they change as follows.
- Hail Tables and Matrix Tables are partitioned along the rows (aka the variants), so, all else equal, increasing the number of variants increases the number of partitions which increases the core-hours needed to perform the analysis. If you keep the cluster size constant, twice as many variants should take roughly twice as much time and cost roughly twice as much. On the other hand, if you double the cluster size, your analysis should take roughly the same time and cost roughly twice as much.
- Hail Tables and Matrix Tables are not partitioned along the column (aka sample) axis. Moreover, Hail generally needs to load a whole row and all the associated computed data into memory. For example, suppose you are computing 50 statistics for each sample: if you double the number of samples, you’ll need enough memory to hold twice as many statistics. In practice, most pipelines are not so memory-intensive that this is a concern. I suggest trying your analysis on a small number of variants; if you find that you run out of memory, double the amount of memory and try again.
- No. If the cores do not have partitions to work on, the analysis cannot move any faster. We recommend no fewer than three partitions per core in an analysis.
- I think in practice we’ve found issues with Spark around 3k-4k machines.
- Just use spot/preemptible n1-standard-8s unless you get out-of-memory errors or shuffle failures. If you get memory errors, use more RAM. If you get shuffle failures, use non-spot/non-preemptible machines.
- We don’t fiddle with those parameters very much. Hail uses a very thin part of the Spark API and I think, in practice, these parameters don’t have a ton of impact on Hail.
- Are you using `hailctl dataproc start`? If not, you might consider setting some of the options we set in dataproc.
- We generally recommend a partition size around 128 MB. Note that repartitioning is an expensive operation that can’t be run on spot/preemptible machines. I suspect whatever `import_vcf` did is fine.
- No. Use a cluster autoscaler.
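The scaling rules above can be turned into a back-of-the-envelope estimator. This is a toy calculation, not part of the Hail API: the 128 MB partition target and the three-partitions-per-core floor come from the advice above, while the core-hours figure is a placeholder you would measure yourself on a pilot run over a subset of variants.

```python
# Back-of-the-envelope cluster sizing based on the rules of thumb above.
# Everything except the 128 MB partition target and the 3-partitions-per-core
# floor is a placeholder to calibrate with a small test run.

TARGET_PARTITION_BYTES = 128 * 1024**2  # ~128 MB per partition
MIN_PARTITIONS_PER_CORE = 3             # below this, extra cores sit idle

def n_partitions(dataset_bytes):
    """Rough partition count for a dataset at ~128 MB per partition."""
    return max(1, dataset_bytes // TARGET_PARTITION_BYTES)

def max_useful_cores(partitions):
    """Beyond this, adding cores cannot speed up the analysis."""
    return max(1, partitions // MIN_PARTITIONS_PER_CORE)

def est_hours(core_hours_needed, cores):
    """Wall-clock time if the work scales linearly across cores."""
    return core_hours_needed / cores

# Example: a 1 TiB dataset.
parts = n_partitions(1 * 1024**4)
print(parts)                      # 8192 partitions
print(max_useful_cores(parts))    # 2730 cores

# Cost is roughly constant as you scale out: total core-hours stay the same.
work = 400.0  # measured core-hours from a pilot run (placeholder)
print(est_hours(work, 100))  # 4.0 hours on 100 cores
print(est_hours(work, 200))  # 2.0 hours on 200 cores, ~same total cost
```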
A few more things:
- Do not use HDFS for anything. Use S3/GCS/ABS. Make sure to set your `hl.init` temporary directory (`tmp_dir`) to a cloud storage path.
- Remember that you’re charged for core-hours and Hail is quite efficient at scaling. If you double the number of cores you’re using, you’ll pay roughly the same amount but finish your analysis in half the time.
- Never use data directly from an `import_*`. Import the data, write it to Hail native format, then read the Hail native format and use that.
- Test your pipeline on a subset of variants until you’re confident the pipeline is correct and runs as quickly as you expect. You can use this subset of variants to predict the runtime on the full dataset.
- You might find my “how to cloud with hail” docs useful.
 The only exception to this rule is BGEN. BGEN is a compact format that is efficient to import directly. If you were to write a BGEN to a Matrix Table, you would at least double the size of your dataset due to how Hail represents genotypes.
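To make the import-once pattern above concrete, here is a minimal sketch using standard Hail calls (`hl.import_vcf`, `MatrixTable.write`, `hl.read_matrix_table`, `MatrixTable.sample_rows`). The bucket paths and the 1% sampling rate are hypothetical, and this is meant to run on a cluster against cloud storage, not locally.

```python
import hail as hl

# Hypothetical paths; point tmp_dir at cloud storage, never HDFS.
hl.init(tmp_dir='gs://my-bucket/tmp')

# 1. Import once, then write to Hail native format.
mt = hl.import_vcf('gs://my-bucket/data/*.vcf.bgz', reference_genome='GRCh38')
mt.write('gs://my-bucket/data/dataset.mt', overwrite=True)

# 2. All downstream analysis reads the native format instead of the VCF.
mt = hl.read_matrix_table('gs://my-bucket/data/dataset.mt')

# 3. Develop and cost-estimate on a small subset of variants first.
subset = mt.sample_rows(0.01, seed=42)  # ~1% of rows (variants)
subset = subset.checkpoint('gs://my-bucket/data/subset.mt', overwrite=True)
```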
Wow, thank you @danking ! This is extremely helpful.
I have a couple more follow-up questions based on your reply (not numbered in the same order as above):
Interesting! So if I have 100K samples with 1 million variants, and a cluster with 100 vCPUs that is able to handle it, then if I move up to 2 million variants, I simply need 200 vCPUs for the analysis to finish just as fast? And if I bump this to 400 vCPUs, it won’t necessarily run 4x faster, right?
I am still confused about increasing the total number of nodes vs increasing CPUs in each node! For example, is there a difference between the following:
- 10 workers, each with 32 GB RAM and 8 CPUs
- 5 workers, each with 64 GB RAM and 16 CPUs
Are there scenarios where one is preferred over the other?
In other words, when should I double the number of nodes vs double the number of cores in each node?
If I have a cluster of 100 vCPUs and my analysis has X variants, then the same dataset but with 3X variants will take 3 times as long on the same 100 vCPUs?
(Regarding Q7 above) - I didn’t mean re-partition, but when doing `import_vcf` I can set `n_partitions`, right? Is it better if I try to tune this parameter?
I read here, “Spark really performs badly when executors are larger than ~4 cores” – is that still the case? Should I touch the `spark.executor.cores` parameter at all? Or is the default of 1 still fine?
Should workers always have 4GB memory per core? Is 2GB per core a bad setup? How about 8GB per core?
Should I always use SSD workers? Or is HDD just as good for Hail?
I’m not really sure how I can tune the `block_size` parameter. Could you please help me understand it or point me to the correct documentation to understand what impacts this parameter?
(Regarding Q5 above) How about driver and executor memory parameters? Can I really leave these as default and let Hail do its thing even on very large datasets? I’m not really sure when I have to increase executor/driver memory in Spark. When I move beyond 10K samples, or 100K, or 500K?
Are there any known ‘optimal’ memory amounts for 10K, 100K, 1M samples? And likewise, ‘optimal’ number of CPUs/nodes for 10K, 100K, 1M variants?
Thank you very much again Dan, this has been extremely helpful for a beginner like me.