Questions about optimizing Hail and Spark configs and estimating resources and runtimes

I am new to Hail and Spark and I’m struggling with some of the optimization concepts and configs.

I expect some of the runs on the cloud to be extremely expensive (hundreds of thousands of samples and millions of variants), so I want to reach out for help to make sure I don’t unnecessarily choose very powerful workers and end up paying thousands of dollars, or choose very underpowered workers and end up with failed runs after tens of hours while still accruing costs.

I’m working with pVCFs (human data). Could you please help me understand the following? I’m really having a hard time understanding the relation between all these parameters and files.

  1. Is there a relation between the number of samples/variants and memory/CPU usage and number of required nodes? For example if I go from 100,000 samples to 500,000 samples, do I necessarily need 5 times more memory and/or CPU cores?
  2. Does a higher number of nodes necessarily mean faster analyses?
  3. Do you recommend NOT going beyond a certain node number? For example, is there a top limit for the number of nodes which is known that going beyond that number wouldn’t really help in terms of speed?
  4. Is there any way at all to determine the optimal specs for the worker nodes? Does this depend on the file sizes or number of samples/variants?
  5. Is there a need or any benefit to optimizing Spark parameters such as openCostInBytes and maxPartitionBytes? How about maxFailures? If so, are there any formulas to determine how these should be exactly tuned?
  6. Are there any Spark parameters that you would suggest we pay special attention to?
  7. Do you suggest changing the number of partitions? Or would the defaults usually work well?
  8. Is there a way to determine the optimal number of nodes?

Hey @01011 !

Great questions!

  1. Yes, holding the pipeline / analysis constant, they change as follows.
    1. Hail Tables and Matrix Tables are partitioned along the rows (aka the variants), so, all else being equal, increasing the number of variants increases the number of partitions, which increases the core-hours needed to perform the analysis. If you keep the cluster size constant, twice as many variants should take roughly twice as much time and cost roughly twice as much. On the other hand, if you also double the cluster size, your analysis should take roughly the same time as before but still cost roughly twice as much.
    2. Hail Tables and Matrix Tables are not partitioned along the columns (aka samples) axis. Moreover, Hail generally needs to load a whole row and all the associated computed data into memory. For example, suppose you are computing 50 statistics for each sample. If you double the number of samples, you’ll need enough memory to hold twice as many statistics in memory. In practice, most pipelines aren’t so tightly memory-optimized that this becomes a concern. I suggest trying your analysis on a small number of variants. If you find that you run out of memory, try doubling the amount of memory and try again.
  2. No. If the cores do not have partitions to work on, the analysis cannot move any faster. We recommend no fewer than three partitions per core in an analysis (see the sketch after this list).
  3. I think in practice we’ve found issues with Spark at around 3,000-4,000 machines.
  4. Just use spot/preemptible n1-standard-8s unless you get out-of-memory errors or shuffle failures. If you get memory errors, use machines with more RAM. If you get shuffle failures, use non-spot/non-preemptible machines.
  5. We don’t fiddle with those parameters very much. Hail uses a very thin part of the Spark API and I think, in practice, these parameters don’t have a ton of impact on Hail.
  6. Are you using hailctl dataproc start? If not, you might consider setting some of the options we set in dataproc.
  7. We generally recommend a partition size around 128MB. Note that repartitioning is an expensive operation that can’t be run on spot/preemptible machines. I suspect whatever import_vcf did is fine.
  8. No. Use a cluster autoscaler.
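
To make the partitions-per-core rule of thumb above concrete, here is a minimal Python sketch; the dataset path and core count are hypothetical placeholders, not values from this thread:

```python
# Minimal sketch: check whether a given cluster size leaves enough partitions
# per core. The path and core count below are hypothetical placeholders.
import hail as hl

hl.init()

mt = hl.read_matrix_table('gs://my-bucket/my-dataset.mt')  # hypothetical path

n_partitions = mt.n_partitions()
cluster_cores = 100  # e.g. 12-13 n1-standard-8 workers

# Rule of thumb from above: keep at least ~3 partitions per core; with fewer,
# adding more cores cannot make the analysis any faster.
partitions_per_core = n_partitions / cluster_cores
print(f'{n_partitions} partitions / {cluster_cores} cores '
      f'= {partitions_per_core:.1f} partitions per core')
```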

A few more things:

  1. Do not use HDFS for anything. Use S3/GCS/ABS. Make sure to set your tmp_dir in hl.init to a cloud storage path.
  2. Remember that you’re charged for core-hours and Hail is quite efficient at scaling. If you double the number of cores you’re using, you’ll pay roughly the same amount but finish your analysis in half the time.
  3. Never use data directly from an import_* function [1]. Import the data, write it to Hail native format, then read the Hail native format and use that (see the sketch after this list).
  4. Test your pipeline on a subset of variants until you’re confident the pipeline is correct and runs as quickly as you expect. You can use this subset of variants to predict the runtime on the full dataset.
  5. You might find my “how to cloud with hail” docs useful.
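
Here is a minimal sketch tying points 1, 3, and 4 together; every path, the reference genome, and the choice of chr22 are hypothetical placeholders rather than anything specific to this thread:

```python
# Minimal sketch of the workflow above: cloud tmp_dir, import the pVCF once,
# write it to Hail native format, then develop the pipeline on a small subset.
# All paths below are hypothetical placeholders.
import hail as hl

hl.init(tmp_dir='gs://my-bucket/tmp/')  # a cloud storage path, never HDFS

# One-time import: read the pVCF and write a native MatrixTable.
mt = hl.import_vcf('gs://my-bucket/input/my-cohort.vcf.bgz',
                   reference_genome='GRCh38')
mt.write('gs://my-bucket/my-dataset.mt', overwrite=True)

# Every later analysis reads the native format, not the VCF.
mt = hl.read_matrix_table('gs://my-bucket/my-dataset.mt')

# While developing, restrict to a small region (e.g. one chromosome) and
# extrapolate runtime and cost from there.
test_mt = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr22', reference_genome='GRCh38')])
```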

[1] The only exception to this rule is BGEN. BGEN is a compact format that is efficient to import. If you were to write a BGEN to a Matrix Table, you would at least double the size of your dataset due to how Hail represents genotypes.
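
For completeness, a minimal sketch of that BGEN exception, again with hypothetical paths; BGEN files are indexed once and can then be queried directly without writing a MatrixTable first:

```python
# Minimal sketch of using BGEN directly instead of writing a MatrixTable.
# Paths are hypothetical placeholders.
import hail as hl

# One-time indexing step.
hl.index_bgen('gs://my-bucket/input/chr22.bgen', reference_genome='GRCh38')

# Later analyses can import the BGEN directly.
mt = hl.import_bgen('gs://my-bucket/input/chr22.bgen',
                    entry_fields=['GT'],
                    sample_file='gs://my-bucket/input/chr22.sample')
```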

Wow, thank you @danking ! This is extremely helpful.

I have a couple more follow-up questions based on your reply (not numbered in the same order as above):

  1. Interesting! So if I have 100K samples with 1 million variants, and a cluster with 100 vCPUs that is able to handle it, then if I move up to 2 million variants, I simply need 200 vCPUs for the analysis to finish just as fast? And if I bump this to 400 vCPUs, it won’t necessarily run 4x faster, right?

  2. I am still confused about increasing the total number of nodes vs increasing CPUs in each node! For example, is there a difference between the following:

10 workers, each with 32 GB RAM and 8 CPUs
5 workers, each with 64 GB RAM and 16 CPUs

Are there scenarios where one is preferred over the other?

In other words, when should I double the number of nodes vs double the number of cores in each node?

  1. If I have a cluster of 100 vCPUs and my analysis has X variants, then will the same dataset with 3X variants take roughly 3 times as long on the same 100 vCPUs?

  2. (Regarding Q7 above) - I didn’t mean re-partition, but when doing import_vcf I can set n_partitions, right? Is it better if I try to tune this parameter?

  3. I read here, “Spark really performs badly when executors are larger than ~4 cores” – is that still the case? Should I touch the spark.executor.cores parameter at all? Or is the default of 1 still fine?

  4. Should workers always have 4GB memory per core? Is 2GB per core a bad setup? How about 8GB per core?

  5. Should I always use SSD workers? Or is HDD just as good for Hail?

  6. I’m not really sure how I can tune the block_size parameter. Could you please help me understand it or point me to the correct documentation to understand what impacts this parameter?

  7. (Regarding Q5 above) How about driver and executor memory parameters? Can I really leave these at the defaults and let Hail do its thing even on very large datasets? I’m not really sure when I have to increase executor/driver memory in Spark. When I move beyond 10K samples, or 100K, or 500K?

  8. Are there any known ‘optimal’ memory amounts for 10K, 100K, 1M samples? And likewise, ‘optimal’ number of CPUs/nodes for 10K, 100K, 1M variants?

Thank you very much again Dan, this has been extremely helpful for a beginner like me.

  1. Yeah, this is roughly right. There are certain operations that are not linear in variants, though. For example, LD prune is quadratic in variants. PC-Relate is quadratic in samples. But that just means you’ll need roughly four times as many cores rather than twice as many if you double the samples. Obviously, at some point, you reach the limits of what the cloud can provide to you. We can’t scale forever, hah.

  2. There are subtle differences between these, but I don’t think it matters for your purposes. I recommend controlling the number of nodes and keeping the core count fixed at 8 CPUs per node, since that’s what we test against.

  3. Yeah this should roughly be true for sufficiently large X.

  4. I believe import_vcf does an OK job by default.

  5. Your cloud provider should set an executor-to-core ratio that works well in their environment. I’m not certain that Tim’s comment from 2018 still applies.

  6. We test with ~4GB per core. I don’t know about ~2GB per core. If you notice your VMs running out of memory, you can try ~8GB per core. In particular, if you have a very large number of samples, you might start to push memory limits, because Hail loads all the samples for a given variant into memory at once.

  7. I don’t have a strong opinion here. Hail stores data in blob storage, not on local hard drives. The hard drives are only used for Spark shuffles. We haven’t benchmarked the effect of HDD vs SSD for Spark shuffles.

  8. This is for pc_relate and other linear algebra methods, right? We default to 4096, but that can sometimes cause OOM issues. I think 2048 is almost always a good choice for block_size. This parameter dictates the blocking of Hail’s Block Matrix. 2048 means Hail will break a matrix into blocks of 2048x2048 64-bit floating point numbers. To compute a matrix multiply we need to read one block from the left matrix, one from the right matrix, and we need to keep a running sum block, so you need at least 3 * block_size * block_size * 8 bytes (worked out in the sketch after this list). In practice, Hail wastes memory, so we can’t set block_size to, say, 8K. I recommend 2048 or 4096.

  9. The driver is usually a small percentage of total cluster cost, so I just set it to a 16 core high memory machine and don’t think too much about it. Driver resources are proportional to both variants and samples.

  10. We don’t have detailed benchmarks for any of these. We generally recommend 128 MiB partitions, 8-core nodes, 4GB per core, spot / preemptible nodes, and autoscaling clusters. That should get you relatively good and cheap performance.
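
To put rough numbers on the block_size memory bound above, here is a small sketch of the 3 * block_size^2 * 8-byte estimate; actual memory use will be higher in practice:

```python
# Rough lower bound on per-task memory for a BlockMatrix multiply, using the
# 3 * block_size^2 * 8-byte estimate above; actual usage is higher.
for block_size in (2048, 4096, 8192):
    min_bytes = 3 * block_size * block_size * 8
    print(f'block_size={block_size}: at least ~{min_bytes / 1024**2:.0f} MiB')

# block_size=2048: at least ~96 MiB
# block_size=4096: at least ~384 MiB
# block_size=8192: at least ~1536 MiB
```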
