Questions about optimizing Hail and Spark configs and estimating resources and runtimes

Hey @01011!

Great questions!

  1. Yes. Holding the pipeline / analysis constant, cost and runtime scale as follows.
    1. Hail Tables and Matrix Tables are partitioned along the rows (aka the variants), so, all else equal, increasing the number of variants increases the number of partitions, which increases the core-hours needed to perform the analysis. If you keep the cluster size constant, twice as many variants should take roughly twice as long and cost roughly twice as much. If you also double the cluster size, the analysis should take roughly the same time and still cost roughly twice as much.
    2. Hail Tables and Matrix Tables are not partitioned along the columns (aka the samples) axis. Moreover, Hail generally needs to load a whole row and all of its associated computed data into memory. For example, suppose you are computing 50 statistics for each sample: if you double the number of samples, you’ll need enough memory to hold twice as many statistics. In practice, most pipelines are not memory-constrained enough for this to be a concern. I suggest trying your analysis on a small number of variants; if you run out of memory, double the amount of memory and try again.
  2. No. If the cores do not have partitions to work on, the analysis cannot move any faster. We recommend no fewer than three partitions per core in an analysis.
  3. I think in practice we’ve run into Spark issues at around 3,000 to 4,000 machines.
  4. Just use spot/preemptible n1-standard-8s unless you get out-of-memory errors or shuffle failures. If you get out-of-memory errors, use machines with more RAM. If you get shuffle failures, use non-spot/non-preemptible machines.
  5. We don’t fiddle with those parameters very much. Hail uses a very thin slice of the Spark API, and in practice I don’t think these parameters have much impact on Hail.
  6. Are you using hailctl dataproc start? If not, you might consider setting some of the options hailctl sets when it starts a Dataproc cluster.
  7. We generally recommend a partition size of around 128 MB. Note that repartitioning is an expensive operation that can’t be run on spot/preemptible machines. I suspect whatever partitioning import_vcf chose is fine. (There’s a short sketch of how to check partition counts right after this list.)
  8. No. Use a cluster autoscaler.
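
To make the partition arithmetic concrete, here’s a minimal sketch (the paths and cluster size below are hypothetical) of checking a Matrix Table’s partition count against the three-partitions-per-core rule of thumb, and of asking import_vcf for more partitions up front instead of repartitioning later:

```python
import hail as hl

# Hypothetical path: a Matrix Table already written in Hail native format.
mt = hl.read_matrix_table('gs://my-bucket/my-dataset.mt')

# Rule of thumb from answer 2: at least ~3 partitions per core.
n_cores = 20 * 8  # e.g. a hypothetical cluster of 20 n1-standard-8 workers
print(f'{mt.n_partitions()} partitions for {n_cores} cores')

# If you control the import, ask for more partitions up front rather than
# repartitioning later (repartitioning is expensive, per answer 7).
mt2 = hl.import_vcf(
    'gs://my-bucket/my-dataset.vcf.bgz',  # hypothetical path
    reference_genome='GRCh38',
    min_partitions=3 * n_cores,
)
```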

A few more things:

  1. Do not use HDFS for anything. Use S3/GCS/ABS. Make sure to set your tmp_dir in hl.init to a cloud storage path.
  2. Remember that you’re charged for core-hours and Hail scales quite efficiently: if you double the number of cores you’re using, you’ll pay roughly the same amount but finish your analysis in roughly half the time.
  3. Never compute directly on the output of an import_* [1]. Import the data, write it to Hail native format, then read the native file back and use that (sketched after this list).
  4. Test your pipeline on a subset of variants until you’re confident it is correct and runs as quickly as you expect. You can then extrapolate from the subset to predict the runtime on the full dataset (see the sketch after this list).
  5. You might find my “how to cloud with hail” docs useful.
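
For items 1 and 3, the pattern looks roughly like this (the bucket paths are hypothetical):

```python
import hail as hl

# Point Hail's temporary directory at cloud storage, not HDFS.
hl.init(tmp_dir='gs://my-bucket/tmp/')

# Import once, write to Hail native format, then read that back and use it.
mt = hl.import_vcf('gs://my-bucket/my-dataset.vcf.bgz', reference_genome='GRCh38')
mt.write('gs://my-bucket/my-dataset.mt', overwrite=True)
mt = hl.read_matrix_table('gs://my-bucket/my-dataset.mt')
```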
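
And for item 4, one way to carve out a test subset is to filter to a single contig or interval (the contig name and paths are hypothetical; adjust for your reference genome):

```python
# Restrict to chromosome 22 for a quick end-to-end test run.
subset = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval('chr22', reference_genome='GRCh38')],
)

# Checkpoint the subset so repeated test runs don't start from the full dataset.
subset = subset.checkpoint('gs://my-bucket/chr22-test.mt', overwrite=True)
```

Since cost and runtime scale with the number of variants (answer 1.1), the ratio of variant or partition counts between the subset and the full dataset gives a rough runtime extrapolation.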

[1] The only exception to this rule is BGEN. BGEN is a compact format that is efficient to import directly. If you were to write a BGEN to a Matrix Table, you would at least double the size of your dataset due to how Hail represents genotypes.
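
For example, a minimal sketch of using a BGEN directly (the paths, reference genome, and entry fields are hypothetical placeholders):

```python
import hail as hl

# Index the BGEN once, then import it directly; no native write is needed.
hl.index_bgen('gs://my-bucket/my-dataset.bgen', reference_genome='GRCh37')
mt = hl.import_bgen(
    'gs://my-bucket/my-dataset.bgen',
    entry_fields=['GT', 'GP'],
    sample_file='gs://my-bucket/my-dataset.sample',
)
```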
