Running Hail on Terra -- how should I optimize?

I’m maintaining a Hail-related template notebook that runs in a Jupyter notebook on a Spark cluster in the Broad’s Terra environment, as part of the BioData Catalyst grant. I’m wondering what the best way is to optimize for this kind of environment.

Previously, we told users to use the following configuration when running on Freeze 5b data, n=1000:

Master node: 8 CPUs, 30 GB mem, 500 GB disk space
Workers: 170 workers of which 50 are preemptible, each having 8 CPUs, 30 GB mem, and 100 GB disk space.

Now we’re running on Freeze 8 data, which has a truckload more variants. I am testing on an n=1111 cohort of genetically similar individuals, and likely due to the change to Freeze 8 the previous configuration appears inadequate. If I run on chr 1 only, I can complete the notebook with that compute in just under 50 minutes, but on all chromosomes it either takes far too long or a preemptible gets stuck.

Avoiding preemptibles is an option, but it doesn’t fix the underlying issue: if at all possible, we would like the notebook to be completable by users in a few hours at most for n≈1000 on Freeze 8 data, because compute keeps costing money after all tasks in the notebook have completed. Terra has a feature that “pauses” the compute after about half an hour of no activity, but once you’re requesting this amount of resources, you’re still spending about $20/hour while the compute is paused. In other words, it can’t take so long that the user goes to bed and wakes up to a bunch of unnecessary extra charges.

The specific tasks being performed are related to a GWAS. The GWAS itself is performed outside of the notebook; this notebook is the preparation. Essentially, the tasks are (a rough sketch in Hail follows the list):

  • Loading in VCF data and creating a hail matrix (mt)
  • mt.count()
  • Merging the hail table with a pandas dataframe containing phenotypic data
  • Filtering for common variants (MAF > 0.01) via variant_qc()
  • hl.ld_prune(), which tends to take the longest
  • Performing a normalized PCA
  • Calculating a GRM
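
In rough Hail code, the pipeline looks something like the sketch below (the bucket paths, the pandas DataFrame of phenotypes, its sample_id column, and the filter thresholds are placeholders, not the exact notebook code):

import hail as hl
import pandas as pd

hl.init()

# load VCF data and create a Hail MatrixTable
mt = hl.import_vcf('gs://my-bucket/freeze8/*.vcf.gz', force_bgz=True, reference_genome='GRCh38')
mt.count()

# merge in phenotypic data from a pandas dataframe keyed by sample id
pheno_df = pd.read_csv('phenotypes.csv')  # placeholder phenotype file
pheno = hl.Table.from_pandas(pheno_df, key='sample_id')
mt = mt.annotate_cols(pheno=pheno[mt.s])

# filter for common variants (MAF > 0.01) via variant_qc
mt = hl.variant_qc(mt)
mt = mt.filter_rows((mt.variant_qc.AF[1] > 0.01) & (mt.variant_qc.AF[1] < 0.99))

# LD pruning -- tends to take the longest
pruned = hl.ld_prune(mt.GT, r2=0.2)
mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))

# normalized PCA and GRM
eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=10)
grm = hl.genetic_relatedness_matrix(mt.GT)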

I am already in touch with several Terra engineers, but they recommended I take this question to the Hail community, as I have heard the setup of Terra is not entirely unlike common setups for Hail.

Hey @aofarrel!

Sorry you’re having performance headaches and high cost! I’m sure we can address this. Can you point us at the notebook? It’s easier for me to make suggestions on code.


Generally speaking, you don’t want to avoid preemptibles. You actually want as many preemptibles as possible for as long as possible! Hail power users often start a cluster, submit a Hail pipeline that does all the preemptible-safe computation and writes out the result, then re-configure the cluster for a subsequent pipeline that is not preemptible-safe. I wrote up some general notes here, but I’m happy to make specific comments on the notebook in question.
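
Concretely, outside of Terra that re-configuration is usually just resizing the preemptible (secondary) worker pool between submissions. The cluster name, worker counts, and script names below are made up:

# phase 1: many preemptible workers for the preemptible-safe work
gcloud dataproc clusters update my-cluster --num-secondary-workers=150
gcloud dataproc jobs submit pyspark prep_and_qc.py --cluster=my-cluster

# phase 2: drop the preemptibles and use non-preemptible workers for the shuffle-heavy steps
gcloud dataproc clusters update my-cluster --num-secondary-workers=0 --num-workers=20
gcloud dataproc jobs submit pyspark prune_pca_grm.py --cluster=my-cluster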


Looking quickly at the tasks, I’d structure this as:

  1. Import the VCF and write it out as a MatrixTable (MT). We’ll never touch the VCF again.
  2. Load MT, annotate with phenotypic data, calculate statistics, and write out just the rows of this table (much less data!). For example,
mt = mt.filter_rows((mt.variant_qc.AF[0] > 0.2) & (mt.variant_qc.call_rate > 0.9) & ...)  # filter_rows takes a boolean row expression
rows = mt.rows()
rows = rows.select()  # keep just the locus and alleles (the key fields)
rows.write('gs://.../qc_passing_variants')  # the write triggers all the QC calculations
  3. Now, for each of ld_prune, PCA, and GRM, I’d tune the cluster size, number of partitions, and preemptible to non-preemptible ratio to be successful for that operation. For each operation, I would freshly load and filter the matrix table. Once I have a set of ld_pruned variants, I’d write that out as another row table, to be used for filtering.
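
As a concrete sketch of that third step for ld_prune (the paths are placeholders, and the r2/window parameters would need tuning):

import hail as hl

# fresh load for this operation, filtered down to the QC-passing variants written earlier
mt = hl.read_matrix_table('gs://my-bucket/freeze8.mt')
qc_rows = hl.read_table('gs://my-bucket/qc_passing_variants')
mt = mt.filter_rows(hl.is_defined(qc_rows[mt.row_key]))

# prune, then persist the surviving variants as another small row table
pruned = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000)
pruned.write('gs://my-bucket/ld_pruned_variants')

# PCA and GRM later filter against that table instead of re-running ld_prune
pruned = hl.read_table('gs://my-bucket/ld_pruned_variants')
mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))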

I’m pretty sure GRM is preemptible-safe if you set the hail temporary directory to a GCS directory (e.g. hl.init(tmp_dir='gs://my-temporary-bucket/')).
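
Something along these lines (bucket and matrix table paths are placeholders):

import hail as hl

# keep Hail's temporary files in GCS so losing a preemptible worker doesn't lose them
hl.init(tmp_dir='gs://my-temporary-bucket/')

mt = hl.read_matrix_table('gs://my-bucket/freeze8.pruned.mt')
grm = hl.genetic_relatedness_matrix(mt.GT)  # returns a BlockMatrix
grm.write('gs://my-bucket/grm.bm', overwrite=True)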

ld_prune is a notoriously difficult operation. I’ll ask someone with more background to comment here about that operation.


Yeah, babysitting pipelines is a pain. I’d ask the gnomAD team how they deal with this. I’m sure they have good tactics. I personally run expensive jobs from a script where the cluster shutdown runs immediately after the submit finishes, successful or not. Is there a way to replicate this behavior in Terra?

gcloud dataproc clusters create ...
gcloud dataproc jobs submit ...
gcloud dataproc clusters stop ...
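
hailctl wraps this same flow, so the equivalent script with the Hail tooling would be roughly (cluster and script names made up):

hailctl dataproc start my-cluster --num-workers 20
hailctl dataproc submit my-cluster prep_and_qc.py
hailctl dataproc stop my-cluster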