I’m maintaining a Hail-related template notebook in the Broad’s Terra environment, where it runs as a Jupyter notebook on a Spark cluster, as part of one of the BioData Catalyst grants. I am wondering what the best way is to optimize for this kind of environment.
Previously, we told users to use the following configuration when running on Freeze 5b data, n=1000:
Master node: 8 CPUs, 30 GB mem, 500 GB disk space
Workers: 170 workers of which 50 are preemptible, each having 8 CPUs, 30 GB mem, and 100 GB disk space.
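For reference, the totals implied by that configuration can be worked out as below. This is only arithmetic on the numbers listed above; the variable names are illustrative, and the master node is left out because it is not a worker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    cpus: int
    mem_gb: int
    disk_gb: int


# Worker shape and counts taken from the configuration above.
WORKER = NodeSpec(cpus=8, mem_gb=30, disk_gb=100)
N_WORKERS = 170
N_PREEMPTIBLE = 50

total_cores = N_WORKERS * WORKER.cpus            # 170 * 8
total_mem_gb = N_WORKERS * WORKER.mem_gb         # 170 * 30
preemptible_cores = N_PREEMPTIBLE * WORKER.cpus  # 50 * 8
preemptible_share = preemptible_cores / total_cores

print(f"worker cores: {total_cores}")
print(f"worker memory (GB): {total_mem_gb}")
print(f"cores on preemptibles: {preemptible_cores} ({preemptible_share:.0%})")
```

So roughly 29% of the worker cores sit on preemptible machines.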
Now we’re running on Freeze 8 data, which has a truckload more variants. I am testing on an n=1111 cohort of genetically similar individuals, but the previous configuration appears inadequate, likely because of the change to Freeze 8. If I run only on chr 1, I can complete the notebook with that compute in just under 50 minutes, but on all chromosomes it appears that either the run is taking too long or a preemptible is getting stuck. Avoiding preemptibles is an option, but it doesn’t fix the underlying issue. If at all possible, we would like users to be able to complete the notebook in a few hours at most for n≈1000 on Freeze 8 data, because compute continues to cost money after all tasks in the notebook have finished. Terra has a feature that “pauses” the compute after about half an hour of no activity, but once you’re requesting this amount of resources, you are still spending about $20/hour while the compute is paused. In other words, the notebook can’t take so long that the user goes to bed and wakes up to a bunch of unnecessary extra charges.
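To put a rough number on the overnight scenario, here is a small sketch using the approximate $20/hour paused rate quoted above. The 8 hours of idle time is a hypothetical figure chosen for illustration, and the half hour of full-price compute before Terra pauses the cluster is not included.

```python
# Approximate paused rate quoted above; the real figure depends on the cluster.
PAUSED_RATE_USD_PER_HOUR = 20.0


def paused_cost(hours_paused: float, rate: float = PAUSED_RATE_USD_PER_HOUR) -> float:
    """Cost in USD of leaving the paused cluster allocated for `hours_paused` hours."""
    return hours_paused * rate


# Hypothetical: the user is away for 8 hours after the notebook finishes.
overnight_usd = paused_cost(8)
print(f"paused overnight: ${overnight_usd:.2f}")
```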
The specific tasks being performed are related to a GWAS. The GWAS itself is performed outside of the notebook; the notebook only does the preparation. Essentially, the tasks are:
- Loading in VCF data and creating a Hail MatrixTable (mt)
- mt.count()
- Merging the Hail table with a pandas DataFrame containing phenotypic data
- Filtering for common variants (MAF > 0.01) via variant_qc()
- hl.ld_prune(), which tends to take the longest
- Performing a normalized PCA
- Calculating a GRM
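A minimal sketch of the steps above in Hail's Python API, to make the question concrete. It is not the exact notebook code. Everything not named in the list is an assumption: the function name, the `sample_id` column, the GRCh38 reference, the `r2` value, the number of PCs, the use of `hl.hwe_normalized_pca` for the "normalized PCA", and `hl.genetic_relatedness_matrix` for the GRM. It also assumes biallelic variants.

```python
def prepare_for_gwas(vcf_path, pheno_df, sample_col="sample_id",
                     reference_genome="GRCh38", maf_threshold=0.01,
                     r2=0.2, n_pcs=10):
    """Run the GWAS preparation steps; all parameter defaults are assumptions."""
    # Imported here so the definition loads on a machine without Hail installed.
    import hail as hl

    # 1. Load the VCF data into a MatrixTable.
    mt = hl.import_vcf(vcf_path, reference_genome=reference_genome)

    # 2. Count variants (rows) and samples (columns); this forces a full pass.
    n_variants, n_samples = mt.count()

    # 3. Attach phenotypes from a pandas DataFrame, keyed by sample ID.
    pheno_ht = hl.Table.from_pandas(pheno_df, key=sample_col)
    mt = mt.annotate_cols(pheno=pheno_ht[mt.s])

    # 4. Keep common variants: minor allele frequency above the threshold.
    #    For a biallelic site, the minor allele frequency is min(AF).
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(hl.min(mt.variant_qc.AF) > maf_threshold)

    # 5. LD-prune; ld_prune returns a table of the variants to keep.
    pruned = hl.ld_prune(mt.GT, r2=r2)
    mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))

    # 6. PCA on HWE-normalized genotypes (loadings are not computed by default).
    eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=n_pcs)

    # 7. Genetic relatedness matrix, returned as a BlockMatrix.
    grm = hl.genetic_relatedness_matrix(mt.GT)

    return {
        "n_variants": n_variants,
        "n_samples": n_samples,
        "mt": mt,
        "eigenvalues": eigenvalues,
        "scores": scores,
        "grm": grm,
    }
```

Because Hail evaluates lazily, steps 5 to 7 in a sketch like this each re-run the upstream import and filtering unless an intermediate result is written out or checkpointed. Whether the real notebook does that is not stated here.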
I am already in touch with several Terra engineers, but I was advised to bring this question to the Hail community, since I have heard that Terra’s setup is not unlike common setups for Hail.