Hail/Apache Spark Not Scaling by Cluster Size

Hi, I’ve been working in the AoU Terra environment and reported this to the AoU data science team, but I’m not sure what their relationship with the Hail team is, so I’m posting here as well. I ran the following series of commands on a Hail MatrixTable:
mt = hl.read_matrix_table(mt_exome_path)
It is very large, dim=(34807589, 245394), so I ran the following code block, where aa_ids is a list of sample IDs and interval_table is imported from a BED file covering multiple non-contiguous genes:

filtered_mt_aa = mt.filter_cols(aa_ids.contains(mt["s"]))
filtered_mt_aa_cancer = filtered_mt_aa.filter_rows(hl.is_defined(interval_table[filtered_mt_aa.locus]))
filtered_mt_aa_cancer = filtered_mt_aa_cancer.select_rows("variant_qc")
filtered_mt_aa_cancer = filtered_mt_aa_cancer.select_entries("GT")
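For context, a self-contained sketch of this setup, assuming the BED file is imported with hl.import_bed and the sample list is wrapped with hl.literal (the path, reference genome, and sample_id_list name are placeholders, not my verbatim code):

import hail as hl

# Assumed reconstruction of the setup above; path, reference genome, and the
# hl.import_bed / hl.literal wrapping are placeholders for illustration only.
mt = hl.read_matrix_table(mt_exome_path)
interval_table = hl.import_bed('gs://my-bucket/genes.bed', reference_genome='GRCh38')
aa_ids = hl.literal(set(sample_id_list))  # sample IDs as a Hail set expression

filtered_mt_aa = mt.filter_cols(aa_ids.contains(mt["s"]))
filtered_mt_aa_cancer = filtered_mt_aa.filter_rows(hl.is_defined(interval_table[filtered_mt_aa.locus]))
filtered_mt_aa_cancer = filtered_mt_aa_cancer.select_rows("variant_qc")
filtered_mt_aa_cancer = filtered_mt_aa_cancer.select_entries("GT")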

Previously, this would scale with cluster size, so a cluster of 100 workers plus 200 preemptibles, each with 4 cores and 15 GB of memory, would finish in ~30 minutes. There are ~86k partitions. About 50 tasks were completing per second, so that rate implies the approximate wall-clock time per task. This was on Hail 0.2.107 running on Spark 3.1.3.

Now, the same job on Hail 0.2.126 running on Spark 3.3.0 is taking well over 4-5 hours (it never finished because I kept running out of money). The task completion rate dropped to about 4-5 per second. I set the cluster size to 4 workers with 0 preemptibles, at 8 cores and 30 GB each, to see whether, even with a significant slowdown in task completion, the slowdown would be significantly smaller than linear and save me compute cost at the expense of wall-clock time. There was no slowdown in speed at all; in fact, there seems to be a mild increase, which I am happy about, but this poses a significant problem for analyzing any dataset of considerable size. As another note, at 4 workers × 8 cores I would expect the maximum number of active tasks shown by the Spark progress bar to be 32; however, I was seeing active task counts in the hundreds and sometimes even the thousands.

I tried altering the order of operations to subset by interval and complete that operation before subsetting by sample, and saw no change in speed; a sketch of the reordered version is below. I’m not sure whether this is a Spark or a Hail issue. I’m happy to share more logs; however, I have lost access to the logs from the previous versions of Hail, so I can only share logs from Hail 0.2.126.
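Roughly, the reordered version (interval filter first, then the sample filter; same inputs and variable conventions as above):

# Reordered sketch: filter rows by interval first, then filter columns by sample.
filtered_mt_cancer = mt.filter_rows(hl.is_defined(interval_table[mt.locus]))
filtered_mt_cancer_aa = filtered_mt_cancer.filter_cols(aa_ids.contains(filtered_mt_cancer["s"]))
filtered_mt_cancer_aa = filtered_mt_cancer_aa.select_rows("variant_qc")
filtered_mt_cancer_aa = filtered_mt_cancer_aa.select_entries("GT")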

Thank you!

Experiencing the same issue. I’d also appreciate some suggestions from Hail team.

Hi @Hersh_Gupta ,

Sorry to hear you’re running into trouble!

Do I understand correctly that the issue you see is that you have a cluster with 1200 [1] cores, the Spark progress bar showed (N_COMPLETE+1200)/N_PARTITIONS (indicating 1200 active partitions), and approximately 50 partitions completed per second (indicating 1200/50 = 24 seconds per partition)?

In Hail 0.2.126, do I understand correctly that you had 32 cores, a progress bar like (N+1000)/M, and a completion rate of 5 per second?

The Spark progress bar is controlled by Spark. If it’s showing +1000, either Spark has a bug or your tasks are completing so fast that the Spark driver cannot update the progress bar fast enough. This can also happen when tasks rapidly fail due to memory exhaustion. There was an issue in recent versions of Google Dataproc where Dataproc started using more memory than we anticipated, so we had to reduce the amount of memory allocated to Hail; see Hail issue 13960 for details. AFAIK, changes were made to the RWB to avoid this memory exhaustion issue.
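If it helps rule memory exhaustion in or out, one quick check (a sketch, assuming you can run code in the same session) is to print the memory-related settings of the running Spark session:

import hail as hl

# Print the memory-related Spark settings to confirm how much memory the
# driver and executors actually received in this session.
sc = hl.spark_context()
for key, value in sc.getConf().getAll():
    if 'memory' in key.lower():
        print(key, '=', value)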


Some general comments:

It is challenging to diagnose without the full script you ran and the number of partitions (.n_partitions()) in every table and matrix table. We lack access to the AoU data, so it’s also challenging for us to reproduce.
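For example, something like this (using the variable names from your snippet) would tell us the partition counts:

# Partition counts for each table / matrix table involved; n_partitions() is cheap to call.
print('exome MT partitions:      ', mt.n_partitions())
print('interval table partitions:', interval_table.n_partitions())
print('filtered MT partitions:   ', filtered_mt_aa_cancer.n_partitions())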

If you changed the data at all between these versions, that could affect the speed in unexpected ways, for example if the number of partitions in the interval table changed substantially. [2]

Can you reproduce this issue on data and a script that you can share with us?


[1] This is a tangent, but you should essentially never use a mixed cluster. If you’re “shuffling” (sorting) data (i.e. key_rows_by, key_by, order_by, group_rows_by(…).aggregate), you must use a cluster of exclusively non-spot instances. If you’re not shuffling, there is no benefit to non-spot instances; they just cost more. A corollary of this is that you should never use HDFS; always use Google Cloud Storage, including for the Hail temporary directory hl.init(..., tmp_dir=...).
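For example (the bucket name is a placeholder):

import hail as hl

# Keep the Hail temporary directory on GCS (placeholder bucket) rather than
# on the cluster's HDFS.
hl.init(tmp_dir='gs://my-bucket/hail-tmp/')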

[2] In general, unless you have tens of thousands of intervals, I’d recommend using hl.filter_intervals. It is much faster because it instructs Hail not to read the unnecessary data at all.
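For example (the gene coordinates below are placeholders):

# Filter to a handful of gene intervals up front so Hail can skip reading the
# irrelevant partitions entirely. Coordinates are placeholders, not your genes.
intervals = [hl.parse_locus_interval(x, reference_genome='GRCh38')
             for x in ['chr17:43044295-43125364', 'chr13:32315086-32400268']]
mt_genes = hl.filter_intervals(mt, intervals)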