Hi, I’ve been working in the AoU Terra environment and reported this to the AoU data science team, but I’m not sure what their relationship with the Hail team is, so I’ll post here as well. I ran the following series of commands on a Hail MatrixTable:
import hail as hl

mt = hl.read_matrix_table(mt_exome_path)
It is very large (dimensions 34,807,589 × 245,394), so I ran the following code block, where aa_ids is a list of sample IDs and interval_table is imported from a BED file covering multiple non-contiguous genes:
# keep only the samples of interest
filtered_mt_aa = mt.filter_cols(aa_ids.contains(mt["s"]))
# keep only rows whose locus falls within one of the gene intervals
filtered_mt_aa_cancer = filtered_mt_aa.filter_rows(hl.is_defined(interval_table[filtered_mt_aa.locus]))
filtered_mt_aa_cancer = filtered_mt_aa_cancer.select_rows("variant_qc")
filtered_mt_aa_cancer = filtered_mt_aa_cancer.select_entries("GT")
Previously, this scaled with cluster size: a cluster of 100 workers plus 200 preemptibles, each with 4 cores and 15 GB of memory, would finish in ~30 minutes. There are ~86k partitions, and roughly 50 tasks were completing per second, which is consistent with that wall-clock time. This was on Hail 0.2.107 running on Spark 3.1.3.
Now, the same job on Hail 0.2.126 running on Spark 3.3.0 takes well over 4–5 hours (it never finished, because I kept running out of money). Task throughput dropped to about 4–5 per second. I shrank the cluster to 4 workers with 0 preemptibles, each with 8 cores and 30 GB, to see whether the slowdown would be significantly sublinear in cluster size and so save compute cost at the expense of wall-clock time. There was no further slowdown at all; in fact, there was a mild speedup, which I am happy about, but this still poses a significant problem for analyzing any dataset of considerable size. As another note, at 4 workers × 8 cores I would expect the maximum number of active tasks in the Spark progress bar to be 32; instead, I saw active task counts in the hundreds and sometimes even the thousands.
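For what it's worth, a back-of-the-envelope check shows the quoted task rates line up with the observed wall-clock times (using the ~86k partition count from above):

```python
# Rough sanity check of the wall-clock times implied by the task-throughput
# figures quoted above. Partition count and rates are as reported.
partitions = 86_000  # ~86k partitions

# Old setup (Hail 0.2.107 / Spark 3.1.3): ~50 tasks completing per second.
old_minutes = partitions / 50 / 60
print(f"old: {old_minutes:.0f} minutes")  # ~29 minutes, matching the ~30 min observed

# New setup (Hail 0.2.126 / Spark 3.3.0): ~4-5 tasks completing per second.
new_hours = partitions / 4.5 / 3600
print(f"new: {new_hours:.1f} hours")  # ~5.3 hours, matching the 4-5+ hours observed
```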
I tried reordering the operations to subset by interval first, completing that step before subsetting by sample, with no change in speed. I’m not sure whether this is a Spark issue or a Hail issue. I’m happy to share more logs; however, I have lost access to all the logs from the previous version of Hail, so I can only share logs from Hail 0.2.126.
Thank you!