While trying to aggregate a MatrixTable (~100 GB on disk) by rows (e.g. by gene and variant consequence), I’ve observed that Hail (Spark behind the scenes) requires up to 5× that much disk space to do the job.
Running some tests, the final aggregated table looks fine; however, I was wondering whether this behaviour is normal (I will need to scale the analysis at some point).
How could I control the shuffle write/read process?
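In case it helps while the underlying issue is open: Spark writes its shuffle files to local scratch directories, so one workaround for the disk pressure is to point those at a larger volume. This is a minimal sketch using standard Spark configuration, not something Hail-specific; the path below is a placeholder, not from the original post.

```shell
# Redirect Spark's shuffle/spill scratch space to a larger volume.
# /large/scratch is a placeholder path -- substitute your own.
export SPARK_LOCAL_DIRS=/large/scratch

# Equivalent Spark property if you configure Spark directly:
#   --conf spark.local.dir=/large/scratch
```

Note that `SPARK_LOCAL_DIRS` (or `spark.local.dir`) only moves where the shuffle data lands; it doesn’t reduce how much is written.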
You’ve definitely encountered a problem with how we shuffle data in group_rows_by/aggregate. We’re in the middle of a large infrastructure redesign, and I expect this will be fixed naturally in the next few weeks.
I’ve opened an issue to track it: https://github.com/hail-is/hail/issues/3641
Thanks for reporting this problem!!
I was wrong about the root cause of the problem: it’s due to an inefficiency introduced in this particular algorithm. I’ll update the issue to reflect that.
Thanks for looking into this issue. I’ll be waiting for the update!