While trying to aggregate a MatrixTable (~100 GB on disk) by rows (e.g. by gene and variant consequence), I’ve observed that Hail (Spark behind the scenes) requires up to 5× that much disk space to do the job.
Running some tests, the final aggregated table looks fine; however, I was wondering whether this behaviour is normal (I will need to scale the analysis at some point).
How could I control the shuffle write/read process?
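In case it helps while the underlying issue is open: Spark writes its shuffle files to local scratch directories, so one workaround for the disk pressure is to point those at a larger volume. This is a minimal sketch using standard Spark configuration, not something Hail-specific; the path below is a placeholder, not from the original post.

```shell
# Redirect Spark's shuffle/spill scratch space to a larger volume.
# /large/scratch is a placeholder path -- substitute your own.
export SPARK_LOCAL_DIRS=/large/scratch

# Equivalent Spark property if you configure Spark directly:
#   --conf spark.local.dir=/large/scratch
```

Note that `SPARK_LOCAL_DIRS` (or `spark.local.dir`) only moves where the shuffle data lands; it doesn’t reduce how much is written.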
You’ve definitely encountered a problem with how we shuffle data in group_rows_by/aggregate. We’re in the middle of a large infrastructure redesign, and I expect this will be fixed naturally in the next few weeks.
I’ve opened an issue to track it: https://github.com/hail-is/hail/issues/3641
Thanks for reporting this problem!!
I was wrong about the root cause of the problem: it’s due to an inefficiency introduced in this particular algorithm. I’ll update the issue to reflect that.
Thanks for looking into this issue. I’ll be waiting for the update!