We scraped everything using gcloud dataproc clusters diagnose
and it seemed that results were being put in HDFS.
oh, totally, it’s the aggs per partition: https://github.com/hail-is/hail/blob/10e525a56787dd00b6aaed62d856c1eabd3c0f3a/hail/src/main/scala/is/hail/expr/ir/TableIR.scala#L1666-L1674
What was the cluster config? non_preempts, preempts?
It’s 300 workers all with lots of disk, we’re not running out. I suggested lower workers with more cores and that failed too.
Update here to not forget. The resource we are running out of in the latest bug is definitely memory . I’ve tracked the change that introduced the regression to https://github.com/hail-is/hail/pull/8794 and we are continuing to investigate.
1 Like
This PR fixes the problem:
1 Like