@jkgoodrich reached out to me with this issue tonight. This is the result of my investigation so far:
On this tree:
We are attempting to run the
to get a sites QC table.
Writing the result of this function fails because the executors run out of memory.
I can reliably reproduce this using the following:
- Find a bad partition on which the full pipeline seems to fail reliably.
- Filter the mt to just that partition using
- Run the pipeline on a 1-core worker.
The memory pressure observed here is bad enough that it remains an issue even on highmem machines. However, the partition in question does finish computing if its executor has excess capacity.
I’ve attached a log here. Unfortunately, either our Discourse or Discourse in general will not accept compressed files, so the log has been base64-encoded and can be read with:
base64 -d qc_annotations_partial.log.xz.txt | xzless
qc_annotations_partial.log.xz.txt (3.5 MB)
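
If xzless is not available, the same two-step decode (base64 first, then xz) can be sketched in Python using only the standard library. The function names decode_log and decode_log_file are illustrative and are not part of any Hail tooling; this assumes the attachment is plain base64 text wrapping a single xz stream, as the command above implies.

```python
import base64
import lzma


def decode_log(encoded: bytes) -> str:
    """Undo the base64 layer, then the xz layer, and return the log text."""
    # b64decode ignores line breaks in the encoded input by default.
    compressed = base64.b64decode(encoded)
    # lzma.decompress auto-detects the .xz container format.
    return lzma.decompress(compressed).decode("utf-8", errors="replace")


def decode_log_file(path: str) -> str:
    """Read a base64-encoded .xz file from disk and return its decoded text."""
    with open(path, "rb") as handle:
        return decode_log(handle.read())
```

Calling decode_log_file("qc_annotations_partial.log.xz.txt") returns the whole log as a single string, which is fine for a file of this size but loads everything into memory at once.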