Out of memory in Allele Specific and Site Level INFO calculation on a sparse matrix table

@jkgoodrich reached out to me with this issue tonight. This is the result of my investigation so far:

On this tree:

We are attempting to run the compute_info function:

To get a sites QC table.

Attempting to write the result of this function fails because the executors run out of memory.

I can reliably reproduce this using the following:

  1. Find a bad partition, i.e. one on which the full pipeline seemed to fail reliably.
  2. Filter the mt to just that partition using _filter_partitions.
  3. Run the pipeline on a 1 core worker.
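
The reproduction steps above can be sketched roughly as follows. This is only an illustration, not the exact script I ran: run_on_single_partition is a hypothetical helper, the partition index and paths are placeholders, and pipeline stands for whatever callable wraps compute_info (its real signature is not shown here). The only Hail-specific call it relies on is the private _filter_partitions method mentioned in step 2.

```python
from typing import Any, Callable


def run_on_single_partition(
    mt: Any,
    partition_index: int,
    pipeline: Callable[[Any], Any],
    output_path: str,
) -> Any:
    """Restrict `mt` to one partition, run `pipeline` on it, and write the result.

    `mt` is expected to be a Hail MatrixTable; `pipeline` is any callable that
    takes the filtered MatrixTable and returns something with a `.write` method
    (for example a wrapper around compute_info; that wrapper is an assumption).
    """
    # Keep only the partition that reliably triggers the failure (step 2).
    single = mt._filter_partitions([partition_index])
    # Run the same computation the full pipeline would run.
    result = pipeline(single)
    # Writing forces evaluation, which is where the memory failure shows up.
    result.write(output_path, overwrite=True)
    return result


# Hypothetical usage (index and paths are placeholders, not real values):
#   import hail as hl
#   hl.init(master="local[1]")  # a single core, to mimic a 1 core worker
#   mt = hl.read_matrix_table("path/to/sparse.mt")
#   run_on_single_partition(mt, 0, my_compute_info_wrapper, "path/to/out.ht")
```

Isolating one partition on a single core keeps the run small and takes other partitions sharing the executor out of the picture.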

The memory pressure observed here is severe enough that it remains an issue even on highmem machines. However, the partition in question does finish computing if its executor has excess capacity.

I’ve attached a log here. Unfortunately, either our Discourse or Discourse in general will not accept compressed files, so the log has been base64 encoded. It can be read with base64 -d qc_annotations_partial.log.xz.txt | xzless

qc_annotations_partial.log.xz.txt (3.5 MB)
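
If base64 or xzless is not available, the same decoding can be done with the Python standard library. This is a minimal sketch; decode_log is a hypothetical helper name, and it assumes the attachment is plain base64 text wrapping an xz stream, as described above.

```python
import base64
import lzma


def decode_log(encoded: bytes) -> str:
    """Undo the base64 encoding, then the xz compression, and return the log text."""
    # b64decode ignores newlines in the input by default.
    compressed = base64.b64decode(encoded)
    # lzma.decompress auto-detects the .xz container format.
    raw = lzma.decompress(compressed)
    # Replace undecodable bytes rather than failing on a stray character.
    return raw.decode("utf-8", errors="replace")


# Hypothetical usage with the attached file:
#   with open("qc_annotations_partial.log.xz.txt", "rb") as f:
#       print(decode_log(f.read()))
```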

fixed by: