Hey @tpoterba, thank you for looking over this so quickly yesterday! I’m still struggling with this one. I checkpointed the MT right before the key_rows_by (after annotating the MT with the new_locus_alleles info) and am now trying to checkpoint the MT right after that line using all workers.
So I’m worried I’m doing something wrong, and wondering if I need to restart it with a different cluster configuration or if I should just let it keep running. I started with the default hailctl dataproc cluster configuration and resized to 80 workers before running the above lines.
After discussing with Tim, I added some code he wrote that splits the key_rows_by into two parts: one for the rows that have a locus change and one for the rows that don’t. It then uses union_rows to combine the two rekeyed MTs. If I understand correctly, most rows do not have a locus change, so the key_rows_by does not need to shuffle those rows and only needs to shuffle the rows whose locus changed. Doing the split (followed by the union_rows) therefore avoids a shuffle on the entire MT. This solved my problem of the last 2 partitions hanging.
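
For anyone who finds this later, here is a rough sketch of what the split looks like. This is my own reconstruction, not Tim's exact code. It assumes the MT is keyed by (locus, alleles) and that new_locus_alleles is a row struct with locus and alleles fields. The function name rekey_split and the paths in the usage comment are made up for illustration.

```python
def rekey_split(mt):
    """Rekey a Hail MatrixTable by new_locus_alleles in two pieces.

    Assumptions (not confirmed by the original code): `mt` is keyed by
    (locus, alleles) and has a row field `new_locus_alleles` that is a
    struct with `locus` and `alleles` fields.
    """
    new = mt.new_locus_alleles

    # Rows whose locus is unchanged; these should stay in their current order.
    same = mt.filter_rows(new.locus == mt.locus)
    # Rows whose locus changed; these are the ones expected to need a shuffle.
    moved = mt.filter_rows(new.locus != mt.locus)

    # Rekey each piece separately by the new locus and alleles.
    same = same.key_rows_by(
        locus=same.new_locus_alleles.locus,
        alleles=same.new_locus_alleles.alleles,
    )
    moved = moved.key_rows_by(
        locus=moved.new_locus_alleles.locus,
        alleles=moved.new_locus_alleles.alleles,
    )

    # Combine the two rekeyed MatrixTables back into one.
    return same.union_rows(moved)


# Hypothetical usage (paths are placeholders):
# import hail as hl
# mt = hl.read_matrix_table("gs://my-bucket/before_rekey.mt")
# mt = rekey_split(mt).checkpoint("gs://my-bucket/after_rekey.mt")
```

Both pieces end up with the same row key and schema, which is what union_rows needs in order to combine them.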