Hey @tpoterba, thank you for looking over this so quickly yesterday! I’m still struggling with this one. I checkpointed the MT right before the key_rows_by (after annotating the MT with the new_locus_alleles info) and am trying to checkpoint the MT right after that line using all workers:
mt = hl.read_matrix_table("gs://gnomad-tmp/v3.1.2/intermediate_hgdp_tgp_subset_sparse_before_row_key.mt")
mt = mt.key_rows_by(
mt = mt.checkpoint("gs://gnomad-tmp/v3.1.2/intermediate_hgdp_tgp_subset_sparse_after_row_key.mt", overwrite=True)
However, it is hanging for a while at the last 2 partitions:
2021-09-23 05:40:46 Hail: INFO: Ordering unsorted dataset with network shuffle
[Stage 2:==================================================>(70862 + 2) / 70863]
So I’m worried I’m doing something wrong, and I’m wondering whether I need to restart it with a different cluster configuration or should just let it keep running. I started with the default hailctl dataproc cluster configuration and resized to 80 workers before running the lines above.