Table file sizes are different after checkpoint/write

Hi hail team!

I have a basic question about Hail Table file sizes. I had a Table with >100K partitions (most of them were empty), so I repartitioned the Table on read, checkpointed, and then overwrote the original file path:

ht = hl.read_table(path, _n_partitions=10000)  # repartition on read
ht = ht.checkpoint(tmp_path)                   # write to temp path, read back
ht.write(path, overwrite=True)                 # overwrite the original

However, I noticed the file sizes between the Table written at the original path and the temp path were slightly different.
Table at path: 26064 objects, 37120393 bytes (35.4 MiB)
Table at temp path: 26080 objects, 36527276 bytes (34.84 MiB)
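(For what it's worth, the byte counts do line up with the MiB figures, since 1 MiB = 2**20 bytes; a quick sanity check:)

```python
# Convert the reported byte totals to MiB (1 MiB = 2**20 bytes).
for name, size in [("original path", 37120393), ("temp path", 36527276)]:
    print(f"{name}: {size / 2**20:.2f} MiB")
# → original path: 35.40 MiB
# → temp path: 34.83 MiB (rounds to the 34.84 shown above)
```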

Do you know why these numbers wouldn’t be the same (and if this matters)?

Thanks!

Edit – sorry, I forgot to save the log before shutting down the cluster…

I don’t have the original log, but I ran this:

ht = hl.read_table(
    'gs://sites-for-relatedness-transfer-au-tmp/genomes_v3.1/ld_pruned_combined_variants.ht', 
    _n_partitions=10000,
)
ht.write(
    'gs://gnomad/sample_qc/ht/genomes_v3.1/ld_pruned_combined_variants.ht',
    overwrite=True,
)

The Table in gs://sites-for-relatedness-transfer-au-tmp is 6.43 MiB, and the Table in gs://gnomad is 6.3 MiB. Log:
test_write_size.log (7.7 MB)

This is expected, actually. Write and checkpoint use different compression levels: write uses the maximum-compression codec (slow to write, smallest file), while checkpoint uses a fast one (quicker to write, larger file). This is based on the assumption that written files will be read many times, but checkpointed files only a few times.
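The tradeoff isn't Hail-specific; you can see the same effect with any compressor that exposes a level knob. A minimal sketch using Python's built-in zlib (the sample data here is made up, just something repetitive standing in for row blocks):

```python
import zlib

# Repetitive sample data standing in for a Table's encoded rows.
data = b"chr1\t12345\tA\tG\n" * 10_000

fast = zlib.compress(data, level=1)  # fast codec: quicker to write, larger output
best = zlib.compress(data, level=9)  # max codec: slower to write, smaller output

print(len(fast), len(best))
# The max level never produces a larger result on data like this...
assert len(best) <= len(fast)
# ...and both round-trip to identical bytes, so only write/read cost differs.
assert zlib.decompress(fast) == zlib.decompress(best) == data
```

So both copies of the Table hold the same rows; only the on-disk encoding (and hence the byte count) differs.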

That’s super reassuring, thank you!! I’ve never checked file sizes after checkpointing, so I never noticed.