Table file sizes are different after checkpoint/write

Hi hail team!

I have a basic question about Hail Table file sizes. I had a Table with >100K partitions (most of them were empty), so I repartitioned the Table on read, checkpointed, and then overwrote the original file path:

ht = hl.read_table(path, _n_partitions=10000)  # repartition on read
ht = ht.checkpoint(tmp_path)                   # write to temp path, read back
ht.write(path, overwrite=True)                 # overwrite the original

However, I noticed the file sizes between the Table written at the original path and the temp path were slightly different.
Table at path: 26064 objects, 37120393 bytes (35.4 MiB)
Table at temp path: 26080 objects, 36527276 bytes (34.84 MiB)
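(For what it's worth, the byte counts do line up with the MiB figures, since 1 MiB = 2**20 bytes; a quick sanity check:)

```python
# Convert the reported byte totals to MiB (1 MiB = 2**20 bytes).
for name, size in [("original path", 37120393), ("temp path", 36527276)]:
    print(f"{name}: {size / 2**20:.2f} MiB")
# → original path: 35.40 MiB
# → temp path: 34.83 MiB (rounds to the 34.84 shown above)
```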

Do you know why these numbers wouldn’t be the same (and if this matters)?

Thanks!

Edit – sorry, I forgot to save the log before shutting down the cluster…

I don’t have the original log, but I ran this:

ht = hl.read_table(
    'gs://sites-for-relatedness-transfer-au-tmp/genomes_v3.1/ld_pruned_combined_variants.ht', 
    _n_partitions=10000,
)
ht.write(
    'gs://gnomad/sample_qc/ht/genomes_v3.1/ld_pruned_combined_variants.ht',
    overwrite=True,
)

The Table in gs://sites-for-relatedness-transfer-au-tmp is 6.43 MiB, and the Table in gs://gnomad is 6.3 MiB. Log:
test_write_size.log (7.7 MB)

This is expected, actually. Write and checkpoint use different compression levels: write uses the maximum-compression codec (slow to write, smallest file), while checkpoint uses a fast one (quicker to write, larger file). This is based on the assumption that written files will be read many times, but checkpointed files only a few times.
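The tradeoff isn't Hail-specific; you can see the same effect with any compressor that exposes a level knob. A minimal sketch using Python's built-in zlib (the sample data here is made up, just something repetitive standing in for row blocks):

```python
import zlib

# Repetitive sample data standing in for a Table's encoded rows.
data = b"chr1\t12345\tA\tG\n" * 10_000

fast = zlib.compress(data, level=1)  # fast codec: quicker to write, larger output
best = zlib.compress(data, level=9)  # max codec: slower to write, smaller output

print(len(fast), len(best))
# The max level never produces a larger result on data like this...
assert len(best) <= len(fast)
# ...and both round-trip to identical bytes, so only write/read cost differs.
assert zlib.decompress(fast) == zlib.decompress(best) == data
```

So both copies of the Table hold the same rows; only the on-disk encoding (and hence the byte count) differs.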

That’s super reassuring, thank you!! I’ve never checked file sizes after checkpointing, so I never noticed.