Persist vs checkpoint

ch-kr · June 9, 2020, 3:44pm

Hi hail team,

I have a question about using persist. I don’t fully understand when to use persist over checkpoint, and I tend to use checkpoint when I want to avoid redundant computation. Do you have documentation on how to select one over the other and when either one is more appropriate?

Thanks!

tpoterba · June 11, 2020, 11:58am

checkpoint is almost always going to be better. The one case where persist() may be preferred is if you’re writing loops in Python that iteratively query a dataset dozens or hundreds of times, in which case a persisted dataset may be slightly faster because parts of it will be in memory as well as on disk.
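
To make the rule of thumb above concrete, here is a minimal sketch, not official Hail guidance. It relies only on `Table.checkpoint(path, overwrite=...)` and `Table.persist()`. The helper name `materialize` and its `repeated_queries` flag are made up for illustration, and `ht` is assumed to be a Hail Table.

```python
def materialize(ht, checkpoint_path, repeated_queries=False):
    """Return a materialized copy of a Hail Table so later steps do not recompute it.

    Hypothetical helper. repeated_queries=True stands for the one case
    described above: a Python loop that queries the same table dozens
    or hundreds of times.
    """
    if repeated_queries:
        # Table.persist() caches the table (by default in memory and on disk).
        return ht.persist()
    # Table.checkpoint() writes the table to checkpoint_path and reads it back.
    return ht.checkpoint(checkpoint_path, overwrite=True)
```

A call might look like `ht = materialize(ht, 'gs://my-bucket/tmp/step1.ht')`, where the bucket path is a placeholder. Defaulting to checkpoint mirrors the advice that it is almost always the better choice.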

ch-kr · June 11, 2020, 4:01pm

Thank you! That helps. I have a specific follow-up question. Our pipeline adds some annotations to a Table and runs persist before calling this function: https://github.com/broadinstitute/gnomad_methods/blob/17bb157ea4703fea899852b55454c6a38bd7bcec/gnomad/variant_qc/random_forest.py#L162. Would a checkpoint be better for this step, or is persist better?

tpoterba · June 11, 2020, 6:05pm

Checkpoint seems like the right thing here.
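
For the annotate-then-train pattern in the question, a sketch along these lines may help. It is an assumption about how the pipeline could be arranged, not code from gnomad_methods: `annotate_and_checkpoint` is a hypothetical name, and only `Table.annotate(**fields)` and `Table.checkpoint(path, overwrite=...)` are real Hail calls.

```python
def annotate_and_checkpoint(ht, checkpoint_path, **annotations):
    """Add annotations to a Hail Table, then checkpoint before downstream use.

    Hypothetical helper. The returned table is read back from
    checkpoint_path, so a downstream function that scans it several
    times does not re-run the annotation step.
    """
    annotated = ht.annotate(**annotations)
    return annotated.checkpoint(checkpoint_path, overwrite=True)
```

The result would then be passed to the downstream function in place of the persisted table.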
