Persist vs checkpoint

Hi hail team,

I have a question about using persist. I don’t fully understand when to use persist over checkpoint, and I tend to use checkpoint when I want to avoid redundant computation. Do you have documentation on how to select one over the other and when either one is more appropriate?

Thanks!

1 Like

checkpoint is almost always going to be better. The one case where persist() may be preferred is if you’re writing loops in python that iteratively query a datasets dozens or hundreds of times, in which case a persisted dataset may be slightly faster because parts of it will be in memory as well as disk.

1 Like

Thank you! That helps. I have a specific follow-up question. Our pipeline adds some annotations to a Table and runs persist before calling this function: https://github.com/broadinstitute/gnomad_methods/blob/17bb157ea4703fea899852b55454c6a38bd7bcec/gnomad/variant_qc/random_forest.py#L162. Would a checkpoint be better for this step, or is persist better?

Checkpoint seems like the right thing here.

1 Like