Hi hail team,
I have a question about using
persist. I don’t fully understand when to use
checkpoint, and I tend to use
checkpoint when I want to avoid redundant computation. Do you have documentation on how to select one over the other and when either one is more appropriate?
checkpoint is almost always going to be better. The one case where persist() may be preferred is if you’re writing loops in python that iteratively query a datasets dozens or hundreds of times, in which case a persisted dataset may be slightly faster because parts of it will be in memory as well as disk.
Thank you! That helps. I have a specific follow-up question. Our pipeline adds some annotations to a Table and runs
persist before calling this function: https://github.com/broadinstitute/gnomad_methods/blob/17bb157ea4703fea899852b55454c6a38bd7bcec/gnomad/variant_qc/random_forest.py#L162. Would a checkpoint be better for this step, or is persist better?
Checkpoint seems like the right thing here.