I have a question about using persist. I don’t fully understand when to use persist over checkpoint, and I tend to use checkpoint when I want to avoid redundant computation. Do you have documentation on how to select one over the other and when either one is more appropriate?
checkpoint is almost always going to be better. The one case where persist() may be preferred is if you’re writing loops in python that iteratively query a datasets dozens or hundreds of times, in which case a persisted dataset may be slightly faster because parts of it will be in memory as well as disk.