Persist is a lazy operation as well. Running annotated.persist() won’t do any computation. When you do the annotated.write, it should do the computation, hold the result of vepping in memory if it can, and then reuse it during export_elasticsearch. At least, that’s my understanding.
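To make the lazy behaviour described above concrete, here is a toy model in plain Python. It is not Hail's or Spark's implementation: the LazyDataset class, its write and export methods, and expensive_annotation are all hypothetical names invented for illustration. It only shows the shape of the idea: persist() records an intent, the first action runs the computation, and the second action reuses the held result.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Hail's implementation)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the data
        self._persist = False     # flipped by persist(); nothing is computed yet
        self._cached = None       # filled in by the first action, if persisted

    def persist(self):
        # Lazy: only records the intent to keep the result around.
        self._persist = True
        return self

    def _materialize(self):
        # Reuse the held result when one exists, otherwise compute it now.
        if self._persist and self._cached is not None:
            return self._cached
        result = self._compute()
        if self._persist:
            self._cached = result
        return result

    def write(self):
        # First action: triggers the computation.
        return list(self._materialize())

    def export(self):
        # Second action: reuses the persisted result instead of recomputing.
        return [x * 2 for x in self._materialize()]


calls = []

def expensive_annotation():
    # Stands in for an expensive step such as running VEP.
    calls.append("ran")
    return [1, 2, 3]

annotated = LazyDataset(expensive_annotation).persist()
calls_after_persist = list(calls)   # still empty: persist() did no work
written = annotated.write()
exported = annotated.export()
```

In this model, dropping the persist() call would make expensive_annotation run twice, once per action, which is the duplicate work the thread is trying to avoid.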
Thanks. Still wrapping my head around some of these semantics - and I am gathering there is a difference between Spark’s persisting of RDDs/etc and Hail’s persisting of (Matrix)Tables/etc.
What exactly does annotated.persist() persist then? I had been treating cache and persist as effectful computations that force invocation of the computation stack. Part of my understanding is from the docs:
The MatrixTable.persist() and MatrixTable.cache() methods store the current dataset on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines.
I recommend avoiding persist and cache entirely. They interact poorly with preemptible nodes (in GCP) and, as you’ve discovered, we don’t pay much attention to them. It appears there’s a bug; I’ll have someone fix it and cut a new release tomorrow.
In general, just use write/read or the convenience function checkpoint. That’s what we recommend to avoid duplicate work.
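A minimal sketch of the checkpoint pattern recommended above, under stated assumptions. The helper checkpoint_then_run and the FakeDataset class are hypothetical and exist only so the example runs without a Hail cluster. The one thing borrowed from Hail is the shape of checkpoint(path, overwrite=...), which writes the dataset and returns it read back from that path.

```python
def checkpoint_then_run(dataset, path, steps):
    """Materialize `dataset` once, then run every step on the saved copy.

    `dataset` is anything with a Hail-style checkpoint(path, overwrite=...)
    method that writes to `path` and returns the dataset read back from it.
    """
    saved = dataset.checkpoint(path, overwrite=True)
    return [step(saved) for step in steps]


class FakeDataset:
    """Hypothetical stand-in so the sketch runs without Hail installed."""

    def __init__(self, rows):
        self.rows = rows
        self.writes = []   # records every path this dataset was written to

    def checkpoint(self, path, overwrite=False):
        # Pretend to write to `path`, then hand back a fresh materialized copy.
        self.writes.append(path)
        return FakeDataset(list(self.rows))


source = FakeDataset([1, 2, 3])
results = checkpoint_then_run(
    source,
    "/tmp/annotated.mt",   # illustrative path, not from the thread
    [lambda d: len(d.rows), lambda d: sum(d.rows)],
)
```

With a real MatrixTable the same idea would be something like annotated = annotated.checkpoint(path, overwrite=True) followed by the export, so that both steps read the written copy rather than re-running the upstream annotation.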
Got it - thanks for the update! I’ve changed our pipeline to read/write and will be using that going forward.
Turns out persist was broken: https://github.com/hail-is/hail/pull/8175
That does seem more consistent with what I was seeing. Thanks, @danking and @johnc1231, for looking into this even if it’s not Broad/GCP SOP. To clarify: even with these changes in, as a best practice I should still consider persistence in Hail unreliable, and use the read/write/checkpoint workaround? Especially with things like a split_multi or repartition.
We are actually running a persistent HDFS data lake on AWS, having given up on EMR and the S3 “filesystem,” and don’t use pre-emptible datanodes. Since we’re not trashing the filesystem between runs and our jobs tend to be on fairly small datasets (but indexing into very large datasets), being able to reliably persist in memory is a nice-to-have for us. We will keep using read/write for short-term jobs, but I will try persisting again when I get a chance to experiment.
IMHO, cache is fine if you have a good reason for it. Cache only uses RAM.
Persist may use hard drives to ensure data is not recomputed. If we have to go to disk anyway, we prefer to write to cloud storage for a couple reasons:
- the checkpoint path uses our serialization routines, which we find perform better (and continue to improve, as a current focus of our work is improving speed and size of serialization)
- we prefer to use unreliable (preemptible) compute nodes because they’re 5x cheaper on GCP, ergo the filesystem is unreliable.
If you’re running on reliable nodes, the latter point is less of a concern for you. This is particularly true if you’re already comfortable operating Spark. We’ve found it difficult to understand how caching and persisting affect pipelines that are failing with memory errors.