Parallel Hail Tasks

Persist is a lazy operation as well. Running annotated.persist() won’t do any computation on its own. When you call annotated.write, Hail should do the computation, hold the result of running VEP in memory if it can, and then reuse it during export_elasticsearch. At least, that’s my understanding.
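
Something like the following illustrates that order of operations; the paths, VEP config, and the exact hl.vep arguments are placeholders, not taken from the pipeline above:

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/dataset.mt')    # hypothetical input path
annotated = hl.vep(mt, 'gs://my-bucket/vep-config.json')  # lazy: builds the plan, runs nothing yet

annotated = annotated.persist()                           # also lazy: just marks the plan for caching

# write() forces the computation; the VEP result should stay in memory if it fits,
# so later actions (e.g. the Elasticsearch export) can reuse it instead of re-running VEP
annotated.write('gs://my-bucket/annotated.mt', overwrite=True)
```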

Thanks. Still wrapping my head around some of these semantics - and I am gathering there is a difference between Spark’s persisting of RDDs/etc and Hail’s persisting of (Matrix)Tables/etc.

What exactly does annotated.persist() persist then? I had been treating cache and persist as effectful computations that force invocation of the computation stack. Part of my understanding is from the docs:

The MatrixTable.persist() and MatrixTable.cache() methods store the current dataset on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines.

I recommend avoiding persist and cache entirely. They interact poorly with preemptible nodes (in GCP) and, as you’ve discovered, we don’t pay much attention to them. It appears there’s a bug; I’ll have someone fix it and cut a new release tomorrow.

In general, just use write/read or the convenience function checkpoint. That’s what we recommend to avoid duplicate work.
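
For example, a checkpoint is shorthand for a write followed by a read of the same file (the path here is just a placeholder):

```python
annotated = annotated.checkpoint('gs://my-bucket/tmp/annotated.mt', overwrite=True)

# equivalent to:
# annotated.write('gs://my-bucket/tmp/annotated.mt', overwrite=True)
# annotated = hl.read_matrix_table('gs://my-bucket/tmp/annotated.mt')
```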

Got it - thanks for the update! I’ve changed our pipeline to read/write and will be using that going forward.

Turns out persist was broken: https://github.com/hail-is/hail/pull/8175

That does seem more consistent with what I was seeing :slight_smile: Thanks, @danking and @johnc1231, for looking into this even if it’s not Broad/GCP SOP. To clarify: even with these changes in, as a best practice should I still consider persistence in Hail unreliable and use the read/write/checkpoint workaround? Especially with things like a split_multi or repartition.

We are actually running a persistent HDFS data lake on AWS, having given up on EMR and the S3 “filesystem,” and don’t use preemptible datanodes; since we’re not trashing the filesystem between jobs, and our jobs tend to be on fairly small datasets (but indexing into very large datasets), being able to reliably persist in memory is a nice-to-have for us. We will keep using read/write for short-term jobs, but I will try persisting again when I get a chance to experiment.

IMHO, cache is fine if you have a good reason for it. Cache only uses RAM.

Persist may use hard drives to ensure data is not recomputed. If we have to go to disk anyway, we prefer to write to cloud storage for a couple of reasons (sketched after this list):

  • the checkpoint path uses our own serialization routines, which we find perform better (and which continue to improve, since a current focus of our work is the speed and size of serialization)
  • we prefer to use unreliable (preemptible) compute nodes because they’re 5x cheaper on GCP, ergo the cluster’s local filesystem is unreliable as well.
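
A rough sketch of the distinction, assuming Spark-style storage level names and a placeholder checkpoint path:

```python
mt = mt.cache()                     # keep blocks in memory only (MEMORY_ONLY)
mt = mt.persist('MEMORY_AND_DISK')  # may also spill blocks to the nodes' local disks

# preferred when going to disk anyway: checkpoint to cloud storage
mt = mt.checkpoint('gs://my-bucket/tmp/stage1.mt', overwrite=True)
```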

If you’re running on reliable nodes, the latter point is less of a concern for you, particularly if you’re already comfortable operating Spark. That said, we’ve found it difficult to reason about how caching and persisting affect pipelines that are failing with memory errors.