Parallel Hail Tasks


A general question: my understanding is that while each individual operation is parallelized, separate Hail functions run serially. Is there a way to invoke two read-only Hail functions in parallel?

E.g. can we write a MatrixTable to disk and export it to Elasticsearch at the same time?


Nope, unfortunately not. However, you can probably speed things up by writing this workflow as:

  1. write as native Table / MatrixTable
  2. read that object
  3. export to Elasticsearch
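
The three steps above can be sketched with a stdlib-only toy model (this is not Hail API; `expensive_transform` and the JSON file stand in for a real pipeline, a native write, and the export target):

```python
import json, os, tempfile

# Stand-in for an expensive upstream transform (e.g. QC or annotation).
def expensive_transform(rows):
    return [{"id": r, "score": r * 2} for r in rows]

table = expensive_transform(range(3))

# 1. write as a native file (stand-in for MatrixTable.write)
path = os.path.join(tempfile.mkdtemp(), "table.json")
with open(path, "w") as f:
    json.dump(table, f)

# 2. read that object back; the transform does not re-run
with open(path) as f:
    materialized = json.load(f)

# 3. "export" the materialized rows (stand-in for Elasticsearch)
exported = list(materialized)
assert exported[1] == {"id": 1, "score": 2}
```

The point of the detour through disk is that step 3 starts from materialized data instead of replaying the upstream computation.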


Thanks Tim. I was trying to interleave both I/O operations so they run in parallel, since each can take a long time and neither is compute-bound. Your suggestion is still a serial process, right? Wouldn’t that take longer than what we currently have, which calls write and then export on the same mt?


Yes, still serial.

But that will be much faster than calling write and then export directly on an expensive pipeline like:

mt = hl.variant_qc(mt)


Hail builds a computational graph and executes it lazily, which means that in the example above, variant_qc gets run twice (once for the write, once for the export).

If you have expensive upstream operations, it will almost always be faster to write/read/export.
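
A toy, stdlib-only sketch of this point (it models Hail's lazy plans with plain Python functions; `expensive_qc` is a made-up stand-in, not Hail code):

```python
# Toy model of lazy execution: a "plan" is a function, and every action
# that forces the plan re-runs the whole upstream pipeline.
calls = {"qc": 0}

def expensive_qc(data):
    calls["qc"] += 1              # count how often the upstream step runs
    return [x * 2 for x in data]

raw = [1, 2, 3]
plan = lambda: expensive_qc(raw)  # lazy: nothing has been computed yet

plan()                            # action 1, analogous to write
plan()                            # action 2, analogous to export
assert calls["qc"] == 2           # the upstream step ran twice

calls["qc"] = 0
materialized = plan()             # write/read: run the pipeline once,
plan2 = lambda: materialized      # then reuse the stored result
plan2()                           # export now touches only materialized data
assert calls["qc"] == 1           # the upstream step ran only once
```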


Thanks Tim, that makes sense, very helpful. I’ll double check if that’s what’s going on.


Hi guys, but if you do

vds = hc.import_vcf("test.vcf.gz")    # Hail 0.1 API
vds = vds.vep("vep.config")           # annotate with VEP
kt = vds.variants_table()
kt.export_elasticsearch("localhost", 9200, "test", "test", 100, config={})

this only runs VEP once, right?


nope, twice.


errr wait yeah that’s right


I think VEP is special and only gets run once, since its results are persisted.
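
A small stdlib-only sketch of what persisting buys (a toy model, not Hail's implementation; `vep_annotate` is a hypothetical stand-in for the expensive VEP step):

```python
runs = {"vep": 0}

def vep_annotate(variants):
    # Stand-in for the expensive VEP step (not the real annotator).
    runs["vep"] += 1
    return {v: "annotated:" + v for v in variants}

class Persisted:
    """Evaluate once, cache the result, serve later actions from the cache."""
    def __init__(self, thunk):
        self._thunk = thunk
        self._value = None
        self._done = False

    def force(self):
        if not self._done:
            self._value = self._thunk()
            self._done = True
        return self._value

ann = Persisted(lambda: vep_annotate(["1:100A>T", "2:200C>G"]))
first = ann.force()   # first action: runs the VEP stand-in
second = ann.force()  # second action: served from the cache
assert runs["vep"] == 1
```

Without the cache, each downstream action would force the annotation step again, which is exactly the recomputation discussed above.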


Great, I see the persist here.


Is there a way to tell how many times spark executed a given step?


Oops, this is actually totally wrong. See here for a fix.

Hmm, this isn’t super easy. We can’t really do that just in Spark, but we could add some tooling on our side.
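
There's no built-in counter for this, but one user-side option is to wrap a stage in a counting decorator before handing it to the pipeline. A stdlib-only sketch (the decorator name and stage are made up for illustration):

```python
import functools

def count_runs(fn):
    """Wrap a pipeline stage so we can ask how many times it executed."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.runs += 1
        return fn(*args, **kwargs)
    wrapper.runs = 0
    return wrapper

@count_runs
def stage(xs):
    # Stand-in for an expensive step whose executions we want to count.
    return [x + 1 for x in xs]

stage([1, 2])
stage([3])
assert stage.runs == 2
```

This only counts driver-side invocations of the stage, not Spark-internal task retries, which is why proper tooling would have to live inside the engine.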


Awesome, glad a fix came out of this. Do you guys know if this non-persisting was in 0.1? If it wasn’t persisting, were we re-computing that multiple times?


This wasn’t an issue in 0.1; it was probably introduced only ~6 weeks ago.


Thanks Tim.

P.S. I found these Spark UI visualizations,

but it looks somewhat complicated to get access to them on Dataproc.