Parallel Hail Tasks


#1

A general question, but my understanding is that Hail functions, while internally parallelized, are executed serially with respect to one another. Is there a way to invoke two read-only Hail functions in parallel?

E.g., can we write an mt to disk and export it to elasticsearch at the same time?


#2

Nope, unfortunately not. However, you can probably speed things up by writing this workflow as:

  1. write as native Table / MatrixTable
  2. read that object
  3. export to elasticsearch
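
In Hail 0.2 syntax, the three steps above would look roughly like this (an untested pseudocode sketch; the path is a placeholder and the export arguments are elided):

```
mt = hl.variant_qc(mt)                      # expensive upstream work
mt.write('checkpoint.mt')                   # 1. write as native MatrixTable
mt = hl.read_matrix_table('checkpoint.mt')  # 2. read it back, dropping the upstream graph
hl.export_elasticsearch(...)                # 3. export; upstream ops are not re-run
```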

#3

Thanks Tim. I was trying to interleave the two I/O operations so they could run in parallel, since both can take a long time and neither is compute bound. Your suggestion is still a serial process, right? Wouldn’t that take longer than what we currently have, which is calling write and then export on the same mt?


#4

Yes, still serial.

But that will be much faster than:

mt = hl.variant_qc(mt)

mt.write(...)
hl.export_elasticsearch(mt...)

Hail builds a computational graph and executes it lazily, which means that in the above example, the variant_qc is getting run twice (once for the write, once for the export).

If you have expensive upstream operations, it will almost always be faster to write/read/export.
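
The effect of lazy execution can be illustrated with a toy model (pure Python, not the Hail API; all names here are hypothetical): each action replays the entire upstream graph unless the result was materialized first.

```python
# Toy model of lazy evaluation: each "action" re-executes the whole
# upstream graph unless the data was materialized beforehand.
calls = {"variant_qc": 0}

def variant_qc(data):
    calls["variant_qc"] += 1          # count how often the expensive step runs
    return [x * 2 for x in data]      # stand-in for an expensive transform

class LazyTable:
    def __init__(self, source, ops=()):
        self.source, self.ops = source, tuple(ops)

    def map(self, fn):                # builds the graph; nothing runs yet
        return LazyTable(self.source, self.ops + (fn,))

    def collect(self):                # an "action": executes the whole graph
        data = list(self.source)
        for fn in self.ops:
            data = fn(data)
        return data

mt = LazyTable([1, 2, 3]).map(variant_qc)
written = mt.collect()                # "write": first action, qc runs once
exported = mt.collect()               # "export" on the same lazy mt: qc runs again
assert calls["variant_qc"] == 2

calls["variant_qc"] = 0
materialized = LazyTable(mt.collect())  # write then read back: qc runs once
materialized.collect()                  # "export" now just replays stored data
assert calls["variant_qc"] == 1
```

This is why the write/read/export pattern wins once the upstream work is expensive: the second action replays stored data instead of the graph.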


#5

Thanks Tim, that makes sense, very helpful. I’ll double check if that’s what’s going on.


#6

Hi guys, but if you do

vds = hc.import_vcf("test.vcf.gz")
vds = vds.vep("vep.config")
vds.write("test.vds")
kt = vds.variants_table()
kt.export_elasticsearch("localhost", 9200, "test", "test", 100, config={})

this only runs VEP once, right?


#7

Nope, twice.


#8

errr wait yeah that’s right


#9

I think VEP is special and only gets run once, since its results are persisted.


#10

great. I see the persist here


#11

Is there a way to tell how many times spark executed a given step?


#12

oops, this is actually totally wrong. See https://github.com/hail-is/hail/pull/5416 for a fix.

Hmm, this isn’t super easy. We can’t really do that just in Spark, but could add some tooling on our side.


#13

Awesome, glad a fix came out of this. Do you guys know if this non-persisting was in 0.1? If it wasn’t persisting, were we re-computing that multiple times?


#14

This wasn’t an issue in 0.1. It was probably only introduced ~6 weeks ago.


#15

Thanks Tim.

P.S. I found these Spark UI visualizations

but it looks somewhat complicated to get access to them on Dataproc.