Parallel Hail Tasks

knguyen142 · February 14, 2019, 9:57pm

A general question: my understanding is that Hail functions run serially, one after another, even though each operation is itself parallelized. Is there a way to invoke two read-only Hail functions in parallel?

E.g., can we write an mt to disk and export it to Elasticsearch at the same time?

tpoterba · February 14, 2019, 10:02pm

Nope, unfortunately not. However, you can probably speed things up by writing this workflow as:

1. write as native Table / MatrixTable
2. read that object
3. export to elasticsearch

knguyen142 · February 15, 2019, 2:32pm

Thanks Tim. I was trying to interleave both IO operations so they can be done in parallel, since both can take a long time and are not computation bound. Your suggestion is still a serial process, right? Wouldn’t that take longer than what we currently have, which is calling write and then export on the same mt?

tpoterba · February 15, 2019, 2:34pm

Yes, still serial.

But that will be much faster than:

mt = hl.variant_qc(mt)

mt.write(...)
hl.export_elasticsearch(mt...)

Hail builds a computational graph and executes it lazily, which means that in the above example, the variant_qc is getting run twice (once for write, once for export).

If you have expensive upstream operations, it will almost always be faster to write/read/export.
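
To make the recomputation concrete, here is a minimal pure-Python sketch of lazy evaluation. It is a toy model under stated assumptions, not Hail's implementation: the `Lazy` class, `expensive_variant_qc`, `read`, and the dict-backed `store` are all hypothetical stand-ins for a table, an expensive upstream step, reading back, and the disk.

```python
# Toy model of a lazily evaluated pipeline. Nothing here is Hail's real API;
# every name is a hypothetical stand-in used only for illustration.

calls = {"variant_qc": 0}


def expensive_variant_qc(rows):
    """Stand-in for an expensive upstream step; counts how often it runs."""
    calls["variant_qc"] += 1
    return [r * 2 for r in rows]


class Lazy:
    """Holds a recipe (thunk) instead of data; each action re-runs the recipe."""

    def __init__(self, thunk):
        self._thunk = thunk

    def map_rows(self, fn):
        # Builds a new recipe; nothing is computed yet.
        return Lazy(lambda: fn(self._thunk()))

    def write(self, store, key):
        # Action: evaluates the whole recipe and saves the result.
        store[key] = self._thunk()

    def export(self, sink):
        # Action: evaluates the whole recipe again from scratch.
        sink.extend(self._thunk())


def read(store, key):
    """Returns a lazy table whose recipe only loads the saved rows."""
    rows = store[key]
    return Lazy(lambda: list(rows))


def run_direct():
    """write and export on the same lazy object."""
    calls["variant_qc"] = 0
    store, sink = {}, []
    t = Lazy(lambda: [1, 2, 3]).map_rows(expensive_variant_qc)
    t.write(store, "out")
    t.export(sink)
    return calls["variant_qc"]


def run_write_read_export():
    """write, read the written result back, then export from that."""
    calls["variant_qc"] = 0
    store, sink = {}, []
    t = Lazy(lambda: [1, 2, 3]).map_rows(expensive_variant_qc)
    t.write(store, "out")
    t2 = read(store, "out")
    t2.export(sink)
    return calls["variant_qc"]
```

In this toy, `run_direct()` runs the expensive step once per action, while `run_write_read_export()` runs it only for the write, because the export starts from the saved rows.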

knguyen142 · February 15, 2019, 2:40pm

Thanks Tim, that makes sense, very helpful. I’ll double check if that’s what’s going on.

bw2 · February 22, 2019, 12:04am

Hi guys, but if you do

vds = hc.import_vcf("test.vcf.gz")
vds = vds.vep("vep.config")
vds.write("test.vds")
kt = vds.variants_table()
kt.export_elasticsearch("localhost", 9200, "test", "test", 100, config={})

this only runs VEP once, right?

tpoterba · February 22, 2019, 1:37am

nope, twice.

tpoterba · February 22, 2019, 1:37am

errr wait yeah that’s right

tpoterba · February 22, 2019, 1:38am

I think VEP is special and only gets run once, since its results are persisted.

bw2 · February 22, 2019, 2:56am

great. I see the persist here

https://github.com/hail-is/hail/blob/master/hail/src/main/scala/is/hail/methods/VEP.scala#L242


    it.map { case (v, vep) =>
      rvb.start(vepRowType)
      rvb.startStruct()
      rvb.addAnnotation(vepRowType.types(0).virtualType, v.asInstanceOf[Row].get(0))
      rvb.addAnnotation(vepRowType.types(1).virtualType, v.asInstanceOf[Row].get(1))
      rvb.addAnnotation(vepRowType.types(2).virtualType, vep)
      rvb.endStruct()
      rv.setOffset(rvb.end())
      rv
    }
  }).persist(StorageLevel.MEMORY_AND_DISK)


val (globalValue, globalType) =
  if (csq)
    (Row(csqHeader.getOrElse("")), TStruct("vep_csq_header" -> TString()))
  else
    (Row(), TStruct())


TableValue(
  TableType(vepRowType.virtualType, FastIndexedSeq("locus", "alleles"), globalType),
  BroadcastRow(globalValue, globalType, HailContext.get.sc),

bw2 · February 22, 2019, 4:59pm

Is there a way to tell how many times spark executed a given step?

tpoterba · February 22, 2019, 5:31pm

Oops, what I said about VEP results being persisted is actually totally wrong. See [bugfix] Fix persisting of vep, logistic regression, poisson regression by tpoterba · Pull Request #5416 · hail-is/hail for a fix.

As for telling how many times Spark executed a given step: hmm, this isn’t super easy. We can’t really do that just in Spark, but could add some tooling on our side.

knguyen142 · February 22, 2019, 6:45pm

Awesome, glad a fix came out of this. Do you guys know if this non-persisting was also present in 0.1? If it wasn’t persisting, then we were re-computing VEP multiple times?

tpoterba · February 22, 2019, 7:22pm

this wasn’t an issue in 0.1. It was probably only introduced ~6 weeks ago

bw2 · February 22, 2019, 8:10pm

Thanks Tim.

P.S. I found these Spark UI visualizations, but it looks somewhat complicated to get access to them on Dataproc.

nicklecompteBCH · February 24, 2020, 10:52pm

Sorry to open up an old conversation: is VEP no longer persisting on version 0.2.32? I am possibly confused on the semantics here.

The call to persist appears to be gone in the latest VEP.scala, and looking at my executor stderr logs VEP is definitely being run twice between writing to disk and writing to Elasticsearch, despite manually persisting the MatrixTable in between.

It’s not a big deal to rewrite the pipeline, just not sure what’s expected. It does seem that the behavior has changed.

johnc1231 · February 25, 2020, 9:49pm

Maybe it’s not working as intended, but the idea was to move the persist from Scala to python: https://github.com/hail-is/hail/pull/5416

nicklecompteBCH · February 26, 2020, 2:28pm

Great, thanks for the clarification on the expected behavior.

It does seem that the MatrixTable.persist method does not actually persist VEP; I have to do MatrixTable.write and read from the written table in order to use VEP-annotated variants downstream without rerunning VEP.

johnc1231 · February 26, 2020, 3:11pm

I’ll look into this. Just to clarify, you have code like:

vepped = hl.methods.vep(mt)
vepped = vepped.persist() # Important to save the persisted thing to a variable
vepped.write(....)
vepped.export_elasticsearch(...) #I forget what the elasticsearch method is called
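
The comment about saving the persisted object to a variable can be illustrated with a small pure-Python sketch. This is a toy model, not Hail's implementation: `ToyTable`, `collect`, and `count_runs` are hypothetical names, and the only assumption carried over is that `persist()` returns a new object rather than modifying the one it is called on. It says nothing about whether persisting actually prevents VEP from re-running in a given Hail version.

```python
# Toy model: persist() returns a NEW cached object and leaves the original
# untouched. All names are hypothetical; this is not Hail's real API.


class ToyTable:
    def __init__(self, thunk):
        self._thunk = thunk  # recipe that produces the rows

    def collect(self):
        # Action: runs the recipe.
        return self._thunk()

    def persist(self):
        cache = []

        def cached():
            # Run the original recipe only on first use, then reuse the result.
            if not cache:
                cache.append(self._thunk())
            return cache[0]

        return ToyTable(cached)


def count_runs(reassign):
    """Count how often the upstream step runs across two actions."""
    runs = []

    def annotate():
        runs.append(1)  # record one execution of the upstream step
        return ["row"]

    t = ToyTable(annotate)
    if reassign:
        t = t.persist()  # keep the persisted object
    else:
        t.persist()  # result discarded; t is still the uncached table
    t.collect()
    t.collect()
    return len(runs)
```

In this toy, `count_runs(True)` executes the upstream step once, while `count_runs(False)` executes it for every action because the cached object was thrown away.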

nicklecompteBCH · February 26, 2020, 9:30pm

Before:

vepped = hl.methods.vep(mt)
vepped = vepped.persist()
annotated = vepped.index_rows(some stuff)
annotated = annotated.persist() # VEP runs again during this step - again, I thought it wasn't supposed to
annotated.write(path)
final = hl.read_matrix_table(path)
hl.export_elasticsearch(annotated)

My fix to get around VEP recomputing:

vepped = hl.methods.vep(mt)
vepped = vepped.persist()
vepped.write('tmpmt.mt')
vepped = hl.read_matrix_table('tmpmt.mt')
annotated = vepped.index_rows(some stuff)
annotated = annotated.persist()
annotated.write(path)
final = hl.read_matrix_table(path)
hl.export_elasticsearch(annotated)

Just to assuage my own confusion: I thought the expected behavior was that VEP would not re-run after persisting.
