A general question: my understanding is that Hail functions run serially, even though each operation is parallelized internally. Is there a way to invoke two read-only Hail functions in parallel?
E.g., can we write a MatrixTable to disk and export it to Elasticsearch at the same time?
Thanks Tim. I was trying to interleave both IO operations so they can run in parallel, since both can take a long time and neither is compute-bound. Your suggestion is still a serial process, right? Wouldn’t that take longer than what we currently have, which is calling write and then export on the same mt?
Hail builds a computational graph and executes it lazily, which means that in the above example, variant_qc runs twice (once for write, once for export).
If you have expensive upstream operations, it will almost always be faster to write the result to disk, read it back, and export from the copy you read back.
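To make the point about lazy execution concrete, here is a toy sketch in plain Python. It is not Hail code: `LazyPipeline`, `checkpoint`, and `expensive` are hypothetical names made up for illustration. It only shows why two actions on the same lazy pipeline repeat the upstream work, and why materializing once avoids that.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated table (hypothetical, not Hail's API)."""

    def __init__(self, source, steps=()):
        self.source = source          # zero-argument callable producing input rows
        self.steps = tuple(steps)     # recorded transformations, not yet executed

    def map(self, fn):
        # Record the step; nothing is computed here.
        return LazyPipeline(self.source, self.steps + (fn,))

    def collect(self):
        # An "action": replays every recorded step starting from the source.
        rows = self.source()
        for fn in self.steps:
            rows = [fn(r) for r in rows]
        return rows

    def checkpoint(self):
        # Materialize once, then return a pipeline that starts from the stored rows.
        rows = self.collect()
        return LazyPipeline(lambda: list(rows))


calls = {"expensive": 0}


def expensive(x):
    # Stand-in for a costly upstream operation; counts how often it runs.
    calls["expensive"] += 1
    return x * x


base = LazyPipeline(lambda: [1, 2, 3]).map(expensive)

# Two actions on the same lazy pipeline: the expensive step runs in both.
written = base.collect()
exported = base.collect()
calls_without_checkpoint = calls["expensive"]

# Materialize first, then run the same two actions on the stored result.
calls["expensive"] = 0
saved = base.checkpoint()
written = saved.collect()
exported = saved.collect()
calls_with_checkpoint = calls["expensive"]
```

In this toy, the expensive step runs once per row per action without the checkpoint, and once per row in total with it. The write/read/export pattern plays the role of `checkpoint` here.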
Awesome, glad a fix came out of this. Do you guys know if this non-persisting was in v01? If it wasn’t persisting, then we were re-computing that multiple times?
Sorry to open up an old conversation: is VEP no longer persisting on version 0.2.32? I am possibly confused on the semantics here.
The call to persist appears to be gone in the latest VEP.scala, and looking at my executor stderr logs VEP is definitely being run twice between writing to disk and writing to Elasticsearch, despite manually persisting the MatrixTable in between.
It’s not a big deal to rewrite the pipeline, just not sure what’s expected. It does seem that the behavior has changed.
Great, thanks for the clarification on the expected behavior.
It does seem that the MatrixTable.persist method does not actually persist VEP; I have to do MatrixTable.write and read from the written table in order to use VEP-annotated variants downstream without rerunning VEP.
I’ll look into this. Just to clarify, you have code like:
vepped = hl.methods.vep(mt)
vepped = vepped.persist() # Important to save the persisted thing to a variable
vepped.write(....)
hl.export_elasticsearch(vepped.rows(), ...)  # export_elasticsearch takes a Table, so this exports the rows table
vepped = hl.methods.vep(mt)
vepped = vepped.persist()
annotated = vepped.index_rows(some stuff)
annotated = annotated.persist() # VEP runs again during this step - again, I thought it wasn't supposed to
annotated.write(path)
final = hl.read_matrix_table(path)
hl.export_elasticsearch(annotated)