I am using a platform that may error out while writing the result of an expensive operation. When running matrix_table.write(destination), will previous progress be recognized if the operation is interrupted for any reason? What about if I am writing to a bgen format, with or without the parallel option?
Yeah, this is a huge headache. We’ve taken great pains to automatically detect and retry transient errors, but we’ll never catch them all.
The short answer is that we’re currently working on this feature but it won’t be ready soon. In the meantime, you can:
mt.write(..., stage_locally=True)
This writes the files to the local filesystem of the workers before copying them to the remote destination. It can help when the error frequency is related to the age of the remote connection.
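To make the idea concrete, here is a rough plain-Python sketch of "write locally, then copy". It is not Hail's implementation; `write_staged` is a hypothetical helper, and a temporary directory stands in for remote storage.

```python
import os
import shutil
import tempfile


def write_staged(data: bytes, destination: str) -> None:
    """Write `data` to local scratch space, then copy it to `destination`."""
    with tempfile.TemporaryDirectory() as scratch:
        local_path = os.path.join(scratch, os.path.basename(destination))
        with open(local_path, "wb") as f:
            f.write(data)  # the slow, incremental part happens locally
        # One copy at the end, so the destination is only touched briefly.
        shutil.copyfile(local_path, destination)


dest_dir = tempfile.mkdtemp()  # hypothetical stand-in for remote storage
dest = os.path.join(dest_dir, "part-0")
write_staged(b"example bytes", dest)
```

The point of the pattern is that the connection to the destination only needs to stay healthy for the final copy, not for the whole computation.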
The most advanced users of Hail are probably the gnomAD team. My understanding is they tend to perform just one expensive operation per write. So, for example, they'll do one round of variant QC, write just the variant metadata, then load that data back to perform the next round. The use of a write/read in between expensive rounds of QC limits the cost of transient failures.
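The write/read pattern can be sketched in plain Python as below. This is only an illustration of the idea, not Hail code: `run_stage` and `expensive_round_one` are hypothetical names, and JSON files stand in for the tables you would write and read back (in Hail, roughly `mt.rows().write(path)` followed by `hl.read_table(path)`).

```python
import json
import os
import tempfile


def run_stage(path, compute):
    """Return the stage result, reusing the file at `path` if it already exists."""
    if os.path.exists(path):
        # A previous run finished this stage: load instead of recomputing.
        with open(path) as f:
            return json.load(f)
    result = compute()
    # Write under a temporary name, then rename, so an interrupted write
    # never leaves behind a file that looks complete.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(result, f)
    os.replace(tmp_path, path)
    return result


calls = []


def expensive_round_one():
    # Hypothetical stand-in for one expensive round of QC.
    calls.append("round_one")
    return {"n_variants_kept": 3}


workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "round_one.json")

first = run_stage(checkpoint, expensive_round_one)   # computes and writes
second = run_stage(checkpoint, expensive_round_one)  # reads the checkpoint
```

If the pipeline dies during a later round, rerunning it only repeats the round that failed, because every earlier round is loaded from its checkpoint.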
Can you share some of the errors you’re encountering? We can add them to the list of automatically retried transient network errors.
It was an error with the export_bgen that I posted here: Write Bgen Fatal Error - #4 by danking.
I’ll just write to an intermediate file as it does not sound like writing is a particularly expensive operation.