Executor Lost Failure when writing out a MatrixTable for a WGS pVCF


I’m running a fairly straightforward operation. I read in a WGS pVCF (~150 million variants, ~2k samples), split multi-allelic sites into biallelics, run variant_qc, filter on alternate allele frequency (AAF), and write out the MatrixTable.
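For reference, the pipeline described above looks roughly like this. This is a minimal sketch, not the poster's actual script: the paths, reference genome, and 1% AAF threshold are placeholder assumptions.

```python
import hail as hl

hl.init()

# Read the project VCF (path and reference genome are assumptions)
mt = hl.import_vcf('gs://my-bucket/cohort.vcf.bgz',
                   reference_genome='GRCh38',
                   force_bgz=True)

# Split multi-allelic sites into biallelic rows
mt = hl.split_multi_hts(mt)

# Annotate rows with variant QC metrics (adds the mt.variant_qc struct)
mt = hl.variant_qc(mt)

# Filter on alternate allele frequency; after splitting, AF[1] is the alt AAF.
# The threshold here is illustrative.
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)

# Write out the MatrixTable -- the step that hangs in the report above
mt.write('gs://my-bucket/cohort.filtered.mt', overwrite=True)
```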

While writing out the MatrixTable, Hail hangs without giving me any error message, and CPU usage drops to zero. Looking into the Spark logs, some of the jobs fail with the error:
‘ExecutorLostFailure (executor 9 exited unrelated to the running tasks) Reason: Container marked as failed’
The only other output Hail gives is the message ‘Hail: INFO: Ordering unsorted dataset with shuffle’ while writing out the MatrixTable.

I am running this on GCP with the following configurations:
--master-machine-type n1-standard-8
--worker-machine-type n1-highmem-8 (20 non-preemptible workers; I have tried different VM types)
--properties spark:spark.driver.maxResultSize=8g,spark:spark.executor.memory=4g
Hail version: 0.2.89-38264124ad91
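Assuming the cluster was started with `hailctl dataproc` (which accepts these flags), the full creation command would look something like the following; the cluster name is a placeholder.

```shell
hailctl dataproc start my-cluster \
    --master-machine-type n1-standard-8 \
    --worker-machine-type n1-highmem-8 \
    --num-workers 20 \
    --properties 'spark:spark.driver.maxResultSize=8g,spark:spark.executor.memory=4g'
```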

I’ve run larger WGS files before, so I’m not sure why this error is coming up.

Thank you very much!

It’s hard to comment without the full script and the Hail log file.

What does the Spark progress bar look like? How many total partitions are there, how many successfully complete, and how many are in progress when it hangs?
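To answer the partition-count part of the question, the total can be read directly off the MatrixTable before writing (a one-line sketch; `mt` stands for whatever MatrixTable is being written):

```python
# Total number of partitions Spark will process when writing out `mt`
print(mt.n_partitions())
```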

What versions of Hail did you use for the larger WGS files?

It’s possible whatever issues you have are fixed in the latest version of Hail, 0.2.107. Unfortunately, we don’t have the engineering capacity to deeply investigate performance issues in old versions.

Makes sense. I’ll rerun my script on the latest version.
Previously we were using the same version (0.2.89) as well.
Where would I find the Hail log file? The errors I pasted were from the Spark history server.

The Hail log file should be in the working directory of the Jupyter notebook or Python process. If you’re submitting to a Spark cluster, this is usually the home directory of some user.
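If the default location is hard to track down, the log path can also be set explicitly at initialization via the `log` parameter of `hl.init` (the path below is a placeholder):

```python
import hail as hl

# Write the Hail log to a known location instead of the working directory
hl.init(log='/home/my_user/hail_wgs_qc.log')
```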