"lost node" failures when running hl.experimental.run_combiner()

Hello,

I ran `hl.experimental.run_combiner()` with a branch factor of 100 and a batch size of 100 on 1000 WGS gVCFs in GCP. The run eventually succeeded, but some tasks failed along the way, causing multiple attempts and a long runtime. I used the n1-highmem-32 machine type with spark.executor.memory = 72g and spark.executor.cores = 8; with these Spark settings, each node launched up to 2 executors. Could you give me some advice on how to eliminate these failing tasks? Thank you.
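For reference, the call looked roughly like the sketch below (the paths and reference genome are placeholders, and the keyword names are what I recall from the docs; depending on the Hail version, interval arguments may also need to be passed):

```python
import hail as hl

hl.init()

# Placeholder list; the real run used 1000 WGS gVCFs stored in a GCS bucket.
gvcf_paths = ['gs://my-bucket/gvcfs/sample_{}.g.vcf.gz'.format(i) for i in range(1000)]

hl.experimental.run_combiner(
    gvcf_paths,
    out_file='gs://my-bucket/combined/1000_samples.mt',  # placeholder output path
    tmp_path='gs://my-bucket/tmp/',                      # placeholder temp path
    branch_factor=100,            # the "branch factor value = 100" mentioned above
    batch_size=100,               # the "batch factor = 100" mentioned above
    reference_genome='GRCh38',    # assumption; use whatever the gVCFs were called against
)
```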

  1. Round 1 (1000 gVCFs --> 10 MTs): 234 failed / 250 tasks
    Error 1 : ExecutorLostFailure (executor 30 exited unrelated to the running tasks) Reason: Container marked as failed: container_1594240183345_0002_01_000045 on host: ****. Exit status: -100. Diagnostics: Container released on a lost node.

  2. Round 2 (10 MTs --> 1 MT): 4567 failed / 95623 tasks
    Error 2 : ExecutorLostFailure (executor 78 exited caused by one of the running tasks) Reason: Container from a bad node: container_1594240183345_0002_01_000124 on host: ****. Exit status: 134. Diagnostics: [2020-07-11 08:44:42.110]Exception from container-launch.

    Error 3 : java.io.IOException: Error accessing gs://*****/metadata.json.gz

Best,
Jina

This is quite weird. Can you paste the full stack trace? You really shouldn’t need to be using these high memory settings; there’s almost certainly something wrong in the Hail runtime that we can fix.

Also, we’ve fixed a few memory leaks in the last couple of versions, so make sure you’re on the latest (0.2.49 at the time of posting).
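You can check which version you’re running with:

```python
import hail as hl
print(hl.version())  # should print 0.2.49 or newer; upgrade with `pip install -U hail`
```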

Hello Tim,

  1. Unfortunately, the log file from the 1000 gVCFs run was overwritten by another run of mine with 200 gVCFs. Instead, I have attached the “lost node” error sections from the 200 gVCFs log file.

Attachment: lost node error parts in the 200 gvcfs combiner.txt (6.2 KB)

Please take a look, and let me know if you need more information to diagnose this issue. The full log file is about 200 MB. By the way, I am curious whether these errors are related to memory size.

  2. For your information, with spark.executor.cores = 8, spark.executor.memory = 38g, and spark.yarn.executor.memoryOverhead = 15g, I got a lot of errors of this type when running run_combiner() on the 1000 gVCFs:

ExecutorLostFailure (executor 27 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 53.0 GB of 53 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

But after increasing the executor memory, this type of error went away.
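For context, I set these through the Dataproc cluster properties; a sketch of such a start command (the flags and values here are illustrative, not my exact command) looks like:

```
hailctl dataproc start combiner-cluster \
    --worker-machine-type n1-highmem-32 \
    --properties "spark:spark.executor.cores=8,spark:spark.executor.memory=38g,spark:spark.yarn.executor.memoryOverhead=15g"
```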

  3. In addition, I noticed that the output file sizes from run_combiner() with the same input but different Spark configurations differ. How can I find out whether the output MTs are identical or not?
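One rough check I could think of is something like the sketch below (the .mt paths are placeholders, and `_same` is an internal helper used in Hail’s own tests rather than a documented API), but I am not sure it is the recommended approach:

```python
import hail as hl

mt_a = hl.read_matrix_table('gs://my-bucket/combined/run_config_a.mt')  # placeholder
mt_b = hl.read_matrix_table('gs://my-bucket/combined/run_config_b.mt')  # placeholder

# Cheap sanity checks: same dimensions and same sample IDs.
print(mt_a.count(), mt_b.count())
print(mt_a.cols().anti_join(mt_b.cols()).count())  # 0 if the column keys match

# Full comparison of rows, columns, and entries (internal, undocumented helper).
print(mt_a._same(mt_b))
```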

Thank you for your support.

Best,
Jina

Dear Jina,

Apologies for our late reply.
Regarding 1) and 3), I’ll tag @tpoterba.

We’re glad that your memory issue in 2) was resolved by tweaking the executor memory settings.

Hello Kumar,

Thank you for the reply. Unfortunately, I have not been able to resolve the “lost node” error so far. I would really appreciate any insight from the Hail team.

Best,
Jina

@jinasong,

Sorry for the recent instability in the Hail library. We’re investigating more thorough scale testing practices that will discover these problems before release. We believe your issue might be fixed in Hail 0.2.52.

Hi @danking,

I will retry with the latest version of Hail and keep you posted. Thank you for your support.

Best,
Jina