"lost node" failures when running hl.experimental.run_combiner()

Hello,

I ran ‘hl.experimental.run_combiner()’ with a branch factor of 100 and a batch size of 100 for 1,000 WGS gVCFs on GCP. The run eventually succeeded, but many tasks failed along the way, causing multiple attempts and a long runtime. I used the n1-highmem-32 machine type with spark.executor.memory = 72g and spark.executor.cores = 8; with these Spark settings, each node launched up to 2 executors. Please give me some advice on how to eliminate these task failures (my invocation is sketched below). Thank you.
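For reference, my call looked roughly like the sketch below. The bucket paths and sample naming are placeholders, and the parameter names are from memory for the Hail version I was running, so they may differ slightly:

import hail as hl

hl.init()  # executor memory/cores were set in the cluster's Spark configuration

# Placeholder paths; my real run used 1,000 WGS gVCFs.
gvcfs = ['gs://my-bucket/gvcfs/sample_{:04d}.g.vcf.gz'.format(i) for i in range(1000)]

hl.experimental.run_combiner(
    gvcfs,
    out_file='gs://my-bucket/combined/1000_samples.mt',
    tmp_path='gs://my-bucket/tmp/',
    branch_factor=100,
    batch_size=100,
    reference_genome='GRCh38',
    overwrite=True,
)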

  1. Round 1 (1000 gvcfs --> 10 MT) : 234 failed / 250 tasks
    Error 1 : ExecutorLostFailure (executor 30 exited unrelated to the running tasks) Reason: Container marked as failed: container_1594240183345_0002_01_000045 on host: ****. Exit status: -100. Diagnostics: Container released on a lost node.

  2. Round 2 (10 MT --> 1 MT) : 4567 failed / 95623 tasks
    Error 2 : ExecutorLostFailure (executor 78 exited caused by one of the running tasks) Reason: Container from a bad node: container_1594240183345_0002_01_000124 on host: ****. Exit status: 134. Diagnostics: [2020-07-11 08:44:42.110]Exception from container-launch.

    Error 3 : java.io.IOException: Error accessing gs://*****/metadata.json.gz

Best,
Jina

This is quite weird. Can you paste the full stack trace? You really shouldn’t need to be using these high memory settings; there’s almost certainly something wrong in the Hail runtime that we can fix.

Also, we’ve fixed a few memory leaks in the last couple of versions, so make sure you’re on the latest (0.2.49 at time of posting).

Hello Tim,

  1. Unfortunately, the log file for the 1,000-gVCF run was overwritten by a 200-gVCF case (another run of mine). Instead, I attached the “lost node” error portions from the 200-gVCF log file.

Attachment: lost node error parts In the 200 gvcfs combiner.txt (6.2 KB)

Please review it, and let me know if you need more information to resolve this issue. The full log file is about 200 MB. By the way, I am curious whether these errors are related to memory size.

  2. For your information, with spark.executor.cores = 8, spark.executor.memory = 38g, and spark.executor.memoryOverhead = 15g, I got many errors of this type when running run_combiner for 1,000 gVCFs:

ExecutorLostFailure (executor 27 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 53.0 GB of 53 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

But after increasing the executor memory, this type of error went away.
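For reference, the settings above correspond roughly to these Spark properties. This is only a sketch of how they could be expressed in Python; in practice I supplied them when configuring the cluster rather than in code:

import hail as hl
from pyspark import SparkConf, SparkContext

# Sketch only: these are the property names behind the settings described above.
conf = (SparkConf()
        .set('spark.executor.cores', '8')
        .set('spark.executor.memory', '38g')
        .set('spark.yarn.executor.memoryOverhead', '15g'))

hl.init(sc=SparkContext(conf=conf))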

  3. In addition, I found that the output file sizes from run_combiner() differ when the same input is run with different Spark configurations. How can I check whether the output MTs are identical?
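For example, would a check along these lines be reasonable? The paths are placeholders, and I only know _same from browsing the Hail test suite, so I am not sure it is the intended user-facing way to compare:

import hail as hl

# Hypothetical paths: the two outputs produced with different Spark configurations.
mt_a = hl.read_matrix_table('gs://my-bucket/combined/run_a.mt')
mt_b = hl.read_matrix_table('gs://my-bucket/combined/run_b.mt')

print(mt_a.count(), mt_b.count())   # same (n_rows, n_cols) for both?

# _same is the field-by-field comparison used in Hail's test suite;
# I am not sure whether it is meant for end users.
print(mt_a._same(mt_b))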

Thank you for your support.

Best,
Jina

Dear Jina,

Apologies for our late reply.
For questions 1 and 3, I’ll tag @tpoterba.

We’re glad that your memory issue was resolved by tweaking the executor memory settings.

Hello Kumar,

Thank you for the reply. Unfortunately, I have not been able to resolve the “lost node” error so far. I would really appreciate any insight from the Hail team.

Best,
Jina

@jinasong,

Sorry for the recent instability in the Hail library. We’re investigating more thorough scale testing practices that will discover these problems before release. We believe your issue might be fixed in Hail 0.2.52.

Hi @danking,

I will retry it with the latest version of Hail and keep you posted. Thank you for your support.

Best,
Jina

Hi @danking,

I ran it again with Hail 0.2.54. In this version, I found that run_combiner() requires the ‘use_genome_default_intervals’ argument, so I passed True; my updated call is sketched below. After that, the number of tasks increased dramatically, but I ran into another type of error, shown after the sketch.
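A rough sketch of the updated call (placeholder paths again; compared with my earlier runs, only the use_genome_default_intervals keyword is new):

import hail as hl

# Placeholder paths, same style as before.
gvcfs = ['gs://my-bucket/gvcfs/sample_{:04d}.g.vcf.gz'.format(i) for i in range(1000)]

hl.experimental.run_combiner(
    gvcfs,
    out_file='gs://my-bucket/combined/1000_samples.mt',
    tmp_path='gs://my-bucket/tmp/',
    branch_factor=100,
    batch_size=100,
    use_genome_default_intervals=True,  # the argument 0.2.54 asked for (or explicit intervals)
    reference_genome='GRCh38',
)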

Error message : error reading tabix-indexed file gs://my-project/my-bucket/my-sample.g.vcf.gz: i=0, curOff=386139765735504, expected=386139765735424
at is.hail.io.tabix.TabixLineIterator.next(TabixReader.scala:417)…

I had never seen this type of error in previous versions.

Could you give me an idea of how to resolve this issue? Thank you.

Best,
Jina

We’ve replicated this issue with public data, and are working on a fix.

Sounds great. Looking forward to the good news.

OK, we’ve characterized the problem (it’s a bug in control flow when a VCF line ends on the last byte of a compressed BGZ block). It’s not a trivial one-line fix, so stay tuned.

Hi Tim,

I just saw that the new Hail version 0.2.55 was released. I wonder whether the issue in the run_combiner() function has been resolved in it. Thank you.

Best,
Jina

This is fixed but the fix went in after 0.2.55. We can make a new release today.

Hi Tim,

I updated Hail to version 0.2.56 and tested the run_combiner() function with 100, 1k, and 10k gVCF files (average size: 6 GB). The runs for 100 gVCFs and 1k gVCFs completed successfully, though some subtasks failed and were retried with messages similar to before. The runtime in the new version was faster than in the previous Hail version. Thanks very much for your and your team’s work.

Q1. By the way, the sizes of the output MTs differ from the outputs of the previous Hail version, 0.2.52:
: sparse MT for 100 gVCFs - 317 GB (in v0.2.56) vs. 600 GB (in v0.2.52)
: sparse MT for 1k gVCFs - 2.8 TB (in v0.2.56) vs. 6 TB (in v0.2.52)
Please let me know how I should interpret this.

Q2. In addition, unfortunately, the job for 10k gVCFs failed. The first round of 100 batches, each merging 100 gVCFs into a sparse MT, completed successfully, but the job failed at the start of the second round with the error message below. I would really appreciate it if you could let me know how to resolve this issue.

– Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[*******,DISK]] are bad

I found the 100 sparse MTs generated by the first round of the run_combiner() run in my temp storage.
Is it possible to combine these 100 MTs into 1 MT with another Hail function?
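They seem readable on their own, for example (the temp path below is a placeholder for one of the 100 intermediates):

import hail as hl

# Placeholder path for one of the intermediate sparse MTs from round 1.
mt = hl.read_matrix_table('gs://my-bucket/tmp/combiner-intermediate-000.mt')
print(mt.count())   # quick sanity check that the intermediate is readable
mt.describe()       # inspect the sparse-MT schema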

Thank you.
-Jina

Please ignore my first question; the output size is the same as before. Sorry about that. I am looking forward to your advice on successfully combining the 10k gVCFs, as mentioned in my second question. Thank you.

-Jina