"lost node" failures when running hl.experimental.run_combiner()

Hello,

I ran ‘hl.experimental.run_combiner()’ with a branch factor of 100 and a batch size of 100 for 1,000 WGS gVCFs on GCP. The run eventually succeeded, but many tasks failed along the way, causing multiple attempts and a long runtime. I used the n1-highmem-32 machine type with spark.executor.memory = 72g and spark.executor.cores = 8; with these Spark settings, each node launched up to 2 executors. Please give me some advice on how to eliminate these task failures (my invocation is sketched below). Thank you.
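For reference, my call looked roughly like the sketch below. The bucket paths and sample naming are placeholders, and the parameter names are from memory for the Hail version I was running, so they may differ slightly:

import hail as hl

hl.init()  # executor memory/cores were set in the cluster's Spark configuration

# Placeholder paths; my real run used 1,000 WGS gVCFs.
gvcfs = ['gs://my-bucket/gvcfs/sample_{:04d}.g.vcf.gz'.format(i) for i in range(1000)]

hl.experimental.run_combiner(
    gvcfs,
    out_file='gs://my-bucket/combined/1000_samples.mt',
    tmp_path='gs://my-bucket/tmp/',
    branch_factor=100,
    batch_size=100,
    reference_genome='GRCh38',
    overwrite=True,
)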

  1. Round 1 (1000 gvcfs --> 10 MT) : 234 failed / 250 tasks
    Error 1 : ExecutorLostFailure (executor 30 exited unrelated to the running tasks) Reason: Container marked as failed: container_1594240183345_0002_01_000045 on host: ****. Exit status: -100. Diagnostics: Container released on a lost node.

  2. Round 2 (10 MT --> 1 MT) : 4567 failed / 95623 tasks
    Error 2 : ExecutorLostFailure (executor 78 exited caused by one of the running tasks) Reason: Container from a bad node: container_1594240183345_0002_01_000124 on host: ****. Exit status: 134. Diagnostics: [2020-07-11 08:44:42.110]Exception from container-launch.

    Error 3 : java.io.IOException: Error accessing gs://*****/metadata.json.gz

Best,
Jina

This is quite weird. Can you paste the full stack trace? You really shouldn’t need to be using these high memory settings; there’s almost certainly something wrong in the Hail runtime that we can fix.

Also, we’ve fixed a few memory leaks in the last couple of versions, so make sure you’re on the latest (0.2.49 at time of posting).

Hello Tim,

  1. Unfortunately, the log file for the 1,000-gVCF run was overwritten by a 200-gVCF case (another run of mine). Instead, I attached the “lost node” error portions from the 200-gVCF log file.

Attachment: lost node error parts In the 200 gvcfs combiner.txt (6.2 KB)

Please review it, and let me know if you need more information to resolve this issue. The full log file is about 200 MB. By the way, I am curious whether these errors are related to memory size.

  2. For your information, with spark.executor.cores = 8, spark.executor.memory = 38g, and spark.executor.memoryOverhead = 15g, I got many errors of this type when running run_combiner for 1,000 gVCFs:

ExecutorLostFailure (executor 27 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 53.0 GB of 53 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

But after increasing the executor memory, this type of error went away.
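For reference, the settings above correspond roughly to these Spark properties. This is only a sketch of how they could be expressed in Python; in practice I supplied them when configuring the cluster rather than in code:

import hail as hl
from pyspark import SparkConf, SparkContext

# Sketch only: these are the property names behind the settings described above.
conf = (SparkConf()
        .set('spark.executor.cores', '8')
        .set('spark.executor.memory', '38g')
        .set('spark.yarn.executor.memoryOverhead', '15g'))

hl.init(sc=SparkContext(conf=conf))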

  3. In addition, I found that the output file sizes from run_combiner() differ when the same input is run with different Spark configurations. How can I check whether the output MTs are identical?
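For example, would a check along these lines be reasonable? The paths are placeholders, and I only know _same from browsing the Hail test suite, so I am not sure it is the intended user-facing way to compare:

import hail as hl

# Hypothetical paths: the two outputs produced with different Spark configurations.
mt_a = hl.read_matrix_table('gs://my-bucket/combined/run_a.mt')
mt_b = hl.read_matrix_table('gs://my-bucket/combined/run_b.mt')

print(mt_a.count(), mt_b.count())   # same (n_rows, n_cols) for both?

# _same is the field-by-field comparison used in Hail's test suite;
# I am not sure whether it is meant for end users.
print(mt_a._same(mt_b))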

Thank you for your support.

Best,
Jina

Dear Jina,

Apologies for our late reply.
For questions 1 and 3, I’ll tag @tpoterba.

We’re glad that your memory issue was resolved by tweaking the executor memory settings.

Hello Kumar,

Thank you for the reply. Unfortunately, I have not been able to resolve the “lost node” error so far. I would really appreciate any insight from the Hail team.

Best,
Jina

@jinasong,

Sorry for the recent instability in the Hail library. We’re investigating more thorough scale testing practices that will discover these problems before release. We believe your issue might be fixed in Hail 0.2.52.

Hi @danking,

I will retry it with the latest version of Hail and keep you posted. Thank you for your support.

Best,
Jina

Hi @danking,

I ran it again with Hail 0.2.54. In this version, I found that run_combiner() requires the ‘use_genome_default_intervals’ argument, so I passed True; my updated call is sketched below. After that, the number of tasks increased dramatically, but I ran into another type of error, shown after the sketch.
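A rough sketch of the updated call (placeholder paths again; compared with my earlier runs, only the use_genome_default_intervals keyword is new):

import hail as hl

# Placeholder paths, same style as before.
gvcfs = ['gs://my-bucket/gvcfs/sample_{:04d}.g.vcf.gz'.format(i) for i in range(1000)]

hl.experimental.run_combiner(
    gvcfs,
    out_file='gs://my-bucket/combined/1000_samples.mt',
    tmp_path='gs://my-bucket/tmp/',
    branch_factor=100,
    batch_size=100,
    use_genome_default_intervals=True,  # the argument 0.2.54 asked for (or explicit intervals)
    reference_genome='GRCh38',
)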

Error message : error reading tabix-indexed file gs://my-project/my-bucket/my-sample.g.vcf.gz: i=0, curOff=386139765735504, expected=386139765735424
at is.hail.io.tabix.TabixLineIterator.next(TabixReader.scala:417)…

I had never seen this type of error in previous versions.

Could you give me an idea of how to resolve this issue? Thank you.

Best,
Jina

We’ve replicated this issue with public data, and are working on a fix.

Sounds great. Looking forward to the good news.

OK, we’ve characterized the problem (it’s a bug in control flow when a VCF line ends on the last byte of a compressed BGZ block). It’s not a trivial one-line fix, so stay tuned.

Hi Tim,

I just saw that the new Hail version 0.2.55 was released. I wonder whether the issue in the run_combiner() function has been resolved in it. Thank you.

Best,
Jina

This is fixed but the fix went in after 0.2.55. We can make a new release today.

Hi Tim,

I updated Hail to version 0.2.56 and tested the run_combiner() function with 100, 1k, and 10k gVCF files (average size: 6 GB). The runs for 100 gVCFs and 1k gVCFs completed successfully, though some subtasks failed and were retried with messages similar to before. The runtime in the new version was faster than in the previous Hail version. Thanks very much for your and your team’s work.

Q1. By the way, the sizes of the output MTs differ from the outputs of the previous Hail version, 0.2.52:
: sparse MT for 100 gVCFs - 317 GB (in v0.2.56) vs. 600 GB (in v0.2.52)
: sparse MT for 1k gVCFs - 2.8 TB (in v0.2.56) vs. 6 TB (in v0.2.52)
Please let me know how I should interpret this.

Q2. In addition, unfortunately, the job for 10k gVCFs failed. The first round of 100 batches, each merging 100 gVCFs into a sparse MT, completed successfully, but the job failed at the start of the second round with the error message below. I would really appreciate it if you could let me know how to resolve this issue.

– Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[*******,DISK]] are bad

I found the 100 sparse MTs generated by the first round of the run_combiner() run in my temp storage.
Is it possible to combine these 100 MTs into 1 MT with another Hail function?
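They seem readable on their own, for example (the temp path below is a placeholder for one of the 100 intermediates):

import hail as hl

# Placeholder path for one of the intermediate sparse MTs from round 1.
mt = hl.read_matrix_table('gs://my-bucket/tmp/combiner-intermediate-000.mt')
print(mt.count())   # quick sanity check that the intermediate is readable
mt.describe()       # inspect the sparse-MT schema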

Thank you.
-Jina

Please ignore my first question; the output size is the same as before. Sorry about that. I am looking forward to your advice on successfully combining the 10k gVCFs, as mentioned in my second question. Thank you.

-Jina