Dataproc Workers Lost After Intensive Task

We are running a large series of computations: VEP, Hail computations, and a number of joins. After a while, the Dataproc workers seem to be lost from Spark’s perspective.

Symptoms:

  • The Hail output progress stalls
  • The logs (we can provide the full logs if requested) say:
2019-07-10 12:41:33 YarnSchedulerBackend$YarnSchedulerEndpoint: WARN: Attempted to get executor loss reason for executor id 212 at RPC address 10.128.0.37:60906, but got no response. Marking as slave [sic] lost.

Testing:
In the past, the same tasks worked on smaller VCFs of up to 4M variants. This VCF has 64M variants.

Conditions:

  1. 2 non-preemptible workers + 12 preemptibles: stalls at around 3 hours.
  2. 2 non-preemptible workers + 50 preemptibles: stalls later.

What could it be that disconnects the workers? Is it the preemption? Should we try with only non-preemptibles?

This definitely looks like preemption.

If each partition of your pipeline takes a very long time to run, then preemption will kill you since the average time to preemption may be smaller than the average time to complete a task.

Increasing the partitioning may help.
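Something like this, as a minimal sketch (the path and the partition count are placeholders, not recommendations for your exact data):

    import hail as hl

    # Hypothetical path and partition count -- adjust to your pipeline.
    # More partitions means shorter tasks, so each task has a better chance
    # of finishing before a preemption hits the worker running it.
    mt = hl.import_vcf('gs://my-bucket/my-data.vcf.bgz',
                       reference_genome='GRCh38',
                       min_partitions=20000)

    # An already-loaded MatrixTable can also be repartitioned, at the
    # cost of a shuffle.
    mt = mt.repartition(20000)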

Thanks Tim. Any insight as to why we see 0 Spark workers, though? Wouldn’t the new replacement preemptible connect to Spark and at least try to run? We’ll check our partitioning again, but if we load from a VCF, doesn’t it set a pretty optimal default partition size?

The VCF we read has 6766 partitions, 623.34 GiB, and 62892544 variants over 1329 samples. The reference data we annotate with has 54104 partitions, 104.62 GB, and 8899147478 rows.

For single-dataset processing, yes. But joins are extremely sensitive to partitioning mismatches – and Spark doesn’t make it especially easy to deal with this problem.

Loading both datasets with the same number of partitions would probably be a good strategy, I think.
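Roughly along these lines (a sketch only; the paths and the target partition count are hypothetical, and repartition triggers a full shuffle, so it isn’t free):

    import hail as hl

    N_PARTITIONS = 20000  # hypothetical target; tune to your data and cluster

    # Import the VCF with at least N_PARTITIONS partitions.
    mt = hl.import_vcf('gs://my-bucket/my-data.vcf.bgz',
                       reference_genome='GRCh38',
                       min_partitions=N_PARTITIONS)

    # Read the reference annotation table and bring its partitioning in line
    # with the VCF before the join (assuming it is keyed compatibly with the
    # VCF row key, e.g. by locus and alleles).
    ref_ht = hl.read_table('gs://my-bucket/reference_data.ht')
    ref_ht = ref_ht.repartition(N_PARTITIONS)

    # Standard keyed-join annotation.
    mt = mt.annotate_rows(ref_data=ref_ht[mt.row_key])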

Hi Tim, it looks like this might be related to the GRCh38 cluster VEP. We are using hailctl to start the cluster with VEP for GRCh38 and this config: https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/c813e090875de22aac272ea29c1cc8737c58bcd9/hail_scripts/v02/utils/hail_utils.py#L122.
It looks like @bw2 might be seeing the same thing when running VEP on 38?

I’m actually not convinced this is preemption. Try a cluster of non-preemptibles only and see if it persists. With VEP (and it’s especially bad on GRCh38), I definitely see cases of a worker dying and never coming back. The short-term solution is to add more nodes while the job is running (you can try decreasing the number of nodes first and then adding them back, though there is some risk of full Spark collapse when you do so). But we really should figure this out long-term, maybe with Google.

Ben found some problems with VEP GRCh38 over the weekend. Let’s try after that goes in?

Thanks Tim, if you’re referring to https://github.com/hail-is/hail/pull/6632/files, that should be unrelated; he’s also still seeing those issues.

@konradjk you haven’t made a GRCh38 VEP cache yet, right? https://github.com/macarthur-lab/gnomad_hail/commit/33c85f0fb40b4c3561c5447e9d626f7c76277725#diff-c3a345a8b489837c633007c0ad2e76b3R250

I think Ben subsetted down to a smaller 50k-variant set and it still failed. I’m not familiar with the versioning of Hail/VEP, but was something introduced recently?

Correct, I have not made a cache yet, partially because I haven’t been able to run VEP all the way through even the gnomAD genomes! (I do have an exome version that I can put up as a cache at some point)

Since this is the first time Ben and I are running 38 on v02 in a long time, could we be doing something wrong?

VEP kept failing yesterday for both large and small numbers of variants, so I switched to non-preemptible nodes today, and it looks like it ran all the way through on a subset that had failed with preemptibles.

One thing I’m still wondering about is why cluster utilization is so low while it’s running.

Later it went up to ~70% while still running VEP.

We tried disabling LOFTEE (by removing the plugins line from the VEP JSON at run time; let me know if this isn’t how it’s done) and we still get the same thing. So far, it seems to happen only when running VEP on 38. Is there a way to downgrade the VEP version, and if so, how and to which version should we set it? This requires modifying the hailctl dataproc start command, right?

We’re trying to find a solution that does not involve adding more workers or changing them to non-preemptibles, both for cost reasons and because we’d also like to handle fault tolerance.

I ran a separate pipeline last night and saw this same issue. I started my cluster with 2 workers and 20 preemptibles. The pipeline reads in a VCF, runs a validation module, runs liftover when the VCF is on build 38, and then runs other sample QC modules. The pipeline lost all preemptibles, and then the workers, during the liftover portion of the pipeline and never regained them. I was using version 0.2.18-08ec699f0fd4. Unfortunately, even with max-idle set, the cluster remained up.

I tried the liftover script on non-preemptibles only and had the same failure: 0 tasks being worked on, and then all nodes hit an unhealthy state.

@konradjk, this is colony collapse syndrome, right? Could you write a few sentences dumping what you know about that?

We contacted Google support; they looked at the logs and responded with:

Thanks for letting me know the timezone. I took a look at the logs and, specifically in the YARN resource manager log (yarn-yarn-resourcemanager-mw2-m.log), I see that there was high disk utilization just before 3:00 PM eastern. The logs are in UTC as you pointed out:

2019-07-18 18:59:38,509 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node mw2-w-4.c.seqr-project.internal:39123 reported UNHEALTHY with details: 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /hadoop/yarn/nm-local-dir : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/userlogs : used space above threshold of 90.0% ] 

Taking a look at Stackdriver monitoring at 12 PM pacific/3 PM eastern the HDFS utilization goes to 100% and HDFS capacity drops from 319 GiB to 0. This seems to be what's causing your issue. I've attached screenshots of these graphs. 

Try provisioning more disk to these instances or examine the job you're running to see why it has high HDFS utilization. 

Some questions:

  1. Why is the disk utilization so high? We’re using stage_locally, but could that increase it that much? Is it running out of memory and paging to disk?
  2. Could this be causing the colony collapse because the new worker is out of disk space and cannot connect to the master? Why is the disk not cleared?

VEP init scripts download a huge pile of data locally.

Adding the FASTA sequence file also demands a lot of disk.

This is a good lead…

This is a good lead! Maybe try cranking disk space to something like 400G?