Dataproc Workers Lost After Intensive Task

We are running a large series of computations: VEP, Hail computations, and a number of joins. After a while, the Dataproc workers seem to be lost from Spark’s perspective.

Symptoms:

  • The Hail output progress stalls
  • The logs (we can provide the full logs if requested) say:
2019-07-10 12:41:33 YarnSchedulerBackend$YarnSchedulerEndpoint: WARN: Attempted to get executor loss reason for executor id 212 at RPC address 10.128.0.37:60906, but got no response. Marking as slave [sic] lost.

Testing:
In the past, the same tasks worked on smaller VCFs of up to 4M variants. This VCF has 64M variants.

Conditions:

  1. 2 non-preemptible workers + 12 preemptibles: stalls at around 3 hours.
  2. 2 non-preemptible workers + 50 preemptibles: stalls later.

What could it be that disconnects the workers? Is it the preemption? Should we try with only non-preemptibles?

This definitely looks like preemption.

If each partition of your pipeline takes a very long time to run, then preemption will kill you since the average time to preemption may be smaller than the average time to complete a task.

Increasing the partitioning may help.
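Something like this, as a minimal sketch (the path and the partition count are placeholders, not recommendations for your exact data):

    import hail as hl

    # Hypothetical path and partition count -- adjust to your pipeline.
    # More partitions means shorter tasks, so each task has a better chance
    # of finishing before a preemption hits the worker running it.
    mt = hl.import_vcf('gs://my-bucket/my-data.vcf.bgz',
                       reference_genome='GRCh38',
                       min_partitions=20000)

    # An already-loaded MatrixTable can also be repartitioned, at the
    # cost of a shuffle.
    mt = mt.repartition(20000)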

Thanks Tim. Any insight as to why we see 0 Spark workers, though? Wouldn’t the new replacement preemptible connect to Spark and at least try to run? We’ll check our partitioning again, but if we load from a VCF, doesn’t it set a pretty optimal default partition size?

The VCF we read has 6766 partitions, 623.34 GiB, and 62892544 variants over 1329 samples. The reference data we annotate with has 54104 partitions, 104.62 GB, and 8899147478 rows.

For single-dataset processing, yes. But joins are extremely sensitive to partitioning mismatches – and Spark doesn’t make it especially easy to deal with this problem.

Loading both datasets with the same number of partitions would probably be a good strategy, I think.
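Roughly along these lines (a sketch only; the paths and the target partition count are hypothetical, and repartition triggers a full shuffle, so it isn’t free):

    import hail as hl

    N_PARTITIONS = 20000  # hypothetical target; tune to your data and cluster

    # Import the VCF with at least N_PARTITIONS partitions.
    mt = hl.import_vcf('gs://my-bucket/my-data.vcf.bgz',
                       reference_genome='GRCh38',
                       min_partitions=N_PARTITIONS)

    # Read the reference annotation table and bring its partitioning in line
    # with the VCF before the join (assuming it is keyed compatibly with the
    # VCF row key, e.g. by locus and alleles).
    ref_ht = hl.read_table('gs://my-bucket/reference_data.ht')
    ref_ht = ref_ht.repartition(N_PARTITIONS)

    # Standard keyed-join annotation.
    mt = mt.annotate_rows(ref_data=ref_ht[mt.row_key])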

Hi Tim, it looks like this might be related to the GRCh38 cluster VEP. We are using hailctl to start the cluster with VEP for GRCh38 and this config: https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/c813e090875de22aac272ea29c1cc8737c58bcd9/hail_scripts/v02/utils/hail_utils.py#L122.
It looks like @bw2 might be seeing the same thing when running VEP on 38?

I’m actually not convinced this is preemption. Try a cluster of non-preemptibles only and see if it persists. With VEP (and it’s especially bad on GRCh38), I definitely see cases of a worker dying and never coming back. The short-term solution is to add more nodes while the job is running (you can try decreasing the number of nodes first and then adding them back, though there is some risk of full Spark collapse when you do so). But we really should figure this out long-term, maybe with Google.

Ben found some problems with VEP GRCh38 over the weekend. Let’s try after that goes in?

Thanks Tim, if you’re referring to https://github.com/hail-is/hail/pull/6632/files, that should be unrelated; he’s also still seeing those issues.

@konradjk you haven’t made a GRCh38 VEP cache yet, right? https://github.com/macarthur-lab/gnomad_hail/commit/33c85f0fb40b4c3561c5447e9d626f7c76277725#diff-c3a345a8b489837c633007c0ad2e76b3R250

I think Ben subsetted down to a smaller 50k-variant set and it still failed. I’m not familiar with the versioning of Hail/VEP, but was something introduced recently?

Correct, I have not made a cache yet, partially because I haven’t been able to run VEP all the way through even the gnomAD genomes! (I do have an exome version that I can put up as a cache at some point)

Since this is the first time Ben and I are running 38 on v02 in a long time, could we be doing something wrong?

VEP kept failing yesterday for both large and small numbers of variants, so I switched to non-preemptible nodes today, and it looks like it ran all the way through on a subset that had failed with preemptibles.

One thing I’m still wondering about is why cluster utilization is so low while it’s running.

Later it went up to ~70% while still running VEP.

We tried disabling LOFTEE (by removing the plugins line from the VEP JSON at run time; let me know if this isn’t how it’s done) and we still get the same thing. So far, it seems to happen only when running VEP on 38. Is there a way to downgrade the VEP version, and if so, how and to which version should we set it? This requires modifying the hailctl dataproc start command, right?

We’re trying to find a solution that does not involve adding more workers or changing them to non-preemptibles, both for cost reasons and because we’d also like to handle fault tolerance.

I ran a separate pipeline last night and saw this same issue. I started my cluster with 2 workers and 20 preemptibles. The pipeline reads in a VCF, runs a validation module, runs liftover when the VCF is on build 38, and then runs other sample QC modules. The pipeline lost all preemptibles, and then the workers, during the liftover portion of the pipeline and never regained them. I was using version 0.2.18-08ec699f0fd4. Unfortunately, even with max-idle set, the cluster remained up.

I tried the liftover script on non-preemptibles only and had the same failure: 0 tasks being worked on, and then all nodes hit an unhealthy state.

@konradjk, this is colony collapse syndrome, right? Could you write a few sentences dumping what you know about that?

We contacted Google support; they looked at the logs and responded with:

Thanks for letting me know the timezone. I took a look at the logs and, specifically in the YARN resource manager log (yarn-yarn-resourcemanager-mw2-m.log), I see that there was high disk utilization just before 3:00 PM eastern. The logs are in UTC as you pointed out:

2019-07-18 18:59:38,509 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node mw2-w-4.c.seqr-project.internal:39123 reported UNHEALTHY with details: 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /hadoop/yarn/nm-local-dir : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/userlogs : used space above threshold of 90.0% ] 

Taking a look at Stackdriver monitoring at 12 PM pacific/3 PM eastern the HDFS utilization goes to 100% and HDFS capacity drops from 319 GiB to 0. This seems to be what's causing your issue. I've attached screenshots of these graphs. 

Try provisioning more disk to these instances or examine the job you're running to see why it has high HDFS utilization. 

Some questions:

  1. Why is the disk utilization so high? We’re using stage_locally, but could that increase it that much? Is it running out of memory and paging to disk?
  2. Could this be causing the colony collapse because the new worker is out of disk space and cannot connect to the master? Why is the disk not cleared?

VEP init scripts download a huge pile of data locally.

Adding the FASTA sequence file also demands a lot of disk.

This is a good lead…

This is a good lead! Maybe try cranking disk space to something like 400G?