Dataproc Workers Lost After Intensive Task

We are running a large series of computations: VEP, Hail computations, and a number of joins. After a while, the Dataproc workers seem to be lost from Spark’s perspective.

Symptoms:

  • The Hail progress output stalls
  • The logs say (we can provide the full logs on request):
2019-07-10 12:41:33 YarnSchedulerBackend$YarnSchedulerEndpoint: WARN: Attempted to get executor loss reason for executor id 212 at RPC address 10.128.0.37:60906, but got no response. Marking as slave [sic] lost.

Testing:
In the past, the same tasks worked on smaller VCFs of up to 4M variants. This VCF has 64M variants.

Conditions:

  1. 2 non-preemptible + 12 preemptible workers: stalls at around 3 hours.
  2. 2 non-preemptible + 50 preemptible workers: stalls later.

What could be disconnecting the workers? Is it preemption? Should we try with only non-preemptibles?

This definitely looks like preemption.

If each partition of your pipeline takes a very long time to run, then preemption will kill you, since the average time to preemption may be shorter than the average time to complete a task.

Increasing the partitioning may help.
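Something like the sketch below, for example (the paths and partition counts are placeholders, not taken from your pipeline): either ask for more partitions when importing the VCF, or repartition an already-loaded MatrixTable, so that individual tasks finish well within the typical time-to-preemption.

```python
import hail as hl

# Placeholder path; request more (and therefore smaller) partitions at
# import time so each Spark task is short-lived.
mt = hl.import_vcf(
    'gs://my-bucket/my-dataset.vcf.bgz',
    reference_genome='GRCh38',
    min_partitions=20000,
)

# Or repartition a MatrixTable that is already loaded; this incurs a
# one-time shuffle.
mt = mt.repartition(20000, shuffle=True)
```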

Thanks Tim. Any insight as to why we see 0 Spark workers, though? Wouldn’t the replacement preemptible connect to Spark and at least try to run? We’ll check our partitioning again, but when we load from a VCF, doesn’t it choose a reasonably optimal default partition size?

The VCF we read has 6,766 partitions, is 623.34 GiB, and has 62,892,544 variants over 1,329 samples. The reference data we annotate with has 54,104 partitions, is 104.62 GB, and has 8,899,147,478 rows.

For single-dataset processing, yes. But joins are extremely sensitive to partitioning mismatches – and Spark doesn’t make it especially easy to deal with this problem.

Loading both datasets with the same number of partitions would probably be a good strategy, I think.
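As a rough sketch (the paths are placeholders, and whether you repartition the table or the MatrixTable is up to you), you could bring the annotation table up to the MatrixTable’s partition count before the join:

```python
import hail as hl

# Placeholder inputs.
mt = hl.import_vcf('gs://my-bucket/my-dataset.vcf.bgz',
                   reference_genome='GRCh38')
ref_ht = hl.read_table('gs://my-bucket/reference_data.ht')

# Match the reference table's partitioning to the MatrixTable before the
# join; repartition() shuffles the table once.
ref_ht = ref_ht.repartition(mt.n_partitions(), shuffle=True)

# Key-based row annotation (the join happens here).
mt = mt.annotate_rows(ref=ref_ht[mt.row_key])
```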

Hi Tim, it looks like this might be related to the GRCh38 cluster VEP. We are using hailctl to start the cluster with VEP for GRCh38, and we are using this config: https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/c813e090875de22aac272ea29c1cc8737c58bcd9/hail_scripts/v02/utils/hail_utils.py#L122.
It looks like @bw2 might be seeing the same thing when running VEP on GRCh38?
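For context, the VEP call made through that config looks roughly like this (the input path and the JSON config path below are placeholders for illustration; the actual config is set in the linked hail_utils.py):

```python
import hail as hl

# Placeholder input MatrixTable.
mt = hl.read_matrix_table('gs://my-bucket/my-dataset.mt')

# Placeholder VEP config path on a cluster started with
# `hailctl dataproc start ... --vep GRCh38`.
vep_config = 'file:///vep_data/vep-gcloud.json'

# Run VEP; block_size controls how many variants are sent to VEP per batch.
mt = hl.vep(mt, config=vep_config, block_size=1000)
```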

I’m actually not convinced this is preemption. Try a cluster of non-preemptibles only and see if it persists. With VEP (and it’s especially bad on GRCh38), I definitely see cases of a worker dying and never coming back. The short-term solution is to add more nodes while the job is running (you can try decreasing the number of nodes first and then adding them back, though there is some risk of a full Spark collapse when you do so). But we really should figure this out long-term, maybe with Google.

Ben found some problems with VEP GRCh38 over the weekend. Let’s try after that goes in?

Thanks Tim. If you’re referring to https://github.com/hail-is/hail/pull/6632/files, that should be unrelated; he’s also still seeing those issues.

@konradjk, you have not built the GRCh38 VEP cache yet, right? https://github.com/macarthur-lab/gnomad_hail/commit/33c85f0fb40b4c3561c5447e9d626f7c76277725#diff-c3a345a8b489837c633007c0ad2e76b3R250

I think Ben subsetted to a smaller 50k-variant set and it still failed. I’m not familiar with the versioning of Hail/VEP, but was something introduced recently?

Correct, I have not made a cache yet, partially because I haven’t been able to run VEP all the way through even the gnomAD genomes! (I do have an exome version that I can put up as a cache at some point)

Since this is the first time in a long while that Ben and I are running GRCh38 on v02, could we be doing something wrong?

VEP kept failing yesterday for both large and small numbers of variants, so I switched to non-preemptible nodes today, and it looks like it ran all the way through on a subset that had failed with preemptibles.

One thing I’m still wondering about is why cluster utilization is so low while it’s running.

Later it went up to ~70% while still running VEP.