We are running a large series of computations: VEP, Hail computations, and a number of joins. After a while, the Dataproc workers appear to be lost from Spark's perspective.
- The Hail output progress stalls
- The logs (I can provide the full logs on request) say:
2019-07-10 12:41:33 YarnSchedulerBackend$YarnSchedulerEndpoint: WARN: Attempted to get executor loss reason for executor id 212 at RPC address 10.128.0.37:60906, but got no response. Marking as slave [sic] lost.
- From Dataproc’s perspective, the preemptibles are up
- The Spark UI shows no workers
- Resource utilization suggests workers are down
One interesting thing to note is that YARN memory shows as 0 available at times.
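In case it helps with diagnosis, here is a minimal sketch of how the executor-loss warnings could be tallied per worker address, to see whether the losses cluster on particular nodes. The function names are hypothetical, and the regex assumes every warning has the same shape as the single line quoted above; I have not checked it against other log variants.

```python
import re
from collections import Counter

# Assumed shape of the YARN scheduler warning, based only on the one
# line quoted in this post.
LOSS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"Attempted to get executor loss reason for executor id (?P<id>\d+) "
    r"at RPC address (?P<host>[\d.]+):(?P<port>\d+)"
)


def lost_executors(lines):
    """Yield (timestamp, executor_id, host) for each executor-loss warning."""
    for line in lines:
        match = LOSS_RE.search(line)
        if match:
            yield match.group("ts"), int(match.group("id")), match.group("host")


def losses_per_host(lines):
    """Count executor-loss warnings per worker IP address."""
    return Counter(host for _, _, host in lost_executors(lines))


# The warning exactly as quoted in this post.
quoted_line = (
    "2019-07-10 12:41:33 YarnSchedulerBackend$YarnSchedulerEndpoint: WARN: "
    "Attempted to get executor loss reason for executor id 212 at RPC address "
    "10.128.0.37:60906, but got no response. Marking as slave [sic] lost."
)

if __name__ == "__main__":
    for host, count in losses_per_host([quoted_line]).most_common():
        print(host, count)
```

Feeding it the lines of the driver log instead of `[quoted_line]` would give a per-host count, and the timestamps from `lost_executors` could be compared against the time at which the job stalls.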
In the past, the same tasks worked on smaller VCFs of up to 4M variants. This VCF is 64M variants.
- With 2 non-preemptibles and 12 preemptibles, the job stalls at around 3 hours.
- With 2 non-preemptibles and 50 preemptibles, the job stalls later.
What could it be that disconnects the workers? Is it the preemption? Should we try with only non-preemptibles?