We are running a large series of computations: VEP, Hail computations, and a number of joins. After a while, the Dataproc workers seem to be lost from Spark’s perspective.
Symptoms:
The Hail output progress stalls
The logs say (we can provide the full logs if requested):
2019-07-10 12:41:33 YarnSchedulerBackend$YarnSchedulerEndpoint: WARN: Attempted to get executor loss reason for executor id 212 at RPC address 10.128.0.37:60906, but got no response. Marking as slave [sic] lost.
From Dataproc’s perspective, the preemptibles are up
If each partition of your pipeline takes a very long time to run, then preemption will kill you since the average time to preemption may be smaller than the average time to complete a task.
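As a back-of-the-envelope check (my own model, not something Dataproc guarantees): if preemptions arrive roughly at random, i.e. as a Poisson process, the fraction of tasks that finish before their worker is preempted falls off exponentially with task length:

```python
import math

def task_survival_fraction(task_minutes, mean_minutes_to_preemption):
    """Fraction of tasks expected to finish before their worker is
    preempted, assuming exponentially distributed preemption gaps."""
    return math.exp(-task_minutes / mean_minutes_to_preemption)

# Short tasks mostly survive; long tasks mostly don't.
short = task_survival_fraction(5, 60)    # ~0.92
long = task_survival_fraction(120, 60)   # ~0.14
```

So halving partition sizes (doubling partition count) can make the difference between most tasks finishing and most tasks being killed mid-flight.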
Thanks Tim. Any insight as to why we see 0 Spark workers, though? Wouldn’t the replacement preemptible connect to Spark and at least try to run? We’ll check our partitioning again, but if we load from a VCF, doesn’t Hail pick a reasonably optimal default partition size?
The VCF we read has 6766 partitions, 623.34 GiB, and 62892544 variants over 1329 samples. The reference data we annotate with has 54104 partitions, 104.62 GB, and 8899147478 rows.
For single-dataset processing, yes. But joins are extremely sensitive to partitioning mismatches, and Spark doesn’t make it especially easy to deal with this problem.
Loading both datasets with the same number of partitions would probably be a good strategy, I think.
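To make the mismatch concrete, here is a rough partition-size calculation using the numbers from the post above (treating GB and GiB loosely; these are estimates, not measurements):

```python
# Approximate per-partition sizes from the figures quoted in the thread.
def mib_per_partition(total_gib, n_partitions):
    return total_gib * 1024 / n_partitions

vcf = mib_per_partition(623.34, 6766)    # ~94 MiB per partition
ref = mib_per_partition(104.62, 54104)   # ~2 MiB per partition

# Very different granularities make the join shuffle-heavy. One option
# is to pick a common target partition size (say 128 MiB) for both sides:
def n_partitions_for(total_gib, target_mib=128):
    return max(1, round(total_gib * 1024 / target_mib))
```

In Hail you could then (if I recall the API correctly) pass `min_partitions` to `hl.import_vcf` and/or call `.repartition(n)` on the table/matrix table so both sides of the join land near the same granularity; check the 0.2 docs for the exact signatures.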
I’m actually not convinced this is preemption. Try on a cluster of non-preemptibles only and see if it persists. In VEP (and it’s especially bad on GRCh38), I definitely see cases of a worker dying and never coming back. The short-term solution is to add more nodes while the job is running (you can try to decrease the number of nodes first and then add them back, though there is some risk of full Spark collapse when you do so). But we really should figure this out long-term, maybe with Google.
I think Ben subsetted down to a smaller 50k-variant set and it still failed. I’m not familiar with the versioning of Hail/VEP, but was something introduced recently?
Correct, I have not made a cache yet, partially because I haven’t been able to run VEP all the way through even the gnomAD genomes! (I do have an exome version that I can put up as a cache at some point)
VEP kept failing yesterday for both large and small numbers of variants, so I switched to non-preemptible nodes today, and it looks like it ran all the way through on a subset that had failed with preemptibles.
One thing I’m still wondering about is why cluster utilization is so low while it’s running.
We tried disabling LOFTEE (by removing the plugins line in the VEP JSON at run time; let me know if this isn’t how it’s done), and we still get the same thing. So far, it seems to happen only when running VEP on GRCh38. Is there a way to downgrade the VEP version, and if so, how, and which version should we set it to? This requires modifying the hailctl dataproc start command, right?
We’re trying to find a solution that does not involve increasing the number of workers or switching them to non-preemptibles, both for cost reasons and because we’d like preemption fault tolerance to be handled properly.
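For reference, here is the kind of edit we made to disable the plugin. The config shape below is an assumption on my part (a `command` list of VEP CLI arguments, as in the Hail-style VEP JSON); the paths and keys are hypothetical placeholders, not our actual config:

```python
# Hypothetical sketch: strip "--plugin <value>" pairs out of a VEP
# config JSON's command list. File layout is an assumption; check
# your actual vep config before relying on this.
vep_config = {
    "command": [
        "/vep", "--format", "vcf", "--json", "--everything",
        "--plugin", "LoF,loftee_path:/path/to/loftee",
    ],
    "env": {"PERL5LIB": "/vep_data/loftee"},
}

def drop_plugins(cmd):
    out, skip = [], False
    for arg in cmd:
        if skip:               # drop the value following --plugin
            skip = False
            continue
        if arg == "--plugin":
            skip = True
            continue
        out.append(arg)
    return out

vep_config["command"] = drop_plugins(vep_config["command"])
```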
I ran a separate pipeline last night and saw this same issue. I started my cluster with 2 workers and 20 preemptibles. The pipeline reads in a VCF, runs a validation module, runs liftover when the VCF is on build 38, and then runs other sample QC modules. The pipeline lost all preemptibles, and then the workers, during the liftover portion, and never regained them. I was using version 0.2.18-08ec699f0fd4. Unfortunately, even with max-idle set, the cluster remained up.
We contacted google support and they looked at the logs and responded with:
Thanks for letting me know the timezone. I took a look at the logs and, specifically in the YARN resource manager log (yarn-yarn-resourcemanager-mw2-m.log), I see that there was high disk utilization just before 3:00 PM eastern. The logs are in UTC as you pointed out:
2019-07-18 18:59:38,509 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node mw2-w-4.c.seqr-project.internal:39123 reported UNHEALTHY with details: 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /hadoop/yarn/nm-local-dir : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/userlogs : used space above threshold of 90.0% ]
Taking a look at Stackdriver monitoring at 12 PM pacific/3 PM eastern the HDFS utilization goes to 100% and HDFS capacity drops from 319 GiB to 0. This seems to be what's causing your issue. I've attached screenshots of these graphs.
Try provisioning more disk to these instances or examine the job you're running to see why it has high HDFS utilization.
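The 90% threshold in that YARN log line comes from YARN's node-manager disk health checker, which marks a node UNHEALTHY once its local dirs pass the configured utilization percentage. The property below is from stock YARN (`yarn-default.xml`); raising it only buys time, and the real fix is more local disk or less spill, so treat this as a stopgap:

```xml
<!-- yarn-site.xml; on Dataproc this can typically be set at cluster
     creation via a "yarn:"-prefixed property. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>
```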
Some questions:
Why is the disk utilization so high? We’re using stage_locally, but could that increase it that much? Is it running out of memory and paging to disk?
Could this be causing the colony collapse, because the new worker is out of disk space and cannot connect to the master? And why is the disk not cleared?
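Rough arithmetic on the disk question (assumptions, not measurements): a shuffle join can materialize on local disk an amount of data on the order of its inputs, and `stage_locally` additionally copies input files to local disk before reading them. Against the 319 GiB of HDFS capacity Stackdriver reported, the inputs from this thread alone already overflow:

```python
# Order-of-magnitude lower bound on shuffle spill for the join in
# this thread; 319 GiB is the HDFS capacity Stackdriver reported.
vcf_gib = 623.34
ref_gib = 104.62 * 1e9 / 2**30   # convert the GB figure to GiB (~97.4)
cluster_disk_gib = 319

shuffle_estimate = vcf_gib + ref_gib   # ~720 GiB, well past capacity
overflows = shuffle_estimate > cluster_disk_gib
```

If that's roughly right, running out of local disk mid-shuffle would also explain why replacement workers immediately go UNHEALTHY: YARN refuses to schedule on them until the spill directories are cleaned up.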