Dataproc Workers Lost After Intensive Task

Note that the log is for @mwilson’s liftover. I started running my seqr pipeline with 100G; is that still too little?
Edit: Up from the default 40G.

It might be OK, but I just wanted to see if this could even possibly fix it, so I went 10X.

Re: liftover, that’s also pretty hard on disk since it loads the FASTA. Maybe a new manifestation of https://github.com/hail-is/hail/issues/5371?

The script with liftover succeeded after increasing the disk to 100GB on the workers and preemptibles.
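For anyone who hits the same thing, the knobs involved are the boot-disk-size flags at cluster creation time. A rough sketch only: the cluster name and worker counts are placeholders, and flag names can differ between gcloud SDK releases, so check gcloud dataproc clusters create --help for your version.

# Sketch: bump the boot disks on both primary and preemptible/secondary workers.
# "my-cluster" and the worker counts are placeholders; flag names are taken from
# current gcloud and may be spelled differently in older SDKs.
gcloud dataproc clusters create my-cluster \
    --num-workers=2 \
    --num-secondary-workers=10 \
    --worker-boot-disk-size=100GB \
    --secondary-worker-boot-disk-size=100GB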


If your cluster is still running, can you ssh into it and run df -h? I’m curious whether that disk went to the boot disk or to HDFS.

Filesystem      Size  Used Avail Use% Mounted on
udev             26G     0   26G   0% /dev
tmpfs           5.2G   12M  5.1G   1% /run
/dev/sda1        99G  7.5G   87G   8% /
tmpfs            26G     0   26G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            26G     0   26G   0% /sys/fs/cgroup
tmpfs           5.2G     0  5.2G   0% /run/user/110
tmpfs           5.2G     0  5.2G   0% /run/user/111
tmpfs           5.2G     0  5.2G   0% /run/user/113
tmpfs           5.2G     0  5.2G   0% /run/user/112
tmpfs           5.2G     0  5.2G   0% /run/user/115
tmpfs           5.2G     0  5.2G   0% /run/user/6967

Err, oops, I guess you changed the workers, not the master (that’s already 100G by default), so the change would actually show up on one of those. But I just noticed the parameter is --worker-boot-disk-size, so that’s definitely going to the local boot drive.
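For completeness, both sides are easy to check from the cluster itself. The worker name and zone below are placeholders following the usual Dataproc naming convention (my-cluster-w-0 for the first primary worker):

# Local filesystems on a worker; the boot disk shows up as / in df output.
gcloud compute ssh my-cluster-w-0 --zone=us-central1-a --command='df -h'

# HDFS capacity as the cluster sees it (run from any node in the cluster).
hadoop fs -df -h /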

Sorry, I sshed into a worker and got similar results:

Filesystem      Size  Used Avail Use% Mounted on
udev             15G     0   15G   0% /dev
tmpfs           3.0G   11M  3.0G   1% /run
/dev/sda1        99G  5.7G   89G   6% /
tmpfs            15G     0   15G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            15G     0   15G   0% /sys/fs/cgroup
tmpfs           3.0G     0  3.0G   0% /run/user/111
tmpfs           3.0G     0  3.0G   0% /run/user/6967

Increasing from 40G to 100G seems to have fixed the seqr issue we were having, so it looks like we know a potential fix for colony collapse. Should we tell the EPA?

All along, we just needed a bigger hive…

OK, here’s my rough understanding:

  1. We create temporary files on executors during Spark execution. These temporary files get cleaned up on a cue from the driver machine (I think?), not when the executor Java process dies.
  2. An executor fills up its local disk and dies from exceeding its disk or memory resources.
  3. No new executors can be started on that node because it does not have the requested resources.
  4. Proceed to a new node and start again at (1).

The real problem here seems to be that temp files aren’t getting cleaned up when executors die.
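If anyone wants to watch this happen, the executor scratch files should be visible on a worker under the YARN application cache. The path below is my assumption of the Dataproc default for yarn.nodemanager.local-dirs, so adjust it if your image puts the local dirs somewhere else:

# Largest per-application scratch dirs (Spark shuffle/spill files) on this worker.
# /hadoop/yarn/nm-local-dir is an assumed default; check yarn.nodemanager.local-dirs.
sudo du -sh /hadoop/yarn/nm-local-dir/usercache/*/appcache/* 2>/dev/null | sort -h | tail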

That seems sensible, though does it explain why they all die at once? It seems like they’d be filling their temp space at slightly different rates (but maybe it’s close enough)?

This also explains why one way to solve the problem is just to use more nodes so that no single node’s temp space ever fills up (e.g. if you’re only running ~2-3 tasks per executor, this isn’t a problem; it only shows up after an executor has run a few).

Yeah, it might be just close enough. I don’t have the graphs anymore, but it could be that all of the nodes fail gradually rather than all at once, since they all have roughly the same work and the same disk utilization.

I think increasing the node count would help, but it could be that the minimum disk size is a function of the actual task, so no matter how high a node count and how few tasks per node you have, you’ll still need a minimum disk size for a single task.

In the liftover script, I know all my nodes were dying within a 10-second span; whether that counts as all at once or gradual, I don’t know.