Dataproc Workers Lost After Intensive Task

Note that the log is for @mwilson’s liftover. I started running my seqr pipeline with 100G; is that still too little?
Edit: Up from the default 40G.

It might be OK, but I just wanted to see if this could even possibly fix it, so I went 10X.

Re: liftover, that’s also pretty hard on disk since it loads the FASTA. Maybe a new manifestation of https://github.com/hail-is/hail/issues/5371?

The script with liftover succeeded after increasing the disk to 100GB on the workers and preemptibles.
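For anyone who hits the same thing, the knobs involved are the boot-disk-size flags at cluster creation time. A rough sketch only: the cluster name and worker counts are placeholders, and flag names can differ between gcloud SDK releases, so check gcloud dataproc clusters create --help for your version.

# Sketch: bump the boot disks on both primary and preemptible/secondary workers.
# "my-cluster" and the worker counts are placeholders; flag names are taken from
# current gcloud and may be spelled differently in older SDKs.
gcloud dataproc clusters create my-cluster \
    --num-workers=2 \
    --num-secondary-workers=10 \
    --worker-boot-disk-size=100GB \
    --secondary-worker-boot-disk-size=100GB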


If your cluster is still running, can you ssh into it and run df -h? I’m curious whether that disk went to the boot disk or to HDFS.

Filesystem      Size  Used Avail Use% Mounted on
udev             26G     0   26G   0% /dev
tmpfs           5.2G   12M  5.1G   1% /run
/dev/sda1        99G  7.5G   87G   8% /
tmpfs            26G     0   26G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            26G     0   26G   0% /sys/fs/cgroup
tmpfs           5.2G     0  5.2G   0% /run/user/110
tmpfs           5.2G     0  5.2G   0% /run/user/111
tmpfs           5.2G     0  5.2G   0% /run/user/113
tmpfs           5.2G     0  5.2G   0% /run/user/112
tmpfs           5.2G     0  5.2G   0% /run/user/115
tmpfs           5.2G     0  5.2G   0% /run/user/6967

Err, oops, I guess you changed the workers, not the master (that’s already 100G by default), so the change would actually show up on one of those. But I just noticed the parameter is --worker-boot-disk-size, so that’s definitely going to the local boot drive.
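For completeness, both sides are easy to check from the cluster itself. The worker name and zone below are placeholders following the usual Dataproc naming convention (my-cluster-w-0 for the first primary worker):

# Local filesystems on a worker; the boot disk shows up as / in df output.
gcloud compute ssh my-cluster-w-0 --zone=us-central1-a --command='df -h'

# HDFS capacity as the cluster sees it (run from any node in the cluster).
hadoop fs -df -h /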

Sorry, I sshed into a worker and got similar results:

Filesystem      Size  Used Avail Use% Mounted on
udev             15G     0   15G   0% /dev
tmpfs           3.0G   11M  3.0G   1% /run
/dev/sda1        99G  5.7G   89G   6% /
tmpfs            15G     0   15G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            15G     0   15G   0% /sys/fs/cgroup
tmpfs           3.0G     0  3.0G   0% /run/user/111
tmpfs           3.0G     0  3.0G   0% /run/user/6967

Increasing from 40G to 100G seems to have fixed the seqr issue we were having, so it looks like we know a potential fix for colony collapse. Should we tell the EPA?

All along, we just needed a bigger hive…

OK, here’s my rough understanding:

  1. We create temporary files on executors during Spark execution. These temporary files get cleaned up on a cue from the driver machine (I think?), not when the executor Java process dies.
  2. An executor fills up its local disk and dies from exceeding its disk or memory resources.
  3. No new executors can be started on that node because it does not have the requested resources.
  4. Proceed to a new node and start again at (1).

The real problem here seems to be that temp files aren’t getting cleaned up when executors die.
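If anyone wants to watch this happen, the executor scratch files should be visible on a worker under the YARN application cache. The path below is my assumption of the Dataproc default for yarn.nodemanager.local-dirs, so adjust it if your image puts the local dirs somewhere else:

# Largest per-application scratch dirs (Spark shuffle/spill files) on this worker.
# /hadoop/yarn/nm-local-dir is an assumed default; check yarn.nodemanager.local-dirs.
sudo du -sh /hadoop/yarn/nm-local-dir/usercache/*/appcache/* 2>/dev/null | sort -h | tail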

That seems sensible, though does it explain why they all die at once? It seems like they’d be filling their temp space at slightly different rates (but maybe it’s close enough)?

This also explains why one way to solve the problem is just to use more nodes so that no single node’s temp space ever fills up (e.g. if you’re only running ~2-3 tasks per executor, this isn’t a problem; it only shows up after an executor has run a few).

Yeah, it might be just close enough. I don’t have the graphs anymore, but it could be that all of the nodes fail gradually rather than all at once, since they all have roughly the same work and the same disk utilization.

I think increasing the node count would help, but it could be that the minimum disk size is a function of the actual task, so no matter how high a node count and how few tasks per node you have, you’ll still need a minimum disk size for a single task.

In the liftover script, I know all my nodes were dying within a 10-second span; whether that counts as all at once or gradual, I don’t know.