All nodes are unhealthy

Hi hail team,

I’m trying to run some code (will send in follow-up email) and keep running into situations where the nodes of my cluster become unhealthy:

[screenshot: cluster nodes reported as unhealthy]
I searched through old chats and found that setting --worker-boot-disk-size=100 fixed this issue for a previous user. I tried updating that setting to --worker-boot-disk-size=200, and the screenshot above is from a cluster with --worker-boot-disk-size=400. I started the cluster with:

hailctl dataproc start kc --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --num-preemptible-workers 50 --packages gnomad --max-idle 30m --worker-boot-disk-size=400 --master-boot-disk-size=400 --project broad-mpg-gnomad --properties=spark:spark.speculation=true --num-worker-local-ssds 1

Do I need even more disk space? Or is there something else I should try? I would love any tips.

I think you might want to try increasing the number of primary (non-preemptible) workers as well. For pipelines that use HDFS as temp space (it can be hard to predict which ones do, but this one probably does), it’s a good idea to have no more than ~5-10x as many preemptible workers as non-preemptible workers.
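
As a rough sketch of that rule of thumb, here is how the minimum number of primary workers could be worked out for a given preemptible count. The function name and the default ratio of 5 are my own illustrative choices, not anything built into hailctl or Dataproc.

```python
import math


def min_primary_workers(num_preemptible: int, max_ratio: float = 5.0) -> int:
    """Smallest primary-worker count that keeps preemptible/primary <= max_ratio.

    Hypothetical helper: max_ratio encodes the ~5-10x rule of thumb above.
    """
    if num_preemptible < 0:
        raise ValueError("num_preemptible must be non-negative")
    if max_ratio <= 0:
        raise ValueError("max_ratio must be positive")
    # Round up so the ratio is never exceeded.
    return math.ceil(num_preemptible / max_ratio)


# The cluster in the question asked for 50 preemptible workers.
print(min_primary_workers(50))       # conservative end of the range (5x)
print(min_primary_workers(50, 10))   # permissive end of the range (10x)
```

With 50 preemptible workers this suggests roughly 5 to 10 primary workers, depending on which end of the range you pick.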


Thank you!! I switched to --worker-boot-disk-size=500 with 100 primary and 10 preemptible workers. 10 workers became unhealthy (guessing the preemptibles?), but the job completed with the remaining 100 nodes.

Also note that --worker-boot-disk-size applies only to the primary workers.

As a default, secondary workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
https://cloud.google.com/dataproc/docs/concepts/compute/secondary-vms
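
To make that default concrete, here is a small sketch that simply restates the quoted rule (smaller of 100 GB or the primary boot disk size). The function name is hypothetical, and this does not call any Dataproc API.

```python
def default_secondary_boot_disk_gb(primary_boot_disk_gb: int) -> int:
    """Default secondary-worker boot disk size implied by the quoted rule.

    Assumption: sizes are whole gigabytes and no explicit secondary size is set.
    """
    if primary_boot_disk_gb <= 0:
        raise ValueError("primary_boot_disk_gb must be positive")
    return min(100, primary_boot_disk_gb)


# Raising the primary boot disk past 100 GB does not change the secondary default.
for primary_gb in (50, 100, 200, 400):
    print(primary_gb, "->", default_secondary_boot_disk_gb(primary_gb))
```

Under that rule, going from --worker-boot-disk-size=200 to 400 would leave the preemptible workers' disks unchanged.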

If preemptible workers are still going unhealthy due to disk space issues, you may need to set --secondary-worker-boot-disk-size.
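
For illustration, a minimal sketch of assembling a start command that sets both disk sizes. It assumes hailctl accepts --secondary-worker-boot-disk-size or passes it through to gcloud, which I have not verified here. The helper name and the 500 GB values are hypothetical, and the other flags from the original command are left out for brevity.

```python
import shlex


def start_command(cluster: str, primary_disk_gb: int, secondary_disk_gb: int) -> str:
    """Build a hailctl start command string with both boot disk sizes set.

    Hypothetical helper: it only formats the command, it does not run it.
    """
    args = [
        "hailctl", "dataproc", "start", cluster,
        f"--worker-boot-disk-size={primary_disk_gb}",
        # Assumed to be accepted by hailctl or forwarded to gcloud.
        f"--secondary-worker-boot-disk-size={secondary_disk_gb}",
    ]
    # shlex.join quotes any argument that would need it in a shell.
    return shlex.join(args)


print(start_command("kc", 500, 500))
```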
