All nodes are unhealthy

ch-kr · November 2, 2020, 5:37pm

Hi hail team,

I’m trying to run some code (will send in follow-up email) and keep running into situations where the nodes of my cluster become unhealthy:

[Screenshot: cluster nodes reported as unhealthy]
I searched through old chats and found that setting --worker-boot-disk-size=100 fixed this issue for a previous user. I tried updating that setting to --worker-boot-disk-size=200, and the screenshot above is from a cluster with --worker-boot-disk-size=400. I started the cluster with:

hailctl dataproc start kc --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --num-preemptible-workers 50 --packages gnomad --max-idle 30m --worker-boot-disk-size=400 --master-boot-disk-size=400 --project broad-mpg-gnomad --properties=spark:spark.speculation=true --num-worker-local-ssds 1

Do I need even more disk space? Or is there something else I should try? I would love any tips.

tpoterba · November 2, 2020, 8:11pm

I think you might want to try increasing the number of primary (non-preemptible) workers as well. For pipelines that use HDFS as temp space (this can be hard to predict, but this one probably does), it’s a good idea to have no more than ~5-10x as many preemptible as non-preemptible workers.
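
As a rough illustration of that rule of thumb, here is a minimal shell sketch that checks a planned cluster against a maximum preemptible-to-primary ratio. The helper names `ratio_ok` and `min_primary` are hypothetical (they are not part of hailctl or gcloud), and the default ratio of 10 is just the upper end of the ~5-10x range above.

```shell
# Hypothetical helper: succeeds when preemptible <= max_ratio * primary.
# Usage: ratio_ok PRIMARY PREEMPTIBLE [MAX_RATIO]   (MAX_RATIO defaults to 10)
ratio_ok() {
  primary=$1
  preemptible=$2
  max_ratio=${3:-10}
  [ "$primary" -gt 0 ] && [ "$preemptible" -le $((primary * max_ratio)) ]
}

# Hypothetical helper: smallest primary worker count that keeps the given
# preemptible count within the ratio (integer ceiling division).
# Usage: min_primary PREEMPTIBLE [MAX_RATIO]
min_primary() {
  preemptible=$1
  max_ratio=${2:-10}
  echo $(( (preemptible + max_ratio - 1) / max_ratio ))
}
```

For example, `min_primary 50 10` prints 5 and `min_primary 50 5` prints 10, so 50 preemptible workers would call for roughly 5 to 10 primary workers under this guideline. The original command does not set a primary worker count; if the cluster had only 2 primary workers (an assumption), `ratio_ok 2 50` would fail.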

ch-kr · November 3, 2020, 2:21pm

Thank you!! I switched to --worker-boot-disk-size=500 with 100 primary and 10 preemptible workers. 10 workers became unhealthy (guessing the preemptibles?), but the job completed with the remaining 100 nodes.

nawatts · November 4, 2020, 6:30pm

Also note that --worker-boot-disk-size applies only to the primary workers.

As a default, secondary workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
– Secondary workers - preemptible and non-preemptible VMs | Dataproc Documentation | Google Cloud

If preemptible workers are still going unhealthy due to disk space issues, you may need to set --secondary-worker-boot-disk-size.
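
A sketch of what that could look like, reusing the cluster name, worker counts, and disk size mentioned earlier in this thread. The values are illustrative only, and this assumes your hailctl version accepts `--secondary-worker-boot-disk-size` (or forwards it to gcloud). The snippet only prints the command so it can be reviewed before running.

```shell
# Illustrative values only, taken from the counts and disk size in this thread.
cluster=kc
primary=100
preemptible=10
disk_gb=500

# Build the command as a single-line string and print it for review
# instead of running it; paste the output into a terminal once it looks right.
start_cmd="hailctl dataproc start $cluster \
--num-workers $primary \
--num-preemptible-workers $preemptible \
--worker-boot-disk-size=$disk_gb \
--secondary-worker-boot-disk-size=$disk_gb"
echo "$start_cmd"
```

The other flags from the original command (machine types, `--packages`, `--max-idle`, and so on) are left out here for brevity and would need to be added back.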
