Hi Hail team,
I’m trying to run some code (will send in follow-up email) and keep running into situations where the nodes of my cluster become unhealthy:
I searched through old chats and found that setting --worker-boot-disk-size=100 fixed this issue for a previous user. I tried updating that setting to --worker-boot-disk-size=200, and the screenshot above is from a cluster with --worker-boot-disk-size=400. I started the cluster with:
hailctl dataproc start kc \
  --master-machine-type n1-highmem-8 \
  --worker-machine-type n1-highmem-8 \
  --num-preemptible-workers 50 \
  --packages gnomad \
  --max-idle 30m \
  --worker-boot-disk-size=400 \
  --master-boot-disk-size=400 \
  --project broad-mpg-gnomad \
  --properties=spark:spark.speculation=true \
  --num-worker-local-ssds 1
Do I need even more disk space, or is there something else I should try? I would appreciate any tips.