I searched through old chats and found that setting --worker-boot-disk-size=100 fixed this issue for a previous user. I tried raising that setting to --worker-boot-disk-size=200, and the screenshot above is from a cluster with --worker-boot-disk-size=400. I started the cluster with a command of roughly this shape:
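(The cluster name and worker counts here are placeholders, not my exact values, and I'm assuming hailctl dataproc start syntax; gcloud dataproc clusters create takes equivalent flags.)

    # Placeholder cluster name and illustrative worker counts; disk size is in GB.
    hailctl dataproc start my-cluster \
        --num-workers 2 \
        --num-preemptible-workers 100 \
        --worker-boot-disk-size 400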
I think you might want to try increasing the number of primary (non-preemptible) workers as well. For pipelines that use HDFS as temp space (this can be hard to predict, but this one probably does), it's a good idea to have no more than roughly 5-10x as many preemptible workers as non-preemptible ones.
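For example, with 10 non-preemptible workers I'd cap the preemptibles at around 50-100. In hailctl terms, that would look something like this (the cluster name and counts are just illustrative):

    # ~5x ratio: 10 non-preemptible workers, 50 preemptible.
    hailctl dataproc start my-cluster \
        --num-workers 10 \
        --num-preemptible-workers 50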
Thank you!! I switched to --worker-boot-disk-size=500 with 100 primary and 10 preemptible workers. Ten workers became unhealthy (I'm guessing the preemptibles?), but the job completed on the remaining 100 nodes.