I’ve used the `hailctl dataproc start` command with the `--no-address` and `--subnet` flags to try to create a Dataproc cluster that uses only internal IP addresses, in order to avoid hitting the in-use IP address quota on GCP.
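
For reference, this is roughly the command I ran (the cluster name, project, and subnet below are placeholders):

```bash
hailctl dataproc start my-cluster \
    --subnet=projects/my-project/regions/us-central1/subnetworks/my-subnet \
    --no-address
```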
I believe cluster creation failed because the nodes couldn’t reach the internet to install the required packages, since they only have private IP addresses. Does anybody have any ideas on how to get around this?
I’ve never heard of anyone bumping into that quota, and we have made some pretty huge clusters. The easiest solution might be to ask Google to raise your quota, but I don’t know the details of your organization.
You could probably get away with downloading the dependencies into Google Storage and then writing a new init script that installs them from there (a rough sketch is below), but that won’t be a pleasant experience, and it won’t be fun to maintain.
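
Something like this untested sketch, where the bucket path is a placeholder and you’d have to mirror the wheels into the bucket yourself beforehand:

```bash
#!/bin/bash
# Hypothetical init action: install Python dependencies from a GCS mirror
# instead of PyPI. Mirror the wheels beforehand with something like:
#   pip download hail -d wheels/ && gsutil -m cp wheels/* gs://my-bucket/wheels/
set -ex

WHEEL_DIR=/tmp/wheels
mkdir -p "${WHEEL_DIR}"

# Pull the pre-downloaded wheels from the bucket. This should work without
# external IPs, assuming Private Google Access is enabled on your subnet.
gsutil -m cp 'gs://my-bucket/wheels/*' "${WHEEL_DIR}/"

# --no-index keeps pip from ever reaching out to PyPI.
pip install --no-index --find-links="${WHEEL_DIR}" hail
```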
It’s also worth noting that only the driver node of the cluster needs to install anything from the internet, so if you can configure things so that only the leader node has a public IP, that will probably work.
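
I haven’t tried this myself, but one possible way to do that (a rough sketch, with placeholder names) is to create the cluster with internal IPs only and then attach an external IP to just the master VM, which Dataproc names `<cluster-name>-m`. One caveat to verify: initialization actions run during cluster creation, so you may need to re-run the package installation on the master after the address is attached.

```bash
# Create the cluster without external IPs (names are placeholders).
hailctl dataproc start my-cluster --subnet=my-subnet --no-address

# Attach an external IP to the master node only; Dataproc names it
# "<cluster-name>-m".
gcloud compute instances add-access-config my-cluster-m \
    --zone=us-central1-a \
    --access-config-name="external-nat"
```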
Yeah, for some reason it seems to be our organization’s most common resource limit. We asked Google to raise our quota, but they denied the request and suggested we look into using internal IP addresses instead. The suggestion to configure only the driver node with a public IP address sounds like a good place to start. Thank you for your help!