Gcloud.dataproc.clusters.create: failed: Cannot start master

Hello,

I’m trying to run hailctl dataproc and create a cluster, but I’m not having any success.

I run the following command:

hailctl dataproc start my-first-cluster --subnet projects/dgtex-single-cell/regions/us-central1/subnetworks/ac-vpc-subnet3

(I have to specify the subnetwork because otherwise it gives me this error: ERROR: (gcloud.dataproc.clusters.create) NOT_FOUND: The resource 'projects/dgtex-single-cell/global/networks/default' was not found.)

After running the above command, I get some warnings; it spends about 30 minutes trying to create the cluster and finally fails with an error:

~$ hailctl dataproc start my-first-cluster --subnet projects/dgtex-single-cell/regions/us-central1/subnetworks/ac-vpc-subnet3
hailctl dataproc: Creating a cluster with workers of machine type n1-standard-8.
  Allocating 14592 MB of memory per executor (4 cores),
  with at least 8755 MB for Hail off-heap values and 5837 MB for the JVM.  Using a maximum Hail memory reservation of 3648 MB per core.
gcloud dataproc clusters create my-first-cluster \
    --image-version=2.1.2-debian11 \
    --properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=false|||dataproc:dataproc.monitoring.stackdriver.enable=false|||spark:spark.driver.memory=41g|||yarn:yarn.nodemanager.resource.memory-mb=29184|||yarn:yarn.scheduler.maximum-allocation-mb=14592|||spark:spark.executor.cores=4|||spark:spark.executor.memory=5837m|||spark:spark.executor.memoryOverhead=8755m|||spark:spark.memory.storageFraction=0.2|||spark:spark.executorEnv.HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=3648 \
    --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.124/init_notebook.py \
    --metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.124/hail-0.2.124-py3-none-any.whl|||PKGS=aiodns==2.0.0|aiohttp==3.8.5|aiosignal==1.3.1|async-timeout==4.0.3|asyncinit==0.2.4|attrs==23.1.0|avro==1.11.2|azure-common==1.1.28|azure-core==1.29.3|azure-identity==1.14.0|azure-mgmt-core==1.4.0|azure-mgmt-storage==20.1.0|azure-storage-blob==12.17.0|bokeh==3.2.2|boto3==1.28.41|botocore==1.31.41|cachetools==5.3.1|certifi==2023.7.22|cffi==1.15.1|charset-normalizer==3.2.0|click==8.1.7|commonmark==0.9.1|contourpy==1.1.0|cryptography==41.0.3|decorator==4.4.2|deprecated==1.2.14|dill==0.3.7|frozenlist==1.4.0|google-api-core==2.11.1|google-auth==2.22.0|google-auth-oauthlib==0.8.0|google-cloud-core==2.3.3|google-cloud-storage==2.10.0|google-crc32c==1.5.0|google-resumable-media==2.5.0|googleapis-common-protos==1.60.0|humanize==1.1.0|idna==3.4|isodate==0.6.1|janus==1.0.0|jinja2==3.1.2|jmespath==1.0.1|jproperties==2.1.1|markupsafe==2.1.3|msal==1.23.0|msal-extensions==1.0.0|msrest==0.7.1|multidict==6.0.4|nest-asyncio==1.5.7|numpy==1.25.2|oauthlib==3.2.2|orjson==3.9.5|packaging==23.1|pandas==2.1.0|parsimonious==0.10.0|pillow==10.0.0|plotly==5.16.1|portalocker==2.7.0|protobuf==3.20.2|py4j==0.10.9.5|pyasn1==0.5.0|pyasn1-modules==0.3.0|pycares==4.3.0|pycparser==2.21|pygments==2.16.1|pyjwt[crypto]==2.8.0|python-dateutil==2.8.2|python-json-logger==2.0.7|pytz==2023.3.post1|pyyaml==6.0.1|regex==2023.8.8|requests==2.31.0|requests-oauthlib==1.3.1|rich==12.6.0|rsa==4.9|s3transfer==0.6.2|scipy==1.11.2|six==1.16.0|sortedcontainers==2.4.0|tabulate==0.9.0|tenacity==8.2.3|tornado==6.3.3|typer==0.9.0|typing-extensions==4.7.1|tzdata==2023.3|urllib3==1.26.16|uvloop==0.17.0;sys_platform!="win32"|wrapt==1.15.0|xyzservices==2023.7.0|yarl==1.9.2 \
    --master-machine-type=n1-highmem-8 \
    --master-boot-disk-size=100GB \
    --num-master-local-ssds=0 \
    --num-secondary-workers=0 \
    --num-worker-local-ssds=0 \
    --num-workers=2 \
    --secondary-worker-boot-disk-size=40GB \
    --worker-boot-disk-size=40GB \
    --worker-machine-type=n1-standard-8 \
    --initialization-action-timeout=20m \
    --subnet=projects/dgtex-single-cell/regions/us-central1/subnetworks/ac-vpc-subnet3 \
    --labels=creator=ldomenec_broadinstitute_org
Starting cluster 'my-first-cluster'...
Waiting on operation [projects/dgtex-single-cell/regions/us-central1/operations/63943d39-b6dd-3930-b9ff-f199ce6dbaca].
Waiting for cluster creation operation...
WARNING: Failed to validate permissions required for default service account: '86961133472-compute@developer.gserviceaccount.com'. Cluster creation could still be successful if required permissions have been granted to the respective service accounts as mentioned in the document https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts#dataproc_service_accounts_2. This could be due to Cloud Resource Manager API hasn't been enabled in your project '86961133472' before or it is disabled. Enable it by visiting 'https://console.developers.google.com/apis/api/cloudresourcemanager.googleapis.com/overview?project=86961133472'.
WARNING: For PD-Standard without local SSDs, we strongly recommend provisioning 1TB or larger to ensure consistently high I/O performance. See https://cloud.google.com/compute/docs/disks/performance for information on disk I/O performance.
WARNING: The firewall rules for specified network or subnetwork would likely not permit sufficient communication in the network or subnetwork for Dataproc to function properly. See https://cloud.google.com/dataproc/docs/concepts/network for information on required network setup for Dataproc.
Waiting for cluster creation operation...done.                                                                                                                                            
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/dgtex-single-cell/regions/us-central1/operations/63943d39-b6dd-3930-b9ff-f199ce6dbaca] failed: Cannot start master: Timed out waiting for 2 nodes. This usually happens when VM to VM communications are blocked by firewall rules. For additional details, see https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network#firewall_rule_requirement
Operation timed out: Only 0 out of 2 minimum required datanodes running.
Operation timed out: Only 0 out of 2 minimum required node managers running..

I checked the firewall issue and thought I had solved it (I created an ingress firewall rule exactly as specified in the Dataproc Cluster Network Configuration documentation, https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network), but when I waited and tried again the next day, I got the same error.

I don’t know whether I did that correctly, or if there’s something else going on… Could you please help me solve this problem?

Many thanks!

Hey @ldomenech!

I suspect the firewall rule isn’t configured exactly right. Who configured this project for you? At Broad, the person who configures a project usually sets up its networks appropriately.

I believe networks always have the same name as their subnets, so your network should be ac-vpc-subnet3. You can confirm the network and subnet names with

gcloud compute networks subnets list --project dgtex-single-cell
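
If the list is long, you can narrow it to the subnet in question using gcloud's standard --filter and --format flags; the NETWORK column tells you which network the subnet actually belongs to:

gcloud compute networks subnets list --project dgtex-single-cell \
    --filter="name=ac-vpc-subnet3" \
    --format="table(name,region,network)"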

Now, if you run

gcloud compute firewall-rules list --project dgtex-single-cell

You should see at least this line (possibly among others):

default-allow-internal  ac-vpc-subnet3  INGRESS    1000      tcp:0-65535,udp:0-65535,icmp        False

(with ac-vpc-subnet3 replaced by the name of your network).
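
If the rule is missing entirely, you can create one. A minimal sketch, assuming the 10.128.0.0/9 source range I mention below actually covers your subnet:

# allow all internal TCP/UDP/ICMP traffic between VMs on the network
gcloud compute firewall-rules create default-allow-internal \
    --project dgtex-single-cell \
    --network ac-vpc-subnet3 \
    --direction INGRESS \
    --priority 1000 \
    --source-ranges 10.128.0.0/9 \
    --allow tcp:0-65535,udp:0-65535,icmp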

If that also looks OK, my next best guess is that the IP ranges specified in that rule do not cover the VMs of your Dataproc cluster. AFAIK, 10.128.0.0/9 is meant to contain all of your internal IPs. You can verify this after starting your cluster with gcloud compute instances list.
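
For example, something like the following compares the rule's source ranges against your VMs' internal IPs (the field paths use gcloud's --format syntax):

# which source ranges does the rule actually allow?
gcloud compute firewall-rules describe default-allow-internal \
    --project dgtex-single-cell \
    --format="value(sourceRanges)"
# which internal IPs do the cluster VMs have?
gcloud compute instances list --project dgtex-single-cell \
    --format="table(name,networkInterfaces[0].networkIP)"

Every IP from the second command should fall inside one of the ranges from the first.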

Hi @danking ,

Thank you very much! It was indeed a network/subnetwork and firewall rule configuration problem. I managed to solve it by following your steps: I created a new network and subnetwork with the same name, and added the default-allow-internal firewall rule to the new network.
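
For anyone who finds this thread later, the commands I ran were roughly the following (the subnet range is just an example; use whatever fits your project, and make sure the firewall rule's source range covers it):

# create a custom-mode network and a subnet of the same name
gcloud compute networks create ac-vpc-subnet3 \
    --project dgtex-single-cell \
    --subnet-mode custom
gcloud compute networks subnets create ac-vpc-subnet3 \
    --project dgtex-single-cell \
    --network ac-vpc-subnet3 \
    --region us-central1 \
    --range 10.128.0.0/20
# allow internal traffic between VMs on the new network
gcloud compute firewall-rules create default-allow-internal \
    --project dgtex-single-cell \
    --network ac-vpc-subnet3 \
    --source-ranges 10.128.0.0/9 \
    --allow tcp:0-65535,udp:0-65535,icmp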

Thanks again,

Laura
