Hello,
I'm trying to create a cluster with hailctl dataproc, but I'm not having any success.
I ran the following command:
hailctl dataproc start my-first-cluster --subnet projects/dgtex-single-cell/regions/us-central1/subnetworks/ac-vpc-subnet3
(I have to specify the subnetwork because otherwise it gives me this error: ERROR: (gcloud.dataproc.clusters.create) NOT_FOUND: The resource 'projects/dgtex-single-cell/global/networks/default' was not found)
After running that command I got some warnings, the cluster creation ran for about 30 minutes, and it finally failed with an error:
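(In case it helps to know where that subnet path comes from: the subnets available in a project can be listed with, for example,
gcloud compute networks subnets list --project=dgtex-single-cell --regions=us-central1
though that exact command is just an illustration, not necessarily what I ran.)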
~$hailctl dataproc start my-first-cluster --subnet projects/dgtex-single-cell/regions/us-central1/subnetworks/ac-vpc-subnet3
hailctl dataproc: Creating a cluster with workers of machine type n1-standard-8.
Allocating 14592 MB of memory per executor (4 cores),
with at least 8755 MB for Hail off-heap values and 5837 MB for the JVM. Using a maximum Hail memory reservation of 3648 MB per core.
gcloud dataproc clusters create my-first-cluster \
--image-version=2.1.2-debian11 \
--properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=false|||dataproc:dataproc.monitoring.stackdriver.enable=false|||spark:spark.driver.memory=41g|||yarn:yarn.nodemanager.resource.memory-mb=29184|||yarn:yarn.scheduler.maximum-allocation-mb=14592|||spark:spark.executor.cores=4|||spark:spark.executor.memory=5837m|||spark:spark.executor.memoryOverhead=8755m|||spark:spark.memory.storageFraction=0.2|||spark:spark.executorEnv.HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=3648 \
--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.124/init_notebook.py \
--metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.124/hail-0.2.124-py3-none-any.whl|||PKGS=aiodns==2.0.0|aiohttp==3.8.5|aiosignal==1.3.1|async-timeout==4.0.3|asyncinit==0.2.4|attrs==23.1.0|avro==1.11.2|azure-common==1.1.28|azure-core==1.29.3|azure-identity==1.14.0|azure-mgmt-core==1.4.0|azure-mgmt-storage==20.1.0|azure-storage-blob==12.17.0|bokeh==3.2.2|boto3==1.28.41|botocore==1.31.41|cachetools==5.3.1|certifi==2023.7.22|cffi==1.15.1|charset-normalizer==3.2.0|click==8.1.7|commonmark==0.9.1|contourpy==1.1.0|cryptography==41.0.3|decorator==4.4.2|deprecated==1.2.14|dill==0.3.7|frozenlist==1.4.0|google-api-core==2.11.1|google-auth==2.22.0|google-auth-oauthlib==0.8.0|google-cloud-core==2.3.3|google-cloud-storage==2.10.0|google-crc32c==1.5.0|google-resumable-media==2.5.0|googleapis-common-protos==1.60.0|humanize==1.1.0|idna==3.4|isodate==0.6.1|janus==1.0.0|jinja2==3.1.2|jmespath==1.0.1|jproperties==2.1.1|markupsafe==2.1.3|msal==1.23.0|msal-extensions==1.0.0|msrest==0.7.1|multidict==6.0.4|nest-asyncio==1.5.7|numpy==1.25.2|oauthlib==3.2.2|orjson==3.9.5|packaging==23.1|pandas==2.1.0|parsimonious==0.10.0|pillow==10.0.0|plotly==5.16.1|portalocker==2.7.0|protobuf==3.20.2|py4j==0.10.9.5|pyasn1==0.5.0|pyasn1-modules==0.3.0|pycares==4.3.0|pycparser==2.21|pygments==2.16.1|pyjwt[crypto]==2.8.0|python-dateutil==2.8.2|python-json-logger==2.0.7|pytz==2023.3.post1|pyyaml==6.0.1|regex==2023.8.8|requests==2.31.0|requests-oauthlib==1.3.1|rich==12.6.0|rsa==4.9|s3transfer==0.6.2|scipy==1.11.2|six==1.16.0|sortedcontainers==2.4.0|tabulate==0.9.0|tenacity==8.2.3|tornado==6.3.3|typer==0.9.0|typing-extensions==4.7.1|tzdata==2023.3|urllib3==1.26.16|uvloop==0.17.0;sys_platform!="win32"|wrapt==1.15.0|xyzservices==2023.7.0|yarl==1.9.2 \
--master-machine-type=n1-highmem-8 \
--master-boot-disk-size=100GB \
--num-master-local-ssds=0 \
--num-secondary-workers=0 \
--num-worker-local-ssds=0 \
--num-workers=2 \
--secondary-worker-boot-disk-size=40GB \
--worker-boot-disk-size=40GB \
--worker-machine-type=n1-standard-8 \
--initialization-action-timeout=20m \
--subnet=projects/dgtex-single-cell/regions/us-central1/subnetworks/ac-vpc-subnet3 \
--labels=creator=ldomenec_broadinstitute_org
Starting cluster 'my-first-cluster'...
Waiting on operation [projects/dgtex-single-cell/regions/us-central1/operations/63943d39-b6dd-3930-b9ff-f199ce6dbaca].
Waiting for cluster creation operation...⠛
WARNING: Failed to validate permissions required for default service account: '86961133472-compute@developer.gserviceaccount.com'. Cluster creation could still be successful if required permissions have been granted to the respective service accounts as mentioned in the document https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts#dataproc_service_accounts_2. This could be due to Cloud Resource Manager API hasn't been enabled in your project '86961133472' before or it is disabled. Enable it by visiting 'https://console.developers.google.com/apis/api/cloudresourcemanager.googleapis.com/overview?project=86961133472'.
WARNING: For PD-Standard without local SSDs, we strongly recommend provisioning 1TB or larger to ensure consistently high I/O performance. See https://cloud.google.com/compute/docs/disks/performance for information on disk I/O performance.
WARNING: The firewall rules for specified network or subnetwork would likely not permit sufficient communication in the network or subnetwork for Dataproc to function properly. See https://cloud.google.com/dataproc/docs/concepts/network for information on required network setup for Dataproc.
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/dgtex-single-cell/regions/us-central1/operations/63943d39-b6dd-3930-b9ff-f199ce6dbaca] failed: Cannot start master: Timed out waiting for 2 nodes. This usually happens when VM to VM communications are blocked by firewall rules. For additional details, see https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network#firewall_rule_requirement
Operation timed out: Only 0 out of 2 minimum required datanodes running.
Operation timed out: Only 0 out of 2 minimum required node managers running..
I looked into the firewall issue and thought I had solved it (I created an ingress firewall rule exactly as specified in Dataproc Cluster Network Configuration | Dataproc Documentation | Google Cloud), but when I tried again the next day I got the same error.
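For reference, the rule I created was along these lines; the rule name, network name, and source range below are placeholders rather than the exact values I used:

gcloud compute firewall-rules create allow-dataproc-internal \
--project=dgtex-single-cell \
--network=ac-vpc \
--direction=INGRESS \
--allow=tcp:0-65535,udp:0-65535,icmp \
--source-ranges=10.0.0.0/16

In other words, an ingress rule on the cluster's VPC network that allows all TCP, UDP, and ICMP traffic from the subnet's IP range, which is what I understood Dataproc needs for communication between the cluster VMs.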
I don't know if I set that up correctly or if there's something else going on… Could you please help me solve this problem?
Many thanks!