Hail cluster creation error

I am getting the following error while creating a cluster on Google Cloud after running this command…

hailctl dataproc --beta start hailpy --vep GRCh37 --optional-components=ANACONDA,JUPYTER --enable-component-gateway --bucket bucketname --project projectname --region us-central1


Your active configuration is: [cloudshell-28616]
gcloud beta dataproc clusters create \
    hailpy \
    --image-version=1.4-debian9 \
    --properties=spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g \
    --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.27/init_notebook.py,gs://hail-common/hailctl/dataproc/0.2.27/vep-GRCh37.sh \
    --metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.27/hail-0.2.27-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3 \
    --master-machine-type=n1-highmem-8 \
    --master-boot-disk-size=100GB \
    --num-master-local-ssds=0 \
    --num-preemptible-workers=0 \
    --num-worker-local-ssds=0 \
    --num-workers=2 \
    --preemptible-worker-boot-disk-size=200GB \
    --worker-boot-disk-size=200GB \
    --worker-machine-type=n1-highmem-8 \
    --zone=us-central1-b \
    --initialization-action-timeout=20m \
    --project=..... \
    --bucket=.. \
    --labels=creator=... \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway \
    --region \
    us-central1
Starting cluster 'hailpy'...
Waiting on operation [projects/.../regions/us-central1/operations/fb08d024-7087-3e83-9101-3640e376aa9b].
WARNING: For PD-Standard without local SSDs, we strongly recommend provisioning 1TB or larger to ensure consistently high I/O performance. See https://cloud.google.com/compute/docs/disks/performance for information on disk I/O performance.
Waiting for cluster creation operation...done.
ERROR: (gcloud.beta.dataproc.clusters.create) Operation [projects/cncdanalyses/regions/us-central1/operations/fb08d024-7087-3e83-9101-3640e376aa9b] failed: Multiple Errors:
 - Timeout waiting for instance hailpy-m to report in.
 - Timeout waiting for instance hailpy-w-0 to report in.
 - Timeout waiting for instance hailpy-w-1 to report in..
Traceback (most recent call last):
  File "/home/zahidhaseeb46/env/bin/hailctl", line 8, in <module>
    sys.exit(main())
File "/home/.../env/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 94, in main
    cli.main(args)
  File "/home/.../env/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 107, in main
    jmp[args.module].main(args, pass_through_args)
  File "/home/.../env/lib/python3.7/site-packages/hailtop/hailctl/dataproc/start.py", line 200, in main
    sp.check_call(cmd)
  File "/usr/local/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'beta', 'dataproc', 'clusters', 'create', 'hailpy', '--image-version=1.4-debian9', '--properties=spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g', '--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.27/init_notebook.py,gs://hail-common/hailctl/dataproc/0.2.27/vep-GRCh37.sh', '--metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.27/hail-0.2.27-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3', '--master-machine-type=n1-highmem-8', '--master-boot-disk-size=100GB', '--num-master-local-ssds=0', '--num-preemptible-workers=0', '--num-worker-local-ssds=0', '--num-workers=2', '--preemptible-worker-boot-disk-size=200GB', '--worker-boot-disk-size=200GB', '--worker-machine-type=n1-highmem-8', '--zone=us-central1-b', '--initialization-action-timeout=20m', '--project=...', '--bucket=...', '--labels=creator=...', '--optional-components=ANACONDA,JUPYTER', '--enable-component-gateway', '--region', 'us-central1']' returned non-zero exit status 1.

I think this is a Google Cloud failure; can you try again today?

OK, I will try it again, but I have been trying for 3 to 4 days with the same result…

I tried it again, but it failed with the same error…

hailctl dataproc --beta start hailpy --vep GRCh37 --optional-components=ANACONDA,JUPYTER --enable-component-gateway --bucket … --project … --region us-central1

ERROR: (gcloud.beta.dataproc.clusters.create) Operation [projects/…/regions/us-central1/operations/f6631ead-39b6-34cd-82a6-dbb802adff1e] failed: Multiple Errors:

 - Timeout waiting for instance hailpy-m to report in.
 - Timeout waiting for instance hailpy-w-0 to report in.
 - Timeout waiting for instance hailpy-w-1 to report in..
Traceback (most recent call last):
  File "/home/…/env/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/home/…/env/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 94, in main
    cli.main(args)
  File "/home/…/env/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 107, in main
    jmp[args.module].main(args, pass_through_args)
  File "/home/…/env/lib/python3.7/site-packages/hailtop/hailctl/dataproc/start.py", line 200, in main
    sp.check_call(cmd)
  File "/usr/local/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'beta', 'dataproc', 'clusters', 'create', 'hailpy', '--image-version=1.4-debian9', '--properties=spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g', '--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.27/init_notebook.py,gs://hail-common/hailctl/dataproc/0.2.27/vep-GRCh37.sh', '--metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.27/hail-0.2.27-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3', '--master-machine-type=n1-highmem-8', '--master-boot-disk-size=100GB', '--num-master-local-ssds=0', '--num-preemptible-workers=0', '--num-worker-local-ssds=0', '--num-workers=2', '--preemptible-worker-boot-disk-size=200GB', '--worker-boot-disk-size=200GB', '--worker-machine-type=n1-highmem-8', '--zone=us-central1-b', '--initialization-action-timeout=20m', '--project=…', '--bucket=…', '--labels=creator=…_gmail_com', '--optional-components=ANACONDA,JUPYTER', '--enable-component-gateway', '--region', 'us-central1']' returned non-zero exit status 1.

I created a normal cluster without any other parameters, as hailctl dataproc start develop, and it was successful… but creating a cluster using this
hailctl dataproc --beta start hailpy --vep GRCh37 --optional-components=ANACONDA,JUPYTER --enable-component-gateway --bucket … --project … --region us-central1
gives the error…

I think I mostly understand what’s going on. We set an initialization action timeout of 20m. We know VEP takes ~10-15m to install, and I think that adding --optional-components=ANACONDA,JUPYTER pushes the initialization over the 20m limit.
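If you want to test that theory, one workaround sketch: take the gcloud beta dataproc clusters create command that hailctl printed above and run it yourself with a longer timeout, e.g. replacing

    --initialization-action-timeout=20m \

with something like

    --initialization-action-timeout=40m \

(40m here is an arbitrary illustration, not a tuned value). Dropping the optional components instead, as suggested below, should also keep the initialization under the default limit.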

Why are you installing Anaconda and Jupyter using gcloud Dataproc optional components? We install miniconda and Jupyter notebooks in our initialization scripts (you can connect to a Jupyter notebook instance using hailctl dataproc connect develop notebook once the cluster is created).
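For example, with a cluster started by a plain hailctl dataproc start (cluster name hailpy assumed from the commands above), something like

hailctl dataproc connect hailpy notebook

opens the Jupyter notebook that the Hail initialization script installs on the master node; no ANACONDA/JUPYTER optional components are needed for that.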

hailctl dataproc start haseeb-hail --vep GRCh37

I have now used this command but am still getting the same error.

Does it work without VEP? I know you need VEP, but might be useful to know for debugging purposes.

@Haseeb1 can you attach to this thread the Dataproc initialization logs?
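In case it helps with locating them: the initialization action output normally ends up in the cluster's Dataproc staging bucket. A sketch of how to fetch it, assuming standard Dataproc behavior (the staging bucket name and cluster UUID below are placeholders to fill in from your project):

gsutil ls gs://YOUR-STAGING-BUCKET/google-cloud-dataproc-metainfo/
gsutil cat gs://YOUR-STAGING-BUCKET/google-cloud-dataproc-metainfo/CLUSTER-UUID/hailpy-m/dataproc-initialization-script-0_output

If the instances never reported in at all, these files may not exist, in which case the VM serial console output in the Compute Engine console is the next place to look.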

No, it is not working.

Sorry @danking, I can’t work out how to find the log files.

I am using a slightly modified command created by hailctl to launch a cluster with the suggested initialization actions for Jupyter notebook. I am getting the following error.
ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Google Cloud Storage bucket does not exist 'dataproc-6ffff547-d093-44e7-ad2c-11c2bd8bb779-us-central1'

Does the error refer to the initialization script?
My configuration for Dataproc is set to us-central1.

gcloud dataproc clusters create testhail-cluster \
    --image-version=1.4-debian9 \
    --properties="spark:spark.speculation=true,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=128g,spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1" \
    --metadata="^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.33/hail-0.2.33-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|humanize==1.0.0|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|tqdm==4.42.1" \
    --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.33/init_notebook.py \
    --master-machine-type="n1-highmem-8" \
    --master-boot-disk-size=500GB \
    --num-master-local-ssds=0 \
    --num-worker-local-ssds=0 \
    --num-workers=16 \
    --secondary-worker-boot-disk-size=100GB \
    --worker-boot-disk-size=40GB \
    --worker-machine-type="n1-standard-8" \
    --region=us-central1 \
    --initialization-action-timeout=20m

Thank you

This is quite weird; I haven’t seen anything like this before. What are the modifications from the standard hailctl command?

You’re also on version 0.2.33. It might be worth updating to 0.2.40.
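If Hail was installed with pip, upgrading is usually just (a sketch, assuming a pip-based install in the same Python 3 environment):

pip3 install -U hail

after which hailctl will generate commands referencing the newer wheel and init scripts.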

I made minor modifications to the boot disk size and number of workers. Here are the suggested configurations from dry-run
gcloud dataproc clusters create my-cluster \
    --image-version=1.4-debian9 \
    --properties=spark:spark.task.maxFailures=20,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g \
    --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.33/init_notebook.py \
    --metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.33/hail-0.2.33-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|humanize==1.0.0|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|tqdm==4.42.1 \
    --master-machine-type=n1-highmem-8 \
    --master-boot-disk-size=100GB \
    --num-master-local-ssds=0 \
    --num-preemptible-workers=0 \
    --num-worker-local-ssds=0 \
    --num-workers=2 \
    --preemptible-worker-boot-disk-size=40GB \
    --worker-boot-disk-size=40GB \
    --worker-machine-type=n1-standard-8 \
    --zone=us-central1-b \
    --initialization-action-timeout=20m

That’s a regional staging bucket that is missing. https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/staging-bucket
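One possible workaround, sketched here under the assumption that you have (or create) a bucket in us-central1 that your cluster's service account can write to (the bucket name below is a placeholder), is to pass an explicit staging bucket rather than relying on the auto-created one:

gsutil mb -l us-central1 gs://YOUR-STAGING-BUCKET
gcloud dataproc clusters create testhail-cluster --region=us-central1 --bucket=YOUR-STAGING-BUCKET ...

keeping the rest of the flags from your command above.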

They’re supposed to be created automatically. I’d check with Google to ask why it is not being created when you run gcloud dataproc clusters create. If you just run

gcloud dataproc clusters create testhail-cluster \
   --image-version=1.4-debian9 --region=us-central1

does that work?

I am getting the same error with a plain gcloud dataproc clusters create.
I also tried setting a staging bucket in the console for a cluster. I got an error that my service account “does not have storage.buckets.get access to the Google Cloud Storage bucket”. I am not sure how to set this permission in IAM; I was asked to create a custom role. I can instantiate a VM in my default project, but not a cluster. Thank you for your help.

You have a Broad Institute email account, so I assume you’re working with a Broad GCP project. Broad GCP projects should all have access to the paid premium Google Support (you can see details here).

It seems your Dataproc settings are misconfigured. I do not understand why or how this can happen. You should be able to grant the service account full access to the bucket via the Google Cloud web UI; there are some details here.
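A command-line sketch of that grant, with the service account and bucket as placeholders (roles/storage.admin on the bucket is broader than strictly necessary, but it does include the storage.buckets.get permission from your error):

gsutil iam ch serviceAccount:YOUR-SERVICE-ACCOUNT@YOUR-PROJECT.iam.gserviceaccount.com:roles/storage.admin gs://YOUR-STAGING-BUCKET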

There might be a permissions issue. Does your group have a data manager or someone else who administers privileges on Google? Your account might need more privileges.

Google support should be able to help me. I will also reach out to people who helped us set up my service account. Thanks for your suggestions.