How to create a cluster with 8 CPUs and 0 preemptible workers

Assuming this is possible, of course.
I’m trying to build a cluster, but I’m limited to 8 CPUs (free trial).
I don’t want any --num-preemptible-workers, and I understand that the minimum number of workers is 2. (BTW, does the master count as one of the workers?)

So why when I try something like:

hailctl dataproc start vep-hail --vep GRCh37 --region europe-west2 --master-machine-type=n1-highmem-4

I got this:

Pulling VEP data from bucket in uk.
gcloud dataproc clusters create vep-hail \
    --image-version=1.4-debian9 \
    --properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=false|||dataproc:dataproc.monitoring.stackdriver.enable=false|||spark:spark.driver.memory=20g \
    --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.39/init_notebook.py,gs://hail-common/hailctl/dataproc/0.2.39/vep-GRCh37.sh \
    --metadata=^|||^VEP_REPLICATE=uk|||WHEEL=gs://hail-common/hailctl/dataproc/0.2.39/hail-0.2.39-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|humanize==1.0.0|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|tqdm==4.42.1 \
    --master-machine-type=n1-highmem-4 \
    --master-boot-disk-size=100GB \
    --num-master-local-ssds=0 \
    --num-preemptible-workers=0 \
    --num-worker-local-ssds=0 \
    --num-workers=2 \
    --preemptible-worker-boot-disk-size=200GB \
    --worker-boot-disk-size=200GB \
    --worker-machine-type=n1-highmem-8 \
    --region=europe-west2 \
    --initialization-action-timeout=20m \
    --labels=creator=alanwilter_gmail_com
Starting cluster 'vep-hail'...
WARNING: The `--num-preemptible-workers` flag is deprecated. Use the `--num-secondary-workers` flag instead.
WARNING: The `--preemptible-worker-boot-disk-size` flag is deprecated. Use the `--secondary-worker-boot-disk-size` flag instead.
ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Multiple validation errors:
 - Insufficient 'CPUS' quota. Requested 20.0, available 8.0.
 - Insufficient 'CPUS_ALL_REGIONS' quota. Requested 20.0, available 12.0.
 - This request exceeds CPU quota. Some things to try: request fewer workers (a minimum of 2 is required), use smaller master and/or worker machine types (such as n1-standard-2).
Traceback (most recent call last):
  File "/usr/local/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 108, in main
    jmp[args.module].main(args, pass_through_args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/start.py", line 346, in main
    sp.check_call(cmd)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'clusters', 'create', 'vep-hail', '--image-version=1.4-debian9', '--properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=false|||dataproc:dataproc.monitoring.stackdriver.enable=false|||spark:spark.driver.memory=20g', '--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.39/init_notebook.py,gs://hail-common/hailctl/dataproc/0.2.39/vep-GRCh37.sh', '--metadata=^|||^VEP_REPLICATE=uk|||WHEEL=gs://hail-common/hailctl/dataproc/0.2.39/hail-0.2.39-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|humanize==1.0.0|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|tqdm==4.42.1', '--master-machine-type=n1-highmem-4', '--master-boot-disk-size=100GB', '--num-master-local-ssds=0', '--num-preemptible-workers=0', '--num-worker-local-ssds=0', '--num-workers=2', '--preemptible-worker-boot-disk-size=200GB', '--worker-boot-disk-size=200GB', '--worker-machine-type=n1-highmem-8', '--region=europe-west2', '--initialization-action-timeout=20m', '--labels=creator=alanwilter_gmail_com']' returned non-zero exit status 1.

According to the GCP docs, n1-highmem-4 and n1-standard-4 both have 4 vCPUs, so I expected a cluster of 2 workers with 4 vCPUs each, i.e. 8 vCPUs in total, but the hailctl dataproc ... command is asking for 20!

Any help here please? Many thanks in advance.

Hmm… I think I see the mess I’m making… I’m confusing master vs. worker vs. preemptible.

What I was actually hoping for is just one computer (node), a master, no workers, where the master would have 8 CPUs and do all the work. But this is not possible with hailctl, right? Is it a limitation of hailctl, or is it at the YARN/Spark level? (I’m new to this kind of cluster, sorry.)

There is always exactly one master. The master is the computer that the Python interpreter runs on; any non-Hail Python computation you do happens there.

Workers and preemptible workers are almost the same thing. They are both additional computers in your cluster that get assigned work by the master. In Hail, the majority of your computation, including all of the VEP work, takes place on the workers. The main difference is that preemptible workers are transient: Google will sometimes take one away from you mid-computation if demand from other users is high. You only pay for the machines you have at any given time, though, so you stop paying for a preemptible worker once it’s taken from you. Preemptible workers are also significantly cheaper than regular ones.

You can specify the number of regular workers with --num-workers 4 if you want 4 of them, but you always need at least 2 regular workers. In your case, the 20 requested CPUs come from the n1-highmem-4 master (4 vCPUs) plus the two default n1-highmem-8 workers (8 vCPUs each), so to fit an 8-CPU quota you also need to shrink the worker machine type (see the sketch below).
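
For example, something like the following should stay within an 8-vCPU quota. This is only a sketch: n1-standard-2 has 2 vCPUs, so it comes to 2 for the master plus 2×2 for the workers = 6 vCPUs; it assumes your hailctl version passes --worker-machine-type and --num-workers through to gcloud the same way it does --master-machine-type above, and VEP may well want more memory than these small machines provide.

hailctl dataproc start vep-hail --vep GRCh37 --region europe-west2 \
    --master-machine-type n1-standard-2 \
    --worker-machine-type n1-standard-2 \
    --num-workers 2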

You should read this for a more detailed overview than I’ve given here. I think it will help a lot: https://github.com/danking/hail-cloud-docs/blob/master/how-to-cloud.md

To answer your question though: you cannot do what you’re asking for (only one computer with 8 CPUs) with hailctl dataproc. hailctl dataproc creates clusters where most of the computation (including VEP) is done on the workers.

I suspect you have good reasons to avoid paying for Dataproc, but you might check the costs (https://cloud.google.com/products/calculator#id=e241059a-556f-473b-a9dc-b550afad1a13). Running a minimal cluster costs a couple of bucks an hour.

hailctl dataproc works by calling the Google gcloud command to create and work with Dataproc clusters. Dataproc supports single-node clusters via the --single-node option, which isn’t currently exposed in hailctl. However, you can run hailctl dataproc start --dry-run args... to see the gcloud command hailctl would run, then modify that command to add --single-node and run it yourself.
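
For example (just a sketch: reuse the exact --image-version, --properties, --initialization-actions, and --metadata values that hailctl prints for you; the only changes are adding --single-node and dropping the worker and preemptible flags, which conflict with it, and n1-highmem-8 here is simply a machine type with 8 vCPUs that fits the free-trial quota):

# 1. Print the gcloud command without running it
hailctl dataproc start vep-hail --vep GRCh37 --region europe-west2 \
    --master-machine-type n1-highmem-8 --dry-run

# 2. Copy the printed gcloud command, remove --num-workers, the --worker-* flags,
#    and the --*preemptible* flags, add --single-node, and run it yourself, e.g.:
gcloud dataproc clusters create vep-hail \
    --single-node \
    --image-version=1.4-debian9 \
    --master-machine-type=n1-highmem-8 \
    --master-boot-disk-size=100GB \
    --region=europe-west2 \
    --initialization-action-timeout=20m \
    ...   # plus the --properties, --initialization-actions, and --metadata flags exactly as printed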

Thank you all guys!

@johnc1231 very detailed explanation, I really appreciated that.

@danking We have an AWS account. I forked Hail-on-aws-spot, made several modifications, and started getting VEP working there, but I stopped short because VEP is very time-consuming to install. Now that I’ve learned how you guys use Docker, I may explore a similar solution in the future. However, so far I’m finding GCP cheaper than AWS! I need to investigate this properly.

@cseed Yep, I’ve been looking into this, and now that you’ve confirmed my suspicions I will give it a try eventually.