How to create a cluster with 8 CPUs and 0 preemptible workers

Assuming this is possible, of course.
I’m trying to build a cluster, but I’m limited to 8 CPUs (free trial).
I don’t want any --num-preemptible-workers, and I understand that the minimum number of workers is 2. (BTW, does the master count as one of the workers?)

So why when I try something like:

hailctl dataproc start vep-hail --vep GRCh37 --region europe-west2 --master-machine-type=n1-highmem-4

I got this:

Pulling VEP data from bucket in uk.
gcloud dataproc clusters create vep-hail \
    --image-version=1.4-debian9 \
    --properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=false|||dataproc:dataproc.monitoring.stackdriver.enable=false|||spark:spark.driver.memory=20g \
    --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.39/init_notebook.py,gs://hail-common/hailctl/dataproc/0.2.39/vep-GRCh37.sh \
    --metadata=^|||^VEP_REPLICATE=uk|||WHEEL=gs://hail-common/hailctl/dataproc/0.2.39/hail-0.2.39-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|humanize==1.0.0|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|tqdm==4.42.1 \
    --master-machine-type=n1-highmem-4 \
    --master-boot-disk-size=100GB \
    --num-master-local-ssds=0 \
    --num-preemptible-workers=0 \
    --num-worker-local-ssds=0 \
    --num-workers=2 \
    --preemptible-worker-boot-disk-size=200GB \
    --worker-boot-disk-size=200GB \
    --worker-machine-type=n1-highmem-8 \
    --region=europe-west2 \
    --initialization-action-timeout=20m \
    --labels=creator=alanwilter_gmail_com
Starting cluster 'vep-hail'...
WARNING: The `--num-preemptible-workers` flag is deprecated. Use the `--num-secondary-workers` flag instead.
WARNING: The `--preemptible-worker-boot-disk-size` flag is deprecated. Use the `--secondary-worker-boot-disk-size` flag instead.
ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Multiple validation errors:
 - Insufficient 'CPUS' quota. Requested 20.0, available 8.0.
 - Insufficient 'CPUS_ALL_REGIONS' quota. Requested 20.0, available 12.0.
 - This request exceeds CPU quota. Some things to try: request fewer workers (a minimum of 2 is required), use smaller master and/or worker machine types (such as n1-standard-2).
Traceback (most recent call last):
  File "/usr/local/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 108, in main
    jmp[args.module].main(args, pass_through_args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/start.py", line 346, in main
    sp.check_call(cmd)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'clusters', 'create', 'vep-hail', '--image-version=1.4-debian9', '--properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=false|||dataproc:dataproc.monitoring.stackdriver.enable=false|||spark:spark.driver.memory=20g', '--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.39/init_notebook.py,gs://hail-common/hailctl/dataproc/0.2.39/vep-GRCh37.sh', '--metadata=^|||^VEP_REPLICATE=uk|||WHEEL=gs://hail-common/hailctl/dataproc/0.2.39/hail-0.2.39-py3-none-any.whl|||PKGS=aiohttp>=3.6,<3.7|aiohttp_session>=2.7,<2.8|asyncinit>=0.2.4,<0.3|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|humanize==1.0.0|hurry.filesize==0.9|nest_asyncio|numpy<2|pandas>0.24,<0.26|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|tqdm==4.42.1', '--master-machine-type=n1-highmem-4', '--master-boot-disk-size=100GB', '--num-master-local-ssds=0', '--num-preemptible-workers=0', '--num-worker-local-ssds=0', '--num-workers=2', '--preemptible-worker-boot-disk-size=200GB', '--worker-boot-disk-size=200GB', '--worker-machine-type=n1-highmem-8', '--region=europe-west2', '--initialization-action-timeout=20m', '--labels=creator=alanwilter_gmail_com']' returned non-zero exit status 1.

According to the GCP docs, n1-highmem-4 and n1-standard-4 both have 4 vCPUs, so I expected a cluster of 2 workers with 4 vCPUs each, i.e. 8 vCPUs in total, but the hailctl dataproc ... command is asking for 20!

Any help here please? Many thanks in advance.

Hmm… I think I see the mess I’m making… I’m confusing master vs. worker vs. preemptible.

What I was actually hoping for is just one computer (node), a master, no workers, where the master would have 8 CPUs and do all the work. But this is not possible with hailctl, right? Is it a limitation of hailctl, or is it at the YARN/Spark level? (I’m new to this kind of cluster, sorry.)

There is always exactly one master. The master is the computer that the Python interpreter runs on; any non-Hail Python computation you do happens there.

Workers and preemptible workers are almost the same thing. They are both additional computers in your cluster that get assigned work by the master. In Hail, the majority of your computation, including all of the VEP work, takes place on the workers. The main difference is that preemptible workers are transient: Google will sometimes take one away from you mid-computation if demand from other users is high. You only pay for the machines you have at any given time, though, so you stop paying for a preemptible worker once it’s taken from you. Preemptible workers are also significantly cheaper than regular ones.

You can specify the number of regular workers with --num-workers 4 if you want 4 of them, but you always need at least 2 regular workers. In your case, the 20 requested CPUs come from the n1-highmem-4 master (4 vCPUs) plus the two default n1-highmem-8 workers (8 vCPUs each), so to fit an 8-CPU quota you also need to shrink the worker machine type (see the sketch below).
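
For example, something like the following should stay within an 8-vCPU quota. This is only a sketch: n1-standard-2 has 2 vCPUs, so it comes to 2 for the master plus 2×2 for the workers = 6 vCPUs; it assumes your hailctl version passes --worker-machine-type and --num-workers through to gcloud the same way it does --master-machine-type above, and VEP may well want more memory than these small machines provide.

hailctl dataproc start vep-hail --vep GRCh37 --region europe-west2 \
    --master-machine-type n1-standard-2 \
    --worker-machine-type n1-standard-2 \
    --num-workers 2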

You should read this for a more detailed overview than I’ve given here. I think it will help a lot: https://github.com/danking/hail-cloud-docs/blob/master/how-to-cloud.md

To answer your question though: you cannot do what you’re asking for (only one computer with 8 CPUs) with hailctl dataproc. hailctl dataproc creates clusters where most of the computation (including VEP) is done on the workers.

I suspect you have good reasons to avoid paying for Dataproc, but you might check the costs (https://cloud.google.com/products/calculator#id=e241059a-556f-473b-a9dc-b550afad1a13). Running a minimal cluster costs a couple of bucks an hour.

hailctl dataproc works by calling the Google gcloud command to create and work with Dataproc clusters. Dataproc supports single-node clusters via the --single-node option, which isn’t currently exposed in hailctl. However, you can run hailctl dataproc start --dry-run args... to see the gcloud command hailctl would run, then modify that command to add --single-node and run it yourself.
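
For example (just a sketch: reuse the exact --image-version, --properties, --initialization-actions, and --metadata values that hailctl prints for you; the only changes are adding --single-node and dropping the worker and preemptible flags, which conflict with it, and n1-highmem-8 here is simply a machine type with 8 vCPUs that fits the free-trial quota):

# 1. Print the gcloud command without running it
hailctl dataproc start vep-hail --vep GRCh37 --region europe-west2 \
    --master-machine-type n1-highmem-8 --dry-run

# 2. Copy the printed gcloud command, remove --num-workers, the --worker-* flags,
#    and the --*preemptible* flags, add --single-node, and run it yourself, e.g.:
gcloud dataproc clusters create vep-hail \
    --single-node \
    --image-version=1.4-debian9 \
    --master-machine-type=n1-highmem-8 \
    --master-boot-disk-size=100GB \
    --region=europe-west2 \
    --initialization-action-timeout=20m \
    ...   # plus the --properties, --initialization-actions, and --metadata flags exactly as printed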

Thank you all guys!

@johnc1231 very detailed explanation, I really appreciated that.

@danking We have an AWS account. I forked Hail-on-aws-spot, made several modifications, and started getting VEP working there, but I stopped short because VEP is very time-consuming to install. Now that I’ve learned how you guys use Docker, I may explore a similar solution in the future. However, so far I’m finding GCP cheaper than AWS! I need to investigate this properly.

@cseed Yep, I’ve been looking into this, and now that you’ve confirmed my suspicions I will give it a try eventually.