Setting number of preemptible workers in `hailctl dataproc start`

I am running `hailctl dataproc start (cluster name) --p12 --vep GRCh37`.

It is not recognizing `--p12` to increase the number of cores.

The `-p` should have a single dash, not a double dash.

The full name is `--num-preemptible-workers`.
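
For example, with a hypothetical cluster name `dev`, either spelling should work:

hailctl dataproc start dev -p 12 --vep GRCh37
hailctl dataproc start dev --num-preemptible-workers 12 --vep GRCh37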

Just to import a VCF of 40,000 into a MatrixTable, should we specify more memory or other requirements for the cluster, or do you suggest just increasing the number of nodes?

No, just increasing the number of nodes should be fine.

I have used the following to launch my cluster:

hailctl dataproc start development --vep GRCh37 --num-preemptible-workers 40

I am running the following now on 350 cores

import hail as hl
import hail.expr.aggregators as agg
import hail.methods
import pandas as pd
from typing import *
import random
hl.init()

hl.import_vcf('gs://wes_development/complete.pheno.n100.vcf.gz', force_bgz=True, force=True).write('gs://wes_development/pipeline_file.mt', overwrite=True)
an_g = hl.read_matrix_table('gs://wes_development/pipeline_file.mt')
an_g = hl.vep(an_g, 'gs://hail-common/vep/vep/vep85-loftee-gcloud.json')
an_g.describe()
print('writing')
an_g.write('gs://wes_development/genotype_annotations_new.mt', overwrite=True)

The write command at the end has already taken an hour, but it is still running in GCP. The progress bar shows me the following:
`[Stage 1:> (0 + 7) / 7]`

I don’t think the progress bar has moved yet, so I’m not sure what’s happening.

Try loading with `min_partitions=256` on `import_vcf`. This will divide the data into more than 7 chunks, which will both use more than 7 cores and make progress more evident.

While the default partitioning is generally fine, with VEP it’s good to ensure a higher degree of parallelism.

Thanks.

So should I use the following command on import:

hl.import_vcf('gs://wes_development/complete.pheno.n100.vcf.gz', force_bgz=True, force=True, min_partitions=256).write('gs://wes_development/pipeline_file.mt', overwrite=True)

Yes, try that. However, remove the `force` option. It luckily isn’t getting used since you also pass `force_bgz`, but it’s quite dangerous.
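
In other words, keeping your paths and just dropping `force`, something like:

hl.import_vcf('gs://wes_development/complete.pheno.n100.vcf.gz', force_bgz=True, min_partitions=256).write('gs://wes_development/pipeline_file.mt', overwrite=True)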

And what about using `mt.repartition` once `import_vcf` has already run, rather than building another cluster, if stuck with only 7 cores in use?

BTW, did the last reply from @tpoterba work for you, @Danish436?

I’m preparing a similar project to run VEP on an 80 GB vcf.gz file, so I’m very interested in this topic and in anything similar you can point me to.

Thanks, Alan

Hi @alanwilter,

You should avoid using `repartition`. Repartition performs a “shuffle”, which requires non-preemptible workers and is failure-prone. I think you might find my overview of efficiently using Hail helpful. You do not need to stop and restart your cluster just to change the number of cores; you can dynamically change the cluster size.
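
For example, an existing Dataproc cluster can be resized in place with gcloud. This is only a sketch: depending on your gcloud version the flag may be spelled `--num-secondary-workers` rather than `--num-preemptible-workers`, and the worker count below is just an illustration:

gcloud dataproc clusters update development --num-preemptible-workers 80  # illustrative count; use whatever size you need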

You called your file an “80 GB vcf.gz file”. Is your file really gzipped and not bgzipped? Gzipped files cannot be read in parallel, so Hail will be extremely slow and will not use your worker nodes. If your VCF is gzipped, you should decompress that file and re-compress it as a bgzipped file.
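
A minimal sketch of that re-compression, assuming you have `bgzip` from htslib available and a hypothetical local copy of the file:

zcat your_file.vcf.gz | bgzip -c > your_file.vcf.bgz  # hypothetical filenames; bgzip writes block-gzipped output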

If your file is actually bgzipped but uses the file extension “.gz”, then `force_bgz=True` will import the file in parallel.

Many thanks @danking. Indeed, I am doing as above, since my colleague who created the file assured me it is bgzipped.

And yes, I’ve been reading your docs a lot lately :slight_smile:
