Setting number of preemptible workers in `hailctl dataproc start`

I am running `hailctl dataproc start (cluster name) --p12 --vep GRCh37`.

It is not recognizing `--p12` to increase the number of cores.

The `-p` should have a single dash, not a double dash.

The full name is `--num-preemptible-workers`.
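
For example, with a hypothetical cluster name `dev`, either spelling should work:

hailctl dataproc start dev -p 12 --vep GRCh37
hailctl dataproc start dev --num-preemptible-workers 12 --vep GRCh37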

Just to import a VCF of 40,000 into a MatrixTable, should we specify more memory or other requirements for the cluster, or do you suggest just increasing the number of nodes?

No, just increasing the number of nodes should be fine.

I have used the following to launch my cluster:

hailctl dataproc start development --vep GRCh37 --num-preemptible-workers 40

I am running the following now on 350 cores

import hail as hl
import hail.expr.aggregators as agg
import hail.methods
import pandas as pd
from typing import *
import random
hl.init()

hl.import_vcf('gs://wes_development/complete.pheno.n100.vcf.gz', force_bgz=True, force=True).write('gs://wes_development/pipeline_file.mt', overwrite=True)
an_g = hl.read_matrix_table('gs://wes_development/pipeline_file.mt')
an_g = hl.vep(an_g, 'gs://hail-common/vep/vep/vep85-loftee-gcloud.json')
an_g.describe()
print('writing')
an_g.write('gs://wes_development/genotype_annotations_new.mt', overwrite=True)

The write command at the end has already taken an hour, but it is still running in GCP. The progress bar shows me the following:
`[Stage 1:> (0 + 7) / 7]`

I don’t think the progress bar has moved yet, so I’m not sure what’s happening.

Try loading with `min_partitions=256` on `import_vcf`. This will divide the data into more than 7 chunks, which will both use more than 7 cores and make progress more evident.

While the default partitioning is generally fine, with VEP it’s good to ensure a higher degree of parallelism.

Thanks.

So should I use the following command on import:

hl.import_vcf('gs://wes_development/complete.pheno.n100.vcf.gz', force_bgz=True, force=True, min_partitions=256).write('gs://wes_development/pipeline_file.mt', overwrite=True)

Yes, try that. However, remove the `force` option. It luckily isn’t getting used since you also pass `force_bgz`, but it’s quite dangerous.
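
In other words, keeping your paths and just dropping `force`, something like:

hl.import_vcf('gs://wes_development/complete.pheno.n100.vcf.gz', force_bgz=True, min_partitions=256).write('gs://wes_development/pipeline_file.mt', overwrite=True)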

And what about using `mt.repartition` once `import_vcf` has already run, rather than building another cluster, if stuck with only 7 cores in use?

BTW, did the last reply from @tpoterba work for you, @Danish436?

I’m preparing a similar project to run VEP on an 80 GB vcf.gz file, so I’m very interested in this topic and in anything similar you can point me to.

Thanks, Alan

Hi @alanwilter,

You should avoid using `repartition`. Repartition performs a “shuffle”, which requires non-preemptible workers and is failure-prone. I think you might find my overview of efficiently using Hail helpful. You do not need to stop and restart your cluster just to change the number of cores; you can dynamically change the cluster size.
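
For example, an existing Dataproc cluster can be resized in place with gcloud. This is only a sketch: depending on your gcloud version the flag may be spelled `--num-secondary-workers` rather than `--num-preemptible-workers`, and the worker count below is just an illustration:

gcloud dataproc clusters update development --num-preemptible-workers 80  # illustrative count; use whatever size you need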

You called your file an “80 GB vcf.gz file”. Is your file really gzipped and not bgzipped? Gzipped files cannot be read in parallel, so Hail will be extremely slow and will not use your worker nodes. If your VCF is gzipped, you should decompress that file and re-compress it as a bgzipped file.
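
A minimal sketch of that re-compression, assuming you have `bgzip` from htslib available and a hypothetical local copy of the file:

zcat your_file.vcf.gz | bgzip -c > your_file.vcf.bgz  # hypothetical filenames; bgzip writes block-gzipped output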

If your file is actually bgzipped but uses the file extension “.gz”, then `force_bgz=True` will import the file in parallel.

Many thanks @danking. Indeed, I am doing as above, since my colleague who created the file assured me it is bgzipped.

And yes, I’ve been reading your docs a lot lately :slight_smile:
