Setting number of preemptible workers in `hailctl dataproc start`

I am writing hailctl dataproc start (cluster name) —p12 and —vep GRCh37

It is not recognizing —p12 to increase the number of preemptible workers.

The -p option takes a single dash, not a double dash; its full name is --num-preemptible-workers.

Just to import a VCF of 40,000 into a MatrixTable, should we specify different memory or other cluster requirements, or do you suggest just increasing the number of nodes?

No, just increasing the number of nodes should be fine.

I have used the following to launch my cluster:

hailctl dataproc start development --vep GRCh37 --num-preemptible-workers 40

I am running the following now on 350 cores

import hail as hl
import hail.expr.aggregators as agg
import hail.methods
import pandas as pd
from typing import *
import random

hl.import_vcf('gs://wes_development/complete.pheno.n100.vcf.gz', force_bgz=True, force=True).write('gs://wes_development/', overwrite=True)
an_g = hl.read_matrix_table('gs://wes_development/')
an_g = hl.vep(an_g, 'gs://hail-common/vep/vep/vep85-loftee-gcloud.json')
an_g.write('gs://wes_development/', overwrite=True)

The write command at the end has already been running for an hour in GCP. The progress bar shows me the following:
[Stage 1:> (0 + 7) / 7]

I don't think the progress bar has moved yet, so I'm not sure what's happening.

Try loading with min_partitions=256 on import_vcf: this will divide the data into more than 7 chunks, which will both use more than 7 cores and make progress more evident.

While the default partitioning is generally fine, with VEP it's good to ensure a higher degree of parallelism.
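As a rough illustration of why the partition count matters (plain Python, using the 350-core figure from this thread): a Spark/Hail stage runs at most one concurrent task per partition, so 7 partitions can keep at most 7 cores busy no matter how large the cluster is.

```python
# Rough sketch: partitions cap the usable parallelism of a stage,
# since each partition is processed by at most one task at a time.
TOTAL_CORES = 350  # cluster size mentioned in this thread

def usable_cores(n_partitions: int, total_cores: int = TOTAL_CORES) -> int:
    """At most one concurrent task per partition."""
    return min(n_partitions, total_cores)

for n in (7, 256):
    used = usable_cores(n)
    print(f"{n:>3} partitions -> at most {used} of {TOTAL_CORES} cores busy")
```

With the default 7 partitions, 343 of the 350 cores sit idle; at 256 partitions the cluster is nearly saturated.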


So should I use the following command on import:

hl.import_vcf('gs://wes_development/complete.pheno.n100.vcf.gz', force_bgz=True, force=True, min_partitions=256).write('gs://wes_development/', overwrite=True)

Yes, try that. However, remove the force option: luckily it isn't being used here since you pass force_bgz, but it's quite dangerous (force reads a gzipped file serially, with no parallelism).
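Putting the suggestions from this thread together, the cleaned-up pipeline would look something like the sketch below (same paths as above, force removed, min_partitions=256 added; untested here, since it needs a running Dataproc cluster with VEP):

```python
import hail as hl

# Import the block-gzipped VCF. force_bgz treats the .gz file as bgzip,
# and min_partitions spreads the work over many more cores than the default.
mt = hl.import_vcf(
    'gs://wes_development/complete.pheno.n100.vcf.gz',
    force_bgz=True,
    min_partitions=256,
)
mt.write('gs://wes_development/', overwrite=True)

# Read the written MatrixTable back, annotate with VEP, and write the result.
mt = hl.read_matrix_table('gs://wes_development/')
mt = hl.vep(mt, 'gs://hail-common/vep/vep/vep85-loftee-gcloud.json')
mt.write('gs://wes_development/', overwrite=True)
```

Note that the final write targets the same path that was just read; writing the annotated result to a separate path avoids overwriting an open input.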