Hi!
I am running Hail on a Google Cloud Dataproc cluster and it seems to be taking a lot longer than I expected. Here are the Dataproc parameters I'm running with:
--master-machine-type=n1-highmem-64
--master-boot-disk-size 10000
and the autoscaling policy:
basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    gracefulDecommissionTimeout: 0s
    scaleDownFactor: 1.0
    scaleUpFactor: 1.0
secondaryWorkerConfig:
  maxInstances: 20000
  weight: 1
workerConfig:
  maxInstances: 10
  minInstances: 2
  weight: 1
The exact Hail command I'm running is run_combiner, using 10k genome gVCFs as input:
import hail as hl

hl.experimental.run_combiner(
my_gvcf_list,
...
key_by_locus_and_alleles=True,
reference_genome='GRCh38',
use_genome_default_intervals=True,
target_records=10000
)
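In case the defaults matter, here is the fuller shape of the call as I understand it from the run_combiner docs. The paths and sample list below are placeholders, and branch_factor and batch_size are what I believe are the defaults, since I haven't set them myself:

import hail as hl

# Placeholder list; the real one holds ~10k gs:// gVCF paths
my_gvcf_list = ['gs://my-bucket/gvcfs/sample1.g.vcf.gz']

hl.experimental.run_combiner(
    my_gvcf_list,
    out_file='gs://my-bucket/combined.mt',  # placeholder output path
    tmp_path='gs://my-bucket/tmp/',         # placeholder scratch path
    key_by_locus_and_alleles=True,
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
    target_records=10000,
    branch_factor=100,  # default, I believe: inputs merged per combine step
    batch_size=100,     # default, I believe: gVCFs imported per job
)

Would bumping branch_factor or batch_size meaningfully change the wall-clock time here?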
From the Dataproc cluster logs, I can see the Hail progress bar slowly advancing:
[Stage 0:===========> (56896 + 80) / 253428]
However, this job has been running for 115 hours so far, and being only ~20% complete after that long is slower than I expected.
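For what it's worth, a quick back-of-the-envelope extrapolation from those progress bar numbers:

# Projected total stage time from the progress bar above
done, total = 56896, 253428
hours_so_far = 115
print(f"{done / total:.1%} complete")                     # -> 22.5% complete
print(f"~{hours_so_far * total / done:.0f} hours total")  # -> ~512 hours total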
Is there a way to speed this job up by changing any parameters, or am I underestimating how long it should take?