I am running Hail on a Google Cloud Dataproc cluster and it seems to be taking a lot longer than I expected. Here are the Dataproc parameters I'm running with:
and the auto-scaling policy:
```yaml
basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    gracefulDecommissionTimeout: 0s
    scaleDownFactor: 1.0
    scaleUpFactor: 1.0
secondaryWorkerConfig:
  maxInstances: 20000
  weight: 1
workerConfig:
  maxInstances: 10
  minInstances: 2
  weight: 1
```
The exact Hail command I'm running is `run_combiner`, using 10k genome gVCFs as input:
```python
hl.experimental.run_combiner(
    my_gvcf_list,
    ...
    key_by_locus_and_alleles=True,
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
    target_records=10000,
)
```
From the Dataproc cluster logs, I can see the Hail progress bar slowly advancing:
```
[Stage 0:===========> (56896 + 80) / 253428]
```
However, this job has been running for 115 hours so far, and being only ~22% complete (56896 of 253428 partitions) is slower than I expected.
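For reference, here is the back-of-envelope projection behind my concern. This is just a linear extrapolation from the progress bar above (partition runtimes are not necessarily uniform, so take it as a rough estimate):

```python
# Linear extrapolation from the Spark progress bar:
# 56896 of 253428 partitions finished after 115 hours.
completed = 56896
total = 253428
hours_elapsed = 115

fraction_done = completed / total                     # ~0.22
projected_total_hours = hours_elapsed / fraction_done
remaining_hours = projected_total_hours - hours_elapsed

print(f"{fraction_done:.1%} done; "
      f"projected total ~{projected_total_hours:.0f} h "
      f"(~{projected_total_hours / 24:.0f} days), "
      f"~{remaining_hours:.0f} h remaining")
```

At the current rate that works out to roughly three weeks of total runtime, which is what prompted this question.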
Is there a way to speed this job up by changing any parameters, or am I overestimating how fast this should be?