Hail runtime issues on a Google Cloud Dataproc cluster

Hi!

I am running Hail on a Google Cloud Dataproc cluster and it seems to be taking a lot longer than I expected. Here are the Dataproc parameters I'm using:
--master-machine-type=n1-highmem-64
--master-boot-disk-size=10000
and the auto-scaling policy:

basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    gracefulDecommissionTimeout: 0s
    scaleDownFactor: 1.0
    scaleUpFactor: 1.0
secondaryWorkerConfig:
  maxInstances: 20000
  weight: 1
workerConfig:
  maxInstances: 10
  minInstances: 2
  weight: 1

The exact Hail command I'm running is run_combiner, with 10k genome gVCFs as input:

    hl.experimental.run_combiner(
        my_gvcf_list,
        ...
        key_by_locus_and_alleles=True,
        reference_genome='GRCh38',
        use_genome_default_intervals=True,
        target_records=10000
    )

From the Dataproc cluster logs, I can see the Hail progress bar advancing slowly:

[Stage 0:===========>                                     (56896 + 80) / 253428]

However, this job has been running for 115 hours so far and is only ~20% complete, which is slower than I expected.
Is there a way to speed it up by changing any parameters, or am I overestimating how fast it should run?

Hey @iwong,

How many secondary workers do you currently have? It appears from the progress bar that you only have 80 active cores. Are you using n1-standard-8s as primary workers? It seems likely that you've got exactly 10 n1-standard-8s for a total of 80 cores. I'm not familiar with the autoscaling policies, but you should be able to explicitly request more secondary workers, which should substantially improve wall-clock time.
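
For reference, here's a rough sketch of requesting secondary workers explicitly on an existing cluster with gcloud. The cluster name and region are placeholders, and this assumes you detach the attached autoscaling policy first so a manual resize takes effect:

    # Placeholder cluster name/region; detach the autoscaling policy so a manual count sticks.
    gcloud dataproc clusters update my-combiner-cluster \
        --region=us-central1 \
        --disable-autoscaling

    # Then pin the secondary (preemptible) worker count explicitly.
    gcloud dataproc clusters update my-combiner-cluster \
        --region=us-central1 \
        --num-secondary-workers=500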

I think I'm using n1-standard-8s (or at least n1-standards) as workers. The Dataproc page in the Google Cloud console showed 1,260 worker nodes active, so I assumed they were all actively computing. I'll try requesting more workers, thank you!

A couple of things:

  1. 20,000 secondary workers is way too many: Spark can't manage clusters that big. Realistically, the maximum you should ever use is a few thousand, but these days we're seeing cryptic networking failures above ~500 nodes, which we're digging into.

  2. If you see 1,260 worker nodes active but only 80 cores working on tasks, something may have gone wrong. Do you have a few minutes between 2 and 3pm today for me to walk you through accessing the Spark UI to look at cluster health metrics?
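
In the meantime, if the cluster was started with hailctl, one quick way to open the Spark UI is the connect subcommand, which sets up the SSH tunnel for you (the cluster name below is a placeholder):

    # Placeholder cluster name; opens an SSH tunnel to the master node and points a browser at the Spark UI.
    hailctl dataproc connect my-combiner-cluster spark-ui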

Thank you! Yes, I'm free during that time.