Hail runtime issues on a Google Cloud Dataproc cluster

Hi!

I am running Hail on a Google Cloud Dataproc cluster and it seems to be taking a lot longer than I expected. Here are the Dataproc parameters I'm using:
--master-machine-type=n1-highmem-64
--master-boot-disk-size=10000
and the auto-scaling policy:

basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    gracefulDecommissionTimeout: 0s
    scaleDownFactor: 1.0
    scaleUpFactor: 1.0
secondaryWorkerConfig:
  maxInstances: 20000
  weight: 1
workerConfig:
  maxInstances: 10
  minInstances: 2
  weight: 1

The exact Hail command I'm running is run_combiner, with 10k genome gVCFs as input:

    hl.experimental.run_combiner(
        my_gvcf_list,
        ...
        key_by_locus_and_alleles=True,
        reference_genome='GRCh38',
        use_genome_default_intervals=True,
        target_records=10000
    )

From the Dataproc cluster logs, I can see the Hail progress bar advancing slowly:

[Stage 0:===========>                                     (56896 + 80) / 253428]

However, this job has been running for 115 hours so far and is only ~20% complete, which is slower than I expected.
Is there a way to speed it up by changing any parameters, or am I overestimating how fast it should run?

Hey @iwong,

How many secondary workers do you currently have? It appears from the progress bar that you only have 80 active cores. Are you using n1-standard-8s as primary workers? It seems likely that you've got exactly 10 n1-standard-8s for a total of 80 cores. I'm not familiar with the autoscaling policies, but you should be able to explicitly request more secondary workers, which should substantially improve wall-clock time.
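
For reference, here's a rough sketch of requesting secondary workers explicitly on an existing cluster with gcloud. The cluster name and region are placeholders, and this assumes you detach the attached autoscaling policy first so a manual resize takes effect:

    # Placeholder cluster name/region; detach the autoscaling policy so a manual count sticks.
    gcloud dataproc clusters update my-combiner-cluster \
        --region=us-central1 \
        --disable-autoscaling

    # Then pin the secondary (preemptible) worker count explicitly.
    gcloud dataproc clusters update my-combiner-cluster \
        --region=us-central1 \
        --num-secondary-workers=500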

I think I'm using n1-standard-8s (or at least n1-standards) as workers. The Dataproc page in the Google Cloud console showed 1,260 worker nodes active, so I assumed they were all actively computing. I'll try requesting more workers, thank you!

A couple of things:

  1. 20,000 secondary workers is way too many: Spark can't manage clusters that big. Realistically, the maximum you should ever use is a few thousand, but these days we're seeing cryptic networking failures above ~500 nodes, which we're digging into.

  2. If you see 1,260 worker nodes active but only 80 cores working on tasks, something may have gone wrong. Do you have a few minutes between 2 and 3pm today for me to walk you through accessing the Spark UI to look at cluster health metrics?
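
In the meantime, if the cluster was started with hailctl, one quick way to open the Spark UI is the connect subcommand, which sets up the SSH tunnel for you (the cluster name below is a placeholder):

    # Placeholder cluster name; opens an SSH tunnel to the master node and points a browser at the Spark UI.
    hailctl dataproc connect my-combiner-cluster spark-ui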

Thank you! Yes, I'm free during that time.