Dear Hail team,
I have a very large data set that I import from VCF to a VDS and write out to a Google bucket.
I noticed that the example docs recommend the n1-highmem-8 machine type (or the 32/96 variants).
However, I still hit memory/GC issues even when I use n1-highmem-96 machines and enable off-heap memory. As I increase machine memory and off-heap memory (for both the driver and the executors), more tasks do finish, but the job still either fails in the last stage or gets stuck on certain tasks in that final stage. The failure happens while writing the data.
For an extremely large data set like this, would you recommend switching to m1-ultramem-40 (or m1-ultramem-80) machines, or should I instead attach larger local disks to the workers? From the Dataproc logs I can see that a shuffle occurs in the last stage. I have also heard suggestions to simply add more workers.
As for GC tuning, the thread stack size (-Xss) is currently 4 MB; should I adjust that as well?
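For context, here is roughly how I start the cluster (a sketch, not my exact command; the cluster name, worker count, and property values below are placeholders):

```shell
# Sketch of my cluster setup; names and sizes are placeholders.
# The "spark:" prefix tells Dataproc these are Spark properties.
hailctl dataproc start my-cluster \
    --master-machine-type n1-highmem-96 \
    --worker-machine-type n1-highmem-96 \
    --num-workers 10 \
    --properties "spark:spark.memory.offHeap.enabled=true,\
spark:spark.memory.offHeap.size=64g,\
spark:spark.driver.extraJavaOptions=-Xss4m,\
spark:spark.executor.extraJavaOptions=-Xss4m"
```

If any of these settings look wrong for this workload, please let me know.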
Thanks a lot for your time; I appreciate any suggestions.