Memory issue in Hail

jobaba · September 20, 2017, 4:03pm

We encountered an issue of running out of spark executor memory when running sample_qc() from a very big VCF file. I have been playing with changing spark executor cores and memory setting and restarting hail. My question is: 1)Is there a way to change those spark executor settings on the fly? 2) what settings would you recommend to use to load big VCF of thousands of samples and run analysis like sample_qc, variant_qc for all samples?

Many thanks!

tpoterba · September 20, 2017, 4:16pm

Hi Wendy,
Is your pipeline something like

vds = hc.import_vcf(...)\
  .sample_qc()

or is it something more complicated? How many executors are you using, and how much memory is allocated to each?

jobaba · September 20, 2017, 4:22pm

yes.
vds=hc.import_vcf(…)
vds=vds.sample_qc()

The last failed one I used executor.cores=4, executor.memory=32g

tpoterba · September 20, 2017, 5:40pm

Could you post the error message that propagated through to the driver?

tpoterba · September 20, 2017, 5:41pm

One other thing to check - does the info output say something like INFO: coerced sorted dataset or does it say INFO: ordering dataset with network shuffle?

jobaba · September 20, 2017, 7:56pm

Unfortunately I lost the session of the complete error message. I remember the final message complains executor memory over limit and recommend boosting of executor overhead message. We got around similar error message with increasing executor memory but mainly want to know how your group set up for running VCF with thousands of samples. Our cluster has 9 working nodes, 3 edge nodes with 128GB memory allocated to each node.

is it possible to dynamically boost executor memory?
do you have a recommended default partition size (out current set up is 128MB).

tpoterba · September 20, 2017, 8:00pm

Everything looks perfectly fine with your setup. I think if the VCF is ordered unexpectedly and triggering a shuffle, that could be the source of the memory overhead.

Most local Hail users use Google Dataproc, which works very well for most non-shuffle operations*. They’re running on hundreds of thousands of samples here, so a few thousand should be no problem.

*Shuffles are bad on the cloud due to OOMs and node pre-emption. We’re thinking about possible fixes.

jobaba · September 20, 2017, 8:03pm

Thanks tpoterba!
Can I do a vcf-sort to solve this problem? Do the chromosomes need to be in specific order?
Also, we are running Hail on the developmental version due to the no PL issue

tpoterba · September 20, 2017, 9:04pm

The devel version could totally be more memory-hungry right now. Did you try running 0.1 with import_vcf('file.vcf.bgz', store_gq=True)?

jobaba · September 20, 2017, 9:49pm

Thanks tpoterba! We will give import_vcf(‘file.vcf.bgz’, store_gq=True) a try on 0.1.

Just in case, we are using hg19 as the reference and chrM is the first contig, could it possibly trigger hail do heavy shuffling?

tpoterba · September 20, 2017, 10:31pm

Yes, that definitely could cause a shuffle. The best way to see is to run the import and look for the log output.

jerryliu2005 · September 20, 2017, 10:47pm

Tim - What is a typical input size/ shuffle write size ratio you see?

tpoterba · September 20, 2017, 11:21pm

Hard to say – should be about 1:1.

Topic		Replies	Views
Large scale ingest Hail Query & hailctl	7	693	April 8, 2019
Memory and disk space requirements Hail Query & hailctl	8	660	October 10, 2022
Questions about optimizing Hail and Spark configs and estimating resources and runtimes Hail Query & hailctl	3	1099	December 1, 2022
Spark tuning for Elasticsearch export? Hail Query & hailctl	1	634	February 14, 2020
“sparkContext was shut down” while running hail/pyspark on a large dataset Help [0.1]	3	2720	August 7, 2020

Memory issue in Hail

Related topics