We encountered an out-of-memory error on the Spark executors when running sample_qc() on a very big VCF. I have been experimenting with the Spark executor cores and memory settings and restarting Hail. My questions: 1) Is there a way to change those Spark executor settings on the fly? 2) What settings would you recommend for loading a big VCF with thousands of samples and running analyses like sample_qc and variant_qc on all samples?
Is your pipeline something like
vds = hc.import_vcf(...)
or is it something more complicated? How many executors are you using, and how much memory is allocated to each?
For the last failed run I used executor.cores=4 and executor.memory=32g.
Could you post the error message that propagated through to the driver?
One other thing to check: does the INFO output say something like
INFO: coerced sorted dataset
or does it say
INFO: ordering dataset with network shuffle?
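If it helps, here is a small sketch for scanning the log for those two messages. Reading from "hail.log" is an assumption; use whatever log path your Hail session is configured with.

```python
# Hedged sketch: classify an import run by the two INFO messages above.
def import_ordering(log_lines):
    for line in log_lines:
        if "coerced sorted dataset" in line:
            return "sorted"    # input order was usable, no shuffle needed
        if "ordering dataset with network shuffle" in line:
            return "shuffled"  # full network shuffle on import
    return "unknown"

# Example usage (log path is an assumption):
# with open("hail.log") as f:
#     print(import_ordering(f))
```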
Unfortunately I lost the session with the complete error message. I remember the final message complained that executor memory was over the limit and recommended boosting the executor memory overhead. We got around a similar error message by increasing executor memory, but mainly we want to know how your group sets things up for running VCFs with thousands of samples. Our cluster has 9 worker nodes and 3 edge nodes, with 128GB of memory allocated to each node.
Is it possible to dynamically boost executor memory?
Do you have a recommended default partition size (our current setup is 128MB)?
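For scale, here is a back-of-the-envelope on what a 128MB split size means for the number of partitions (and hence Spark tasks). This is plain arithmetic, not Hail-specific behavior:

```python
# Rough partition count for an input file at a given split size.
def n_partitions(file_size_bytes, split_size_mb=128):
    split_bytes = split_size_mb * 1024 * 1024
    return -(-file_size_bytes // split_bytes)  # ceiling division

# e.g. a hypothetical 200 GB VCF at the 128 MB default:
n_partitions(200 * 1024**3)  # -> 1600 partitions
```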
Everything looks fine with your setup. If the VCF is ordered unexpectedly and triggers a shuffle, that could be the source of the memory overhead.
Most local Hail users use Google Dataproc, which works very well for most non-shuffle operations*. They’re running on hundreds of thousands of samples here, so a few thousand should be no problem.
*Shuffles are bad on the cloud due to OOMs and node pre-emption. We’re thinking about possible fixes.
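On the "boost executor overhead" message: on YARN, Spark requests each container at executor memory plus spark.yarn.executor.memoryOverhead, which in Spark 1.x/2.x defaults to 10% of executor memory with a 384 MB floor. A sketch of that default sizing:

```python
# Default YARN executor memory overhead in Spark 1.x/2.x:
# 10% of executor memory, with a 384 MB floor.
def default_overhead_mb(executor_memory_mb):
    return max(384, int(executor_memory_mb * 0.10))

# 32g executors => containers of roughly 32g + ~3.2g overhead:
default_overhead_mb(32 * 1024)  # -> 3276 MB
```

If executors are being killed for exceeding container limits, raising spark.yarn.executor.memoryOverhead above this default is the usual fix.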
Can I do a vcf-sort to solve this problem? Do the chromosomes need to be in a specific order?
Also, we are running Hail on the development version due to the no-PL issue.
The devel version could totally be more memory-hungry right now. Did you try running 0.1 with import_vcf('file.vcf.bgz', store_gq=True)?
Thanks tpoterba! We will give import_vcf('file.vcf.bgz', store_gq=True) a try on 0.1.
Just in case: we are using hg19 as the reference, and chrM is the first contig. Could that possibly trigger heavy shuffling in Hail?
Yes, that definitely could cause a shuffle. The best way to see is to run the import and look for the log output.
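As a quick sanity check before re-importing, you could compare the header's contig order against the order you expect. The EXPECTED list below is an assumption (hg19-style "chr" names with chrM last); adjust it to match your reference:

```python
# Check whether a VCF header's contigs appear in an expected order.
# EXPECTED is an assumption: hg19 "chr" names, chrM last.
EXPECTED = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY", "chrM"]

def contigs_in_expected_order(contigs):
    rank = {c: i for i, c in enumerate(EXPECTED)}
    seen = [rank[c] for c in contigs if c in rank]
    return seen == sorted(seen)

# A chrM-first header fails the check:
contigs_in_expected_order(["chrM", "chr1", "chr2"])  # -> False
```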
Tim, what is a typical ratio of input size to shuffle write size that you see?
Hard to say – should be about 1:1.