We encountered an issue with running out of Spark executor memory when running sample_qc() on a very large VCF. I have been playing with the Spark executor cores and memory settings and restarting Hail. My questions are: 1) Is there a way to change those Spark executor settings on the fly? 2) What settings would you recommend for loading a big VCF with thousands of samples and running analyses like sample_qc and variant_qc across all samples?
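For context, here is a minimal sketch of how executor settings are typically supplied before Hail starts; executor memory and cores are fixed when the SparkContext is created, so they cannot be changed on the fly. The property values, the 0.2-style hl.init(sc=...) call, and the sizes below are illustrative assumptions, not the configuration used in this thread.

```python
# Illustrative sketch only: executor settings must be set before the SparkContext
# (and therefore Hail) starts; values here are placeholders.
from pyspark import SparkConf, SparkContext
import hail as hl

conf = (SparkConf()
        .set('spark.executor.memory', '16g')         # executor heap
        .set('spark.executor.memoryOverhead', '4g')  # off-heap overhead (spark.yarn.executor.memoryOverhead on older Spark/YARN)
        .set('spark.executor.cores', '4'))

sc = SparkContext(conf=conf)
hl.init(sc=sc)
```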
One other thing to check: does the INFO output say something like "INFO: coerced sorted dataset", or does it say "INFO: ordering dataset with network shuffle"?
Unfortunately I lost the session with the complete error message. I remember the final message complained that executor memory was over the limit and recommended boosting the executor memory overhead. We got around a similar error by increasing executor memory, but I mainly want to know how your group is set up for running VCFs with thousands of samples. Our cluster has 9 worker nodes and 3 edge nodes, with 128GB of memory allocated to each node.
Is it possible to dynamically boost executor memory?
Do you have a recommended default partition size? (Our current setup is 128MB.)
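As an aside, partition count can also be nudged at import time rather than only through the block-size setting. A rough sketch assuming Hail 0.2's import_vcf; the path and the partition count are placeholders, not values from this thread:

```python
import hail as hl

# Placeholder path; min_partitions requests at least this many (smaller) partitions,
# which keeps per-task memory modest. 2000 is an arbitrary illustrative number.
mt = hl.import_vcf('/path/to/cohort.vcf.bgz', min_partitions=2000)
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)
```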
Everything looks perfectly fine with your setup. I think that if the VCF is ordered unexpectedly and is triggering a shuffle, that could be the source of the memory overhead.
Most local Hail users use Google Dataproc, which works very well for most non-shuffle operations*. They’re running on hundreds of thousands of samples here, so a few thousand should be no problem.
*Shuffles are bad on the cloud due to OOMs and node pre-emption. We’re thinking about possible fixes.
Thanks tpoterba!
Can I do a vcf-sort to solve this problem? Do the chromosomes need to be in a specific order?
Also, we are running the development version of Hail due to the missing-PL issue.