I am using Hail 0.2 to run VEP, and have added
"--fork", "6", to my vep config .json. I have not done a lot of testing with VEP by itself but my understanding is that the fork option is generally recommended on modern processors. I could not even tell you if running VEP through Hail is actually faster with --fork - it seems like it is? It doesn’t crash too often, at least But I have not had a lot of time/resources for performance tuning.
I am wondering if anyone here has experience doing this. The option does not seem to be widely used in the Hail community (based on the other VEP configs I've seen), though I haven't checked extensively. Unfortunately I am not being very careful about matching executor CPU count to the VEP fork count; is this something I should be more concerned about? My rough mental model is sketched below.
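To make the CPU question concrete: my (unverified) assumption is that each executor core can be running its own VEP invocation, so the per-container process count ends up roughly executor cores × forks. The numbers below are illustrative, and spark_conf on hl.init only matters if Hail is the one creating the SparkContext (on YARN I could equally pass these at submit time):

```python
import hail as hl

# Illustrative numbers, not a recommendation: with 6 cores per executor and
# --fork 6 in the VEP config, one executor could in principle be running
# 6 concurrent VEP invocations x 6 forked workers = ~36 VEP processes,
# which is the kind of oversubscription I'm worried about.
hl.init(spark_conf={
    'spark.executor.cores': '6',
    'spark.executor.instances': '20',
})
```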
It is also a bit opaque to me how this affects memory. I am using YARN, and have had to tweak quite a few settings to get --fork 6 to work reliably on larger datasets (oddly, --fork 6 works fine out of the box on a single-sample VCF with a small cluster). Specifically, I had to manually override the spark.executor.memoryOverhead default of 10% of executor memory so that YARN doesn't kill the container during VEP execution. My understanding is that VEP, being a native process outside the JVM heap, has to fit within that memoryOverhead allowance, so if VEP is forking multiple worker processes inside a container then raising the limit makes sense. 25% seems to work, but maybe that's too much; an example of what I'm doing is below.
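Concretely, the override looks roughly like this (sizes are illustrative; in practice I pass these as Spark properties when the job is submitted, but hl.init's spark_conf shows the same idea if Hail creates the SparkContext):

```python
import hail as hl

# Illustrative sizing: with 8g executors the default overhead would be
# max(384m, 10% of 8g) ~= 820m, which the forked VEP processes exceed.
# Raising it to roughly 25% of executor memory is what currently keeps
# YARN from killing containers for me.
hl.init(spark_conf={
    'spark.executor.memory': '8g',
    'spark.executor.memoryOverhead': '2g',  # VEP runs outside the JVM heap, in this slice
})
```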
Anyway, is there any general guidance about this?