Using --fork on VEP?

Hello,

I am using Hail 0.2 to run VEP, and have added "--fork", "6" to the command in my VEP config .json. I have not done a lot of testing with VEP by itself, but my understanding is that the fork option is generally recommended on modern processors. I could not even tell you whether running VEP through Hail is actually faster with --fork, though it seems like it is. It doesn’t crash too often, at least :) But I have not had a lot of time/resources for performance tuning.

I am wondering if anyone here has experience doing this. It does not seem to be used very widely in the Hail community (based on the other VEP configs I’ve seen) but I haven’t checked extensively. Unfortunately I am not being very careful about executor CPU count vs. VEP CPU count - is this something I should be more concerned about?

It is also a bit opaque to me how this affects memory considerations. I am using YARN, and have had to adjust quite a few settings to get --fork 6 to work reliably on larger datasets (oddly, --fork 6 works reliably out of the box on a single-sample VCF with a small cluster). Specifically, I had to manually override the spark.executor.memoryOverhead default of 10% of executor memory so that YARN doesn’t kill the container during VEP execution. I think VEP (which the JVM treats as a native process, outside its heap) has to fit entirely within that memoryOverhead, so if VEP is forking several workers within a container then raising the limit makes sense. 25% seems to work, but maybe that’s too much.
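For anyone wanting to reproduce the arithmetic, here is a rough Python sketch of how the overhead setting could be derived. It assumes Spark's documented behavior of using 10% of executor memory with a 384 MiB floor when no override is set; the 16 GiB executor size and the helper names are hypothetical, chosen only for illustration.

```python
MIN_OVERHEAD_MIB = 384  # Spark's documented minimum executor memory overhead


def executor_memory_overhead_mib(executor_memory_mib, fraction=0.10):
    """Overhead in MiB for a given executor memory size and overhead fraction."""
    return max(MIN_OVERHEAD_MIB, int(executor_memory_mib * fraction))


def overhead_conf(executor_memory_mib, fraction):
    """Spark property (as a dict) that overrides the default overhead."""
    mib = executor_memory_overhead_mib(executor_memory_mib, fraction)
    return {"spark.executor.memoryOverhead": f"{mib}m"}


if __name__ == "__main__":
    executor_mib = 16 * 1024  # hypothetical 16 GiB executor
    print(executor_memory_overhead_mib(executor_mib))  # default 10%
    print(overhead_conf(executor_mib, 0.25))           # the 25% override
```

The resulting property can be passed wherever the cluster's Spark configuration is set (for example via `--conf` at submit time). Whether 25% is the right number depends on how much memory each VEP fork actually uses, which this sketch does not measure.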

Anyway, is there any general guidance about this?

I would expect this to slightly degrade performance, actually – Hail runs VEP in parallel already, launching VEP processes on each worker CPU. Since it’s already parallelized (and all CPUs are already busy), there’s little benefit to using additional threads.
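To put rough numbers on that, here is a small Python sketch of the oversubscription. It assumes one VEP invocation per concurrently running task, one task slot per executor core (i.e. `spark.task.cpus` left at 1), and that `--fork N` yields roughly N busy worker processes per invocation; the function names and the 4-core executor are hypothetical.

```python
def vep_workers_per_executor(executor_cores, fork=1):
    """Approximate number of busy VEP worker processes on one executor."""
    # One VEP invocation per task slot, each running `fork` workers.
    return executor_cores * max(fork, 1)


def oversubscription(executor_cores, fork=1):
    """Ratio of busy VEP workers to available cores on one executor."""
    return vep_workers_per_executor(executor_cores, fork) / executor_cores


if __name__ == "__main__":
    cores = 4  # hypothetical executor size
    print(vep_workers_per_executor(cores, fork=6))
    print(oversubscription(cores, fork=6))
    print(oversubscription(cores))  # no --fork
```

Under these assumptions, a ratio above 1.0 means the extra forks compete for cores that are already busy, and each fork also adds to the memory that has to fit inside the container.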

Also, I hear you about the memory management pain. It’s not easy to configure this stuff.

Thanks! That makes sense - I thought maybe Hail was running one VEP process per executor but good to know that’s not the case. I will leave the option off by default.