Just FYI that this is low priority and we’re not blocked by anything:
We have had some headaches exporting heavily annotated VCFs from Hail to Elasticsearch: during the export, YARN kills a container for exceeding its memory limit, with a standard ERR_SPARK_FAILED_YARN_KILLED_MEMORY error and stack trace. Following Google/StackOverflow advice, raising spark.executor.memoryOverhead to something more like 15-20% of the spark.executor.memory value appears to solve the problem. I have been using 20% without any appreciable impact on the executors during the analysis tasks.
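For concreteness, here is a hypothetical sketch of what that change might look like in the Spark properties. The exact key depends on Spark version (spark.yarn.executor.memoryOverhead on Spark 2.2 and earlier, spark.executor.memoryOverhead from 2.3 onward), and the 5530m figure is simply 20% of the 27g executors described below:

```properties
# Hypothetical sketch, assuming 27g executors as described in this post.
spark.executor.memory=27g
# ~20% of 27g, instead of the default max(384m, 10% of executor memory):
spark.executor.memoryOverhead=5530m
```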
Very naively this makes some sense to me: for complex documents it seems possible to exceed the default 10% memoryOverhead that Spark reserves for memory used outside the executor heap, in this case presumably by the Elasticsearch connector. But I am not sure about any of this; all I did was Google and try out different Spark configurations. Changing the memory overhead makes me a bit nervous, so I just want to check that this is a reasonable change to the default cluster configuration for a workflow like this, and that there aren't any obvious negative impacts I'm missing.
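To make the arithmetic concrete, here is a rough sketch of the per-container memory budget under the default 10% overhead versus the raised 20%, assuming spark.executor.memory=27g as in the cluster details below (the max(384m, 10%) default is Spark's documented behavior; the rest is just arithmetic, not a claim about how YARN rounds the request):

```python
# Rough sketch of the YARN container memory budget for one executor,
# assuming spark.executor.memory=27g as in the cluster described here.
executor_memory_mb = 27 * 1024  # 27g heap = 27648 MB

# Spark's default overhead: the larger of 384 MB and 10% of executor memory.
default_overhead_mb = max(384, int(executor_memory_mb * 0.10))
# The raised overhead being tried here: 20% of executor memory.
raised_overhead_mb = int(executor_memory_mb * 0.20)

print(default_overhead_mb)                       # default reservation
print(raised_overhead_mb)                        # raised reservation
print(executor_memory_mb + raised_overhead_mb)   # total container request
```

The point of the raise is only the middle number: the off-heap room roughly doubles, at the cost of YARN needing to grant a correspondingly larger container per executor.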
This is Hail 0.2 on 150 cores, with executors at 27g of memory and 4 CPUs each. Before the export I persisted the tables; could that be a problem? During the analysis tasks I didn't notice "too many" unsorted-shuffle warnings (given that I am doing a lot of joins), ugly GC pauses, huge resource spikes, etc. Regardless, I would have expected that to be cleaned up during persistence.