Spark tuning for Elasticsearch export?


Just FYI that this is low priority and we’re not blocked by anything:

We've had some headaches exporting heavily annotated VCFs from Hail to Elasticsearch: during the export, YARN kills a container for exceeding its memory limit, with a standard ERR_SPARK_FAILED_YARN_KILLED_MEMORY error and stack trace. Following Google/Stack Overflow advice, raising spark.executor.memoryOverhead to something more like 15-20% of the spark.executor.memory value appears to solve the problem. I've been using 20% without any appreciable impact on the executors during the analysis tasks.
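For concreteness, here is a small sketch of the arithmetic and the resulting conf, as one might pass to `hl.init(spark_conf=...)` in Hail 0.2 (or equivalently via `spark-submit --conf`). The 27g and 20% figures are just the values from this post; the helper name `overhead_mb` is mine, not a Spark or Hail API:

```python
# Sketch: compute a ~20% memory overhead for a 27g executor and build the
# Spark conf dict. Assumption: YARN's default overhead is
# max(384 MB, 10% of executor memory); this keeps the 384 MB floor but
# raises the fraction.

def overhead_mb(executor_memory_gb: float, fraction: float = 0.2) -> int:
    """Overhead in MB: the chosen fraction of executor memory, floored at 384."""
    return max(384, int(executor_memory_gb * 1024 * fraction))

spark_conf = {
    'spark.executor.memory': '27g',
    'spark.executor.memoryOverhead': f'{overhead_mb(27)}m',
}
print(spark_conf['spark.executor.memoryOverhead'])
```

With 27g executors this works out to roughly 5.4g of overhead instead of the ~2.7g default.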

Very naively this makes some sense to me: for complex documents it's possible to exceed the default 10% memoryOverhead that Spark reserves for off-heap and other JVM overhead, which here presumably includes the Elasticsearch connector. But I'm not sure about any of this; all I did was Google and try out different Spark configurations. Changing the memory overhead makes me a bit nervous, so I just want to check that this is a reasonable change to the default cluster configuration for a workflow like this, and that there aren't any obvious negative impacts I'm missing.

This is Hail 0.2 on 150 cores, with executors at 27g of memory and 4 CPUs. Before the export I persisted the tables; could that be a problem? During the analysis tasks I didn't notice "too many" unsorted shuffle warnings (given that I'm doing a lot of joins), ugly GC pauses, huge resource spikes, etc. Regardless, I'd have expected that to be cleaned up by persisting.

Pleased to report that setting spark.dynamicAllocation.enabled = false fixed the Elasticsearch slowness/crashing problem without my having to fiddle with other memory settings. It seems AWS EMR turns this on by default, and (along with who knows what other AWS oddness) it caused problems during the export. Thanks to @pavlos for the tip (Small MatrixTable hangs on write into Google bucket).
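For anyone hitting the same thing, the fix expressed as conf (again as one might pass to `hl.init(spark_conf=...)` or `spark-submit --conf`); note that with dynamic allocation off you must size the executor count yourself, and the `37` below is only a hypothetical figure (roughly 150 cores / 4 per executor), not something stated in this thread:

```python
# Sketch: pin dynamic allocation off so executors stay alive for the
# whole Elasticsearch export (EMR enables it by default).
spark_conf = {
    'spark.dynamicAllocation.enabled': 'false',
    # Hypothetical: with dynamic allocation disabled, fix the executor
    # count explicitly; ~150 cores at 4 cores/executor.
    'spark.executor.instances': '37',
}
```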
