Our users run Hail on our HPC MPI queues.
We have several MPI queues: one with 64 cores per node (512 GB of RAM per node), and another with 28 cores per node (256 GB of RAM per node). Sometimes, when our users submit their jobs, they get “OutOfMemoryError: Java heap space” errors. I am trying to put together some guidance on how to set the following Spark parameters to avoid this memory problem:
Unfortunately there’s no one-size-fits-all setting for these parameters; tuning usually has to happen on a case-by-case basis. We’ve configured a set of defaults for Google Dataproc (link below) which you could use as a general pointer, by comparing the specifications of your HPC nodes against Google’s instance types.
Generally speaking, if your users are doing lots of joins or explodes, more worker memory helps. If their pipelines are very long, or there are lots of partitions or columns in their data, more executor memory may be beneficial.
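As a starting point for your two queues, here is a rough sketch of how you might size executor memory from a node's specs. The 10% overhead hold-back and the one-executor-per-node layout are illustrative assumptions on my part, not Hail or Spark defaults:

```python
# Rough sizing helper for Spark executor memory on fixed-size HPC nodes.
# The overhead fraction (memory reserved for the OS and off-heap use) is
# an illustrative assumption; tune it for your site.

def executor_memory_gb(node_ram_gb, executors_per_node, overhead_fraction=0.1):
    """Split a node's RAM across its executors, holding back some overhead."""
    usable = node_ram_gb * (1 - overhead_fraction)
    return int(usable // executors_per_node)

# 64-core / 512 GB nodes, one executor per node:
conf_512 = {
    "spark.executor.cores": "64",
    "spark.executor.memory": f"{executor_memory_gb(512, 1)}g",
}

# 28-core / 256 GB nodes, one executor per node:
conf_256 = {
    "spark.executor.cores": "28",
    "spark.executor.memory": f"{executor_memory_gb(256, 1)}g",
}
```

A dict like this can be passed to Hail at startup via `hl.init(spark_conf=...)` in recent versions, or set in `spark-defaults.conf` / on the `spark-submit` command line.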
After doing a little digging, min_block_size is an alias for the Spark configuration parameter spark.hadoop.mapreduce.input.fileinputformat.split.minsize, which sets the minimum split size Hadoop input formats use for reads. By default, Hail uses 1 MB blocks, so unless you’ve configured this to something very large, or users are processing thousands of joins in the same pipeline, I wouldn’t have thought it would affect memory consumption too much.
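To make the effect concrete, here is a back-of-the-envelope sketch of how the minimum split size caps the number of input partitions a file can produce. The 1 MB figure is the default mentioned above; the 10 GB file size and the 128 MB alternative are made-up illustrations, and real split planning also considers HDFS block size and the maximum split size:

```python
import math

MB = 1024 * 1024

def max_partitions(file_size_bytes, min_split_bytes):
    """Upper bound on input partitions when each split must be at least
    min_split_bytes. (A simplification: only the lower-bound effect of
    the minimum split size is modelled here.)"""
    return max(1, math.ceil(file_size_bytes / min_split_bytes))

ten_gb = 10 * 1024 * MB
print(max_partitions(ten_gb, 1 * MB))    # 1 MB minimum -> up to 10240 partitions
print(max_partitions(ten_gb, 128 * MB))  # 128 MB minimum -> at most 80 partitions
```

So a larger min_block_size mostly means fewer, larger partitions; each task then holds more data at once, which is why it only starts to matter for memory at extreme settings.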