Tuning run_combiner() performance for Hail local mode

Hi Hail team,

I tried to run run_combiner() to joint-call my 366 WGS GVCFs, which are already split by chromosome. We cannot run cluster mode on our server, so I'm wondering how I can fine-tune the parameters to speed up the process.

The resources I requested are as below:

#!/bin/bash
#BSUB -n 24
#BSUB -R 'rusage[mem=256GB]'
...
python test_hail_jointCalling.py

And the Python script is as below:

import os
import glob
import time
import pyspark
import hail as hl

start_time = time.time()

# use two fewer worker threads than the cores requested from LSF
threads = int(os.environ['LSB_MAX_NUM_PROCESSORS']) - 2
hail_jars = "/opt/conda/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar"
conf = pyspark.SparkConf().setAll([
    ('spark.master', 'local[{}]'.format(threads)),
    ('spark.app.name', 'Hail'),
    ('spark.jars', str(hail_jars)),
    ('spark.driver.extraClassPath', str(hail_jars)),
    ('spark.executor.extraClassPath', './hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
    ])

### Using sc:
sc = pyspark.SparkContext(conf=conf)
hl.init(default_reference='GRCh38', sc=sc)

### Input:
project_folder = "/data/WGS_2021May"
splitGVCF_Folder = project_folder + "/testHail_JointCalling/chr5_gvcf"
inputs = glob.glob(splitGVCF_Folder+"/*_germline.g.vcf.bgz")
### Output:
output_folder = "/data/fup/test_pyspark_02"
temp_folder = output_folder + "/temp"
os.makedirs(temp_folder, mode=0o777, exist_ok=True)  # octal 0o777; decimal 777 would set odd permissions
output_file = output_folder + "/test_hail_run_combiner_chr5.mt"

hl.experimental.run_combiner(inputs, 
                             use_genome_default_intervals=True,
                             out_file=output_file, 
                             tmp_path=temp_folder,
                             overwrite=True,
                             reference_genome='GRCh38')

processing_time_in_seconds = time.time() - start_time
print("--- %s minute ---" % (processing_time_in_seconds/60))

hl.stop()

And here are some Spark UI screenshots:

Is there anything I can try, such as adding more driver memory or cores?
The reason I ask is that I found the max memory usage is only 2.6GB. I requested 256GB of memory, but it doesn't seem to be used. So maybe there is a way to fine-tune the memory config, or simply to process more data at the same time, to speed things up. Otherwise: joint-calling already took 45+ hours, while GLnexus took only 5 hours on the same dataset on the same server.
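
For reference, here is a quick sketch of how the effective settings can be checked from the running context (using the sc created above; the 1g default is from the Spark docs):

# Sketch: print the memory settings the running SparkContext is using.
# If spark.driver.memory was never set, the local-mode JVM heap falls
# back to Spark's default of 1g, which could explain the low usage.
for key in ('spark.driver.memory', 'spark.executor.memory'):
    print(key, '=', sc.getConf().get(key, 'unset'))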

Thanks!
Po-Ying

Spark requires a special environment variable to configure JVM memory. See this post:

I hope this helps! The executor summary certainly indicates there might be some GC thrashing happening.
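
For reference, a minimal sketch of that approach via PYSPARK_SUBMIT_ARGS (I'm assuming this is the variable in question; PySpark reads it when it launches the JVM, so it must be set before the SparkContext is created, and the string has to end with 'pyspark-shell'; the 10g value is just an example, not a recommendation):

import os

# Read by PySpark when it launches the JVM; set it before creating
# the SparkContext. The trailing 'pyspark-shell' token is required.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 10g pyspark-shell'

import pyspark
import hail as hl

sc = pyspark.SparkContext(master='local[22]', appName='Hail')
hl.init(default_reference='GRCh38', sc=sc)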


Hi @tpoterba

Thanks for helping!
When I added driver.memory and executor.memory to the conf, the running time dropped significantly, from almost 60 hours to 3.25 hours.

Here are the changes I made:

conf = pyspark.SparkConf().setAll([
    ('spark.master', 'local[{}]'.format(threads)),
    ('spark.app.name', 'Hail'),
    ('spark.jars', str(hail_jars)),
    ('spark.driver.extraClassPath', str(hail_jars)),
    ('spark.executor.extraClassPath', './hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
    ### https://discuss.hail.is/t/turning-run-combiner-performance-for-hail-local-mode/2318
    ('spark.driver.memory', '10g'),
    ('spark.executor.memory', '10g'),
    ])
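
If I understand correctly, in Spark local mode the executors run inside the driver JVM, so spark.driver.memory is probably the setting doing the real work here; setting spark.executor.memory as well shouldn't hurt, though.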

Really appreciate your help!

Best,
Po-Ying