Hi Hail team,
I tried to run run_combiner() to joint-call my 366 WGS gVCFs, which I have already subset by chromosome. We cannot run cluster mode on our server, so I'm wondering how I can fine-tune the parameters to speed up the process in local mode.
The resources I requested are as follows:
#!/bin/bash
#BSUB -n 24
#BSUB -R 'rusage[mem=256GB]'
...
python test_hail_jointCalling.py
And the Python script is as follows:
import os
import glob
import time
import pyspark
import hail as hl
start_time = time.time()
threads = int(os.environ['LSB_MAX_NUM_PROCESSORS']) - 2
hail_jars = "/opt/conda/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar"
conf = pyspark.SparkConf().setAll([
('spark.master', 'local[{}]'.format(threads)),
('spark.app.name', 'Hail'),
('spark.jars', str(hail_jars)),
('spark.driver.extraClassPath', str(hail_jars)),
('spark.executor.extraClassPath', './hail-all-spark.jar'),
('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
])
### Using sc:
sc = pyspark.SparkContext(conf=conf)
hl.init(default_reference='GRCh38', sc=sc)
### Input:
project_folder = "/data/WGS_2021May"
splitGVCF_Folder = project_folder + "/testHail_JointCalling/chr5_gvcf"
inputs = glob.glob(splitGVCF_Folder+"/*_germline.g.vcf.bgz")
### Output:
output_folder = "/data/fup/test_pyspark_02"
temp_folder = output_folder + "/temp"
os.makedirs(temp_folder, mode=0o777, exist_ok=True)  # 0o777 (octal), not 777
output_file = output_folder + "/test_hail_run_combiner_chr5.mt"
hl.experimental.run_combiner(inputs,
                             use_genome_default_intervals=True,
                             out_file=output_file,
                             tmp_path=temp_folder,
                             overwrite=True,
                             reference_genome='GRCh38')
processing_time_in_seconds = time.time() - start_time
print("--- %s minute ---" % (processing_time_in_seconds/60))
hl.stop()
And here are some Spark UI screenshots.
Is there anything I can try, such as adding more driver memory or cores? I ask because the maximum memory usage I observed was only 2.6GB. I requested 256GB, but it seems to go largely unused, so maybe there is a way to fine-tune the memory configuration, or simply to process more data at the same time. As it stands, joint-calling has already taken 45+ hours, while GLnexus finished the same dataset in about 5 hours on the same server.
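To make the question concrete, here is the kind of change I was imagining, assuming `spark.driver.memory` is respected in local mode. The exact values (`200g`, the `-2` thread headroom) are guesses on my part, not recommendations:

```python
import os

# Sketch of extra settings I could append to the SparkConf list above.
# In local mode the driver JVM does all the work, so I suspect its
# default heap is the ~2.6GB ceiling I'm seeing. '200g' is just an
# example value below my 256GB request; setting
# 'spark.driver.maxResultSize' to '0' removes the result-size cap.
ncpu = int(os.environ.get('LSB_MAX_NUM_PROCESSORS', '24'))
threads = max(ncpu - 2, 1)

extra_conf = [
    ('spark.master', 'local[{}]'.format(threads)),
    ('spark.driver.memory', '200g'),
    ('spark.driver.maxResultSize', '0'),
]
print(extra_conf)
```

Would something along these lines be the right direction, or is there a combiner-level knob I should be looking at instead?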
Thanks!
Po-Ying