I am running Hail on an HPC cluster (using 4 nodes on the MPI queue).
I am setting tmp_dir within the hl.init() call:
hl.init(log='katia.log',
        tmp_dir='/my/path/tmp',
        spark_conf={
            'spark.local.dir': '/my/path/tmp',
            'spark.executor.extraJavaOptions': '-Djava.io.tmpdir=/my/path/tmp',
            'spark.driver.extraJavaOptions': '-Djava.io.tmpdir=/my/path/tmp'
        },
        default_reference='GRCh38')
and I set the Java and Spark environment variables in my submission script.
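The variables in question look roughly like this (shown here as Python os.environ assignments for illustration; in the actual job they are export lines in the submission script, set before the JVM starts, and the exact set below is illustrative rather than a verbatim copy):

import os

# Illustrative sketch of the environment the submission script exports
# (the real job uses shell `export` lines; these are the usual variables
# for redirecting temp space, not a verbatim copy of my script).
os.environ["TMPDIR"] = "/my/path/tmp"            # POSIX temp directory
os.environ["SPARK_LOCAL_DIRS"] = "/my/path/tmp"  # Spark scratch space per node
os.environ["JAVA_TOOL_OPTIONS"] = "-Djava.io.tmpdir=/my/path/tmp"  # JVM temp dir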
I further check that they are indeed set up correctly within my Python script:
# Get `java.io.tmpdir` from the JVM system properties
java_tmp_dir = spark.sparkContext._jvm.java.lang.System.getProperty("java.io.tmpdir")
# Get `spark.local.dir` from Spark configuration
spark_local_dir = spark.conf.get("spark.local.dir", "Not Set")
# Print the values
print("=============================================")
print(f"java.io.tmpdir: {java_tmp_dir}")
print(f"spark.local.dir: {spark_local_dir}")
print("=============================================")
And the output is correct (pointing to my local tmp directory).
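One can also ask the executors themselves what they see. Here is a minimal sketch (it assumes `spark` is the SparkSession that hl.init() created; it reports the Python workers' environment rather than the executor JVM's java.io.tmpdir, but it shows whether the variables reached the worker nodes):

sc = spark.sparkContext

def worker_tmp_info(_):
    # Runs on each executor's Python worker: report the Spark scratch
    # variable and the effective temp dir as that worker sees them.
    import os
    import tempfile
    return (os.environ.get("SPARK_LOCAL_DIRS", "not set"), tempfile.gettempdir())

seen = (sc.parallelize(range(sc.defaultParallelism))
          .map(worker_tmp_info)
          .distinct()
          .collect())
print("Executor-side view:", seen)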
I also see that when the first part of my script executes, it does indeed use my local tmp directory, and many Spark files and folders are written there.
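As a rough confirmation, a sketch like the following counts the Spark scratch entries that appear under the tmp path while the job runs ("spark-" and "blockmgr-" are the prefixes Spark uses for its scratch directories):

import os

# Rough confirmation (sketch): count Spark scratch entries under my tmp dir.
entries = [e for e in os.listdir("/my/path/tmp")
           if e.startswith(("spark-", "blockmgr-"))]
print(f"{len(entries)} Spark scratch entries under /my/path/tmp")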
However, when I call the hl.export_plink() function, it uses the /tmp directory instead, and the program aborts with a “no space left on device” error message:
Traceback (most recent call last):
  File "/my/path/hail_code.py", line 112, in <module>
    hl.export_plink(mt, plinkout, fam_id=mt.s, ind_id=mt.s)
  File "<decorator-gen-1336>", line 2, in export_plink
  File "/install/path/hail/0.2.97/install/lib/python3.7/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/install/path/hail/0.2.97/install/lib/python3.7/site-packages/hail/methods/impex.py", line 387, in export_plink
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/install/path/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 104, in execute
    self._handle_fatal_error_from_backend(e, ir)
  File "/install/path/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
    raise err
  File "/install/path/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec)
  File "/install/path/spark/3.1.2/install/spark-3.1.2-bin-scc-spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/install/path/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: FileSystemException: /tmp/spark-59f4e44c-746a-41a2-a098-0207f20e3a0b/executor-7c9f0afd-d09f-4ad0-ac1d-660de2c582d4/blockmgr-a23f8f56-32e1-41ed-bff4-899e4e472106/19: No space left on device
When I look at what is stored in the /tmp directory, I see very large files that look like this:
Path:
/tmp/spark-f0b19bb2-9fbc-4f9c-928b-301113ea2798/executor-2ef64b92-6a2c-4aaa-a2a9-0f24f249935d/spark-39681369-6804-45cc-933d-e53fc620665e
-rw-r--r-- 1 ktrn scv 153031135 Apr 26 20:13 -13925546521745712777804_cache
-rw-r--r-- 1 ktrn scv 0 Apr 26 20:13 -13925546521745712777804_lock
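To confirm that /tmp itself is the partition being exhausted while the scratch path still has room, a quick check (sketch) compares free space on both:

import shutil

# Quick check (sketch): compare free space on /tmp and on my scratch path.
for path in ("/tmp", "/my/path/tmp"):
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")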
What can I do to fix this problem? I have tried several versions of Hail, and the problem exists in all of them.
Thank you