Spark 2.4.4 gets stuck in initialization phase

Hi,

I am working on ‘RHEL Server release 7.7’. I installed the recommended ‘miniconda3’ and then created a Python 3.7 virtual environment:

conda create -n py37 python=3.7

Then I installed Hail using pip, as recommended:

pip install hail

and then PySpark and IPython:

conda install -c conda-forge pyspark
conda install -c anaconda ipython

I have also downloaded ‘Spark 2.4.4’ and started one master and one slave. Then I tried to run the basic script in various ways:

import hail as hl
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
mt.count()

But it just gets stuck on the second line, during Hail and Spark initialization, and never goes any further. No log output, no error.
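For reference, the same script with Hail initialized explicitly, so that the master URL and the Hail log location are pinned down, would look roughly like this (master and log are standard hl.init arguments; the URL and paths below are just placeholders from my setup):

import hail as hl

# Initialize Hail explicitly against the standalone master and write the Hail
# log to an easy-to-find location (both values below are examples).
hl.init(master='spark://ai-grisnodedev1:7077', log='/home/user/hail.log')

mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
print(mt.count())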

Setting $SPARK_HOME (or leaving it unset) does not fix it. I also set the path to the Hail JAR directly when running it with spark-submit, but the result is the same:

spark-submit --master spark://ai-grisnodedev1:7077 --verbose --conf spark.driver.port=40065 --driver-memory 4g --conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar --conf spark.executor.extraClassPath=./hail-all-spark.jar test_hail.py

Or

spark-submit --master spark://ai-grisnodedev1:7077 --verbose --conf spark.driver.port=40065 --driver-memory 4g --conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar --conf spark.executor.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar test_hail.py

test_hail.py just contains the three lines of sample code above.

Can you run Spark pipelines that don’t involve Hail?

I launched spark-shell, loaded a file from Hadoop, and counted the number of lines in it; that works fine. How else could I test it?
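A Hail-free test directly from PySpark (going through the same Python environment that Hail uses) could be something along these lines; the master URL is just the one from my setup:

from pyspark.sql import SparkSession

# Spark-only smoke test through PySpark, no Hail involved.
spark = SparkSession.builder \
    .master('spark://ai-grisnodedev1:7077') \
    .appName('pyspark-smoke-test') \
    .getOrCreate()
print(spark.sparkContext.parallelize(range(1000)).count())
spark.stop()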

OK, we figured it out. The issue was that we did not have much space left on the partition where the Hadoop and Spark logs were written. After redirecting the logs to a different location on the node, it started to work. There was also a ‘java -cp’ process triggered by the ‘spark-submit’ that was filling up the Hadoop log with the same error over and over again:

Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
2019-11-19 16:22:20,204 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 137.187.60.61:44398 Call#33705121 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /tmp/hail.wDXS6L3AD3Ta. Name node is in safe mode.

It is still somewhat unclear what exactly the issue was, but that seems to be the most relevant detail.
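For anyone who hits the same symptoms: when the NameNode's partition runs low on space, HDFS goes into safe mode, and while it is in safe mode Hail cannot create its temporary directory (the /tmp/hail.* path in the log above), so the job just appears to hang. Besides freeing space and moving the logs, Hail's scratch directory can also be pointed at a location with more room; a minimal sketch, assuming hl.init accepts a tmp_dir argument (as in recent 0.2 releases) and that the path below exists on your cluster:

import hail as hl

# Put Hail's temporary files on a filesystem with enough free space
# (the path below is only an example).
hl.init(master='spark://ai-grisnodedev1:7077', tmp_dir='hdfs:///user/me/hail-tmp')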

huh, weird! Glad you’re unblocked.