[Hail on Apache Spark] Using pyspark, py4j.protocol.Py4JError

Hi, I'm studying Hail and installing it on Spark.

I plan to run a GWAS on the 1000 Genomes data, so I installed and set up Hail on Spark.


Linux: CentOS 7.8
Python: 3.7.3 (Anaconda)
Apache Spark: spark-2.2.0-bin-hadoop2.6
Hadoop: hadoop-2.6.0
java -version (note: I'm using a Linux server run by a Korean institution, so I can't get root permissions):
openjdk version "1.8.0_262"
OpenJDK Runtime Environment (build 1.8.0_262-b10)
OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)
Hail version: 0.2.68

  1. Run start-master.sh and start-slaves.sh in the Spark sbin directory.
  2. Run pyspark from bash, relying on the spark-defaults.conf below (an equivalent explicit launch is sketched after this list).
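If the defaults file were not picked up, an equivalent explicit launch would look roughly like this (just a sketch built from my own configs below; paths come from my setup):

# equivalent explicit launch (sketch; relies on HAIL_HOME from my .bashrc below)
pyspark \
  --master spark://training.server:7077 \
  --jars "$HAIL_HOME/backend/hail-all-spark.jar" \
  --conf spark.driver.extraClassPath="$HAIL_HOME/backend/hail-all-spark.jar" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator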

When I launch pyspark, I get the py4j.protocol.Py4JError referenced in the title.

How can I set up Hail on Spark?
Do I need to change my Java version?

Thank you for your help.

My <.bashrc>, <spark/conf/spark-defaults.conf>, and <spark/conf/spark-env.sh> are below.

<.bashrc>

#SPARK
export SPARK_HOME=/home/edu1/tools/spark-2.2.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
export PYTHONPATH=$HAIL_HOME/python:$SPARK_HOME/python:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):$PYTHONPATH
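# NOTE: HAIL_HOME is used in PYTHONPATH above but only exported further below,
# so that part expands to empty unless HAIL_HOME was already set in the environment.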

# Hail
export HAIL_HOME=/home/edu1/miniconda2/envs/Hail-on-spark/lib/python3.7/site-packages/hail
export PATH=$PATH:$HAIL_HOME/bin
export PYTHONPATH=$PYTHONPATH:$HAIL_HOME/python
export SPARK_CLASSPATH=$HAIL_HOME/backend/hail-all-spark.jar

# JAVA (I can only modify .bashrc, so this may not change the system Java.)
export JAVA_HOME=/home/edu1/tools/jdk-1.8.0_231
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=$JAVA_HOME/lib/tools.jar

# Hadoop
export HADOOP_INSTALL=/home/edu1/tools/hadoop-2.6.0
export PATH=$PATH:$HADOOP_INSTALL/bin
export LD_LIBRARY_PATH=$HADOOP_INSTALL/lib/native

</spark/conf/spark-defaults.conf>

spark.master                     spark://training.server:7077

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator           is.hail.kryo.HailKryoRegistrator
spark.speculation                True

spark.driver.memory              37414m
spark.executor.memory            37414m
spark.executor.instances         1

spark.driver.extraClassPath      /home/edu1/miniconda2/envs/Hail-on-spark/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar
spark.executor.extraClassPath    /home/edu1/miniconda2/envs/Hail-on-spark/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar
spark.jars                       /home/edu1/miniconda2/envs/Hail-on-spark/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar

spark.eventLog.enabled           true
spark.history.fs.logDirectory    file:/tmp/spark-events
spark.eventLog.dir               file:/tmp/spark-events

spark.ui.reverseProxy            true
spark.ui.reverseProxyUrl         spark://training.server/spark
spark.executor.extraJavaOptions  -Dlog4j.debug=true

</spark/conf/spark-env.sh>

export SPARK_WORKER_INSTANCES=1

I think the problem here is that your standalone cluster's Spark version is not the version Hail is compiled against; pip-installed Hail ships with its own pyspark, which is newer than Spark 2.2.0. I'm assuming that you installed Hail with pip?
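A quick way to compare the two versions (a sketch, using the paths from your post):

# Spark version bundled with the pip-installed Hail (what Hail was compiled against)
python -c "import pyspark; print(pyspark.__version__)"
# Spark version of the standalone cluster started with start-master.sh
$SPARK_HOME/bin/spark-submit --version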

Are you actually running a cluster, or just using Spark to multithread on a single server? If you are running on a single server, local mode will be much easier: just pip install Hail and its pyspark dependency, unset all your Spark environment variables like SPARK_HOME, and it should work. A sketch follows.
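Roughly, the local-mode setup would look like this (a sketch, assuming a fresh conda environment):

# let the pip-installed pyspark run Spark locally; no standalone cluster needed
pip install -U hail                     # pulls in a compatible pyspark as a dependency
unset SPARK_HOME SPARK_CLASSPATH PYTHONPATH HAIL_HOME
python -c "import hail as hl; hl.init(); print(hl.utils.range_table(100).count())"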

Hi!
I did install Hail using pip.

Thank you for your help!