Unable to initialize Hail - PySpark - Py4J error

Hello Everyone,

I am trying to initialize Hail. Below is the code I executed in a Jupyter notebook. Can you help me with this error? I am trying to debug it as well.

import findspark
findspark.init()
import pyspark
import hail as hl
import os
from pathlib import Path
%env SPARK_HOME /opt/spark
%env HAIL_HOME /opt/hail/hail
hail_home = Path(os.getenv('HAIL_HOME'))
hail_jars = hail_home/'build'/'libs'/'hail-all-spark.jar'
conf = pyspark.SparkConf().setAll([
    ('spark.jars', str(hail_jars)),
    ('spark.driver.extraClassPath', str(hail_jars)),
    ('spark.executor.extraClassPath', './hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
    ('spark.driver.memory', '80g'),
    ('spark.executor.memory', '80g'),
    ('spark.local.dir', '/tmp,/data/volume03/spark')
])
sc = pyspark.SparkContext('local[*]', 'Hail', conf=conf)
hl.init(sc)  # error occurs on this line

The error message is shown below

Py4JError: An error occurred while calling z:is.hail.backend.spark.SparkBackend.apply. Trace:
py4j.Py4JException: Method apply([class org.apache.spark.SparkContext, class java.lang.String, null, class java.lang.String, class java.lang.Boolean, class java.lang.Integer, class java.lang.String, class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:276)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Following the suggestion from this post, if I run the line below

sc = pyspark.SparkContext()

I get another error message, shown below

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=Hail, master=local[*]) created by init at :1

So I referred to this SO post, but passing multiple arguments as shown below resulted in an error

sc = pyspark.SparkContext('local[*]', 'Hail', conf=conf)
hl.init(sc)

TypeError: getOrCreate() got multiple values for argument 'conf'

Can you help me fix this error?

Hi @Aks,

Sorry you’re having trouble. The pyspark.SparkContext() suggestion needs to be run from a brand-new Python session: exit out of any running Python session and start a new one.

It also looks like you may not have the correct version of Spark installed. What version of Spark do you have?
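To illustrate the fresh-session point, here is a minimal sketch (the exact behaviour depends on your Spark and Hail versions): if a context already exists in the session, reuse it with getOrCreate() instead of constructing a second one, and check which Spark the Python driver is actually using, since this kind of Py4J "Method apply(...) does not exist" error usually points at a version mismatch between pyspark and the Hail jar.

import pyspark

# Creating a SparkContext twice in one session raises
# "ValueError: Cannot run multiple SparkContexts at once",
# so either start a fresh session or reuse the existing context:
sc = pyspark.SparkContext.getOrCreate()
print(sc.master, sc.appName)

# The Spark version the Python driver is using; it must match the Spark
# version the hail-all-spark.jar was built against.
print(pyspark.__version__)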

@danking - Thanks for your response and time.

When I issue spark-submit --version, I get 2.4.1 (under the Spark ASCII-art banner shown in the terminal),

but when I type the command below

whereis spark-shell

/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell /home/abcd/.pyenv/shims/spark-shell2.cmd /home/abcd/.pyenv/shims/spark-shell.cmd /home/abcd/.pyenv/shims/spark-shell

Apologies, does this help? I am new to all of this installation and trying to learn.

  1. Can you also help me understand why it shows 2.4.4-bin-hadoop2.7 when the version is 2.4.1?

  2. Can you also let me know why it also shows contents from my virtual env, and what does that mean?

I believe you’re trying to install Hail on a Spark cluster. In that case, you should not pip install hail: doing so installs a toy version of Spark that interferes with your real Spark cluster.

  1. The toy version is probably 2.4.1, and it appears first in your path. Verify with pip show pyspark and which pyspark (see the sketch below).
  2. That toy version is what is showing up from your virtual env (the .pyenv/shims entries in your whereis output).
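As a quick check from inside Python (just a sketch; the /opt/spark-2.4.4-bin-hadoop2.7 path comes from your whereis output):

import pyspark

# If this prints a path inside your virtualenv (~/.pyenv/...), you are importing
# the pip-installed toy Spark rather than the real install under
# /opt/spark-2.4.4-bin-hadoop2.7.
print(pyspark.__file__)
print(pyspark.__version__)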

First, clean up your environment by removing hail and pyspark:

pip uninstall hail pyspark

Now try installing hail from source again.
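Once the pip copies are gone, here is a minimal sketch of pointing your notebook at the real Spark install (assuming it lives at /opt/spark-2.4.4-bin-hadoop2.7, as your whereis output suggests):

import findspark

# Point findspark at the real Spark distribution instead of whatever
# pip installed into the virtualenv.
findspark.init('/opt/spark-2.4.4-bin-hadoop2.7')

import pyspark
print(pyspark.__version__)  # should now report 2.4.4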
