is.hail.kryo.HailKryoRegistrator ClassNotFoundException

We are getting an exception when calling the hl.import_vcf() function. The hl.init() function completes without issue, which makes me think the Hail jar is loading properly. The import_vcf function is the first one we run that creates a job on the cluster. We are running this on an AWS EMR cluster with Hail (master branch) and Spark 2.2.1.

Exception from task attempt logs:
org.apache.spark.SparkException: Failed to register classes with Kryo
Caused by: java.lang.ClassNotFoundException: is.hail.kryo.HailKryoRegistrator

Relevant settings in spark-defaults.conf
spark.driver.extraClassPath …:./hail-all-spark.jar
spark.executor.extraClassPath …:./hail-all-spark.jar
spark.kryo.registrator is.hail.kryo.HailKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer

I suspect this problem is related to the Hail jar not being visible to the Spark workers. hl.init() will fail if the driver isn't configured correctly, but if the workers are misconfigured, I'd expect a java.lang.ClassNotFoundException in the first call that loads Hail classes on the worker machines.

I think you may need to pass --jars, or set the appropriate property in spark-defaults.conf, to get the jar shipped to the workers.
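For example, something like this in spark-defaults.conf (the path is just a guess at where the jar lives on your master node) should tell Spark to ship the jar to the driver and every executor:

spark.jars /home/hadoop/hail-all-spark.jar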

Hi @atebbe,

I'm sorry you're running into this issue! Since hl.init() executed successfully, I suspect the Hail jar is in the correct location on the driver node. However, the import_vcf function must actually communicate with the executors (worker nodes), and based on the error message I suspect hail-all-spark.jar is not in the current working directory of the Spark processes on your executors. If you are using spark-shell, are you also passing the --jars parameter? If you're not using spark-shell, what command are you using to start interacting with the cluster?
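For reference, that would look something like this (the jar path is just an example):

spark-shell --jars /home/hadoop/hail-all-spark.jar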

We are interacting with Hail using the Apache Toree PySpark kernel for Jupyter. The jar is on the namenode of the cluster in /home/hadoop. The following is the first cell of our notebooks:

sc.addFile('/home/hadoop/hail-all-spark.jar')
sc.addPyFile('/home/hadoop/hail-python.zip')
import hail as hl
hl.init(sc)

My assumption was that sc.addFile was adding the jar to HDFS. This worked fine with 0.1; this is our first attempt with 0.2.

Does sc.addJar have different semantics from sc.addFile? Maybe try that?
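For what it's worth, my understanding from the Spark docs (nothing Hail-specific) is that addFile only copies the file so each node can look it up; it never touches the JVM classpath, which is what the Kryo registrator needs. A minimal sketch of what addFile actually gives you:

from pyspark import SparkFiles

sc.addFile('/home/hadoop/hail-all-spark.jar')  # copies the file to every node for this job
SparkFiles.get('hail-all-spark.jar')           # absolute path to the local copy; not on the executor classpath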

I get an error that addJar does not exist

sc.addJar('/home/hadoop/hail-all-spark.jar')
sc.addPyFile('/home/hadoop/hail-python.zip')

Name: org.apache.toree.interpreter.broker.BrokerException
Message: Traceback (most recent call last):
File "/tmp/kernel-PySpark-583e9b04-451e-4331-9e04-300634d28644/pyspark_runner.py", line 194, in <module>
eval(compiled_code)
File "<string>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'addJar'

Ah, okay. addJar must not be exposed on the Python SparkContext. The syntax looks right, though…

Sorry for the long delay on my reply, @atebbe

Let's recall your Spark classpath settings:

spark.driver.extraClassPath …:./hail-all-spark.jar
spark.executor.extraClassPath …:./hail-all-spark.jar

These assert that the jar is located in the working directory of the driver process and of each executor process, respectively (note the leading ./). If you ssh to one of your executors and find the Spark job working directory (try looking in /var/run/spark/work), I suspect you will not find hail-all-spark.jar there. While you're at it, can you open a terminal from your Jupyter notebook and verify that hail-all-spark.jar is indeed in the working directory of your driver process?
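If it's easier than ssh-ing around, here's a quick diagnostic sketch (assuming your notebook's SparkContext is sc) that lists the working directory contents on each executor:

import os, socket

def ls_cwd(_):
    # report hostname, working directory, and its contents from wherever this partition runs
    return [(socket.gethostname(), os.getcwd(), sorted(os.listdir('.')))]

print(sc.parallelize(range(100), 10).mapPartitions(ls_cwd).collect())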

This StackOverflow post suggests that addFile is inappropriate for “runtime dependencies”.

So. Assuming the jar is indeed missing from the working directory of your executors, we need to figure out how to get it there.

First, try sc._jsc.addJar instead of sc.addFile.

If that fails, Apache Toree suggests using the %AddJar magic to add a jar.
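Concretely (untested on my end), your first cell would become something like:

# _jsc is the JavaSparkContext that PySpark wraps; its addJar ships the jar to executors as a task dependency
sc._jsc.addJar('/home/hadoop/hail-all-spark.jar')
sc.addPyFile('/home/hadoop/hail-python.zip')
import hail as hl
hl.init(sc)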

Thanks for following up on this. sc._jsc.addJar did the trick! My worker nodes don’t have /var/run/spark. I searched for the jar on the entire filesystem of the worker node and did not find it. Is it recommended to use _jsc?

Thanks,

Adam

I don’t know why they don’t expose it in Python! Clearly it’s necessary…