We are getting an exception when calling the hl.import_vcf() function. The hl.init() function completes without issue, which makes me think the hail jar is loading properly. The import_vcf function is the first one we are running that creates a job on the cluster. We are running this on an AWS EMR cluster with hail (master branch), and spark 2.2.1.
Exception from task attempt logs:
org.apache.spark.SparkException: Failed to register classes with Kryo
Caused by: java.lang.ClassNotFoundException: is.hail.kryo.HailKryoRegistrator
Relevant settings in spark-defaults.conf:
I suspect that this problem is related to the hail jar not being properly visible on Spark workers.
hl.init() will fail if the driver isn’t configured correctly, but if the workers are misconfigured, I’d expect a java.lang.ClassNotFoundException from the first operation that loads Hail classes on the worker machines.
I think maybe you need to pass --jars or the appropriate config in the spark-defaults to get the jar to ship correctly to the workers.
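As a sketch, the spark-defaults.conf entries that ship the jar and put it on both classpaths might look like the following (the /home/hadoop path is taken from the jar location mentioned later in this thread; adjust to your layout):

```
spark.jars                      /home/hadoop/hail-all-spark.jar
spark.driver.extraClassPath     ./hail-all-spark.jar
spark.executor.extraClassPath   ./hail-all-spark.jar
```

spark.jars copies the jar into each executor’s working directory, which is what makes the relative ./hail-all-spark.jar classpath entries resolve there.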
I’m sorry you’re running into this issue! Since hl.init() executed successfully, I suspect the Hail jar is in the right location on the driver node. However, import_vcf must actually communicate with the executors (worker nodes). Based on the error message, I suspect hail-all-spark.jar is not in the current working directory of the Spark processes on your executors. If you are using spark-shell, are you also passing the --jars parameter? If you’re not using spark-shell, what command are you using to start interacting with the cluster?
We are interacting with hail using the Apache Toree - Pyspark kernel for Jupyter. The jar is on the namenode of the cluster in /home/hadoop. The following is the first cell of our notebooks:
import hail as hl
My assumption was that sc.addFile was adding the jar to HDFS. This worked fine with 0.1; this is our first attempt with 0.2.
Does sc.addJar have different semantics from sc.addFile? Maybe try that?
I get an error that addJar does not exist
Message: Traceback (most recent call last):
  File "/tmp/kernel-PySpark-583e9b04-451e-4331-9e04-300634d28644/pyspark_runner.py", line 194, in
  File "", line 1, in
AttributeError: 'SparkContext' object has no attribute 'addJar'
Ah, okay. addJar is a method on the Scala SparkContext that was never exposed in PySpark. This syntax looks right…
Sorry for the long delay on my reply, @atebbe
Let’s recall your Spark classpath settings:
These assert that, on both the driver and the executors, the jar is located in the working directory of the Spark process. If you ssh to one of your executors and find the Spark job working directory (try looking in /var/run/spark/work), I suspect you will not find hail-all-spark.jar there. While you’re at it, can you open a terminal in your Jupyter notebook and verify that hail-all-spark.jar is indeed in the working directory of the driver?
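One way to check, once you’ve sshed into a worker, is to search the Spark work directory for the jar (the /var/run/spark/work location is an assumption; on EMR the work directory may live elsewhere, e.g. under /mnt):

```shell
# Look for the jar anywhere under the Spark work directory.
# Suppress permission errors; exit cleanly even if the path is absent.
find /var/run/spark/work -name 'hail-all-spark.jar' 2>/dev/null || true
```

No output means the jar never made it into any job’s working directory on that node.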
This StackOverflow post suggests that addFile is inappropriate for “runtime dependencies”.
So, assuming the jar is indeed missing from the working directory of your executors, we need to figure out how to get it there. Try sc._jsc.addJar instead of sc.addFile. If that fails, Apache Toree suggests using the %AddJar magic invocation to add a jar.
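As a hedged sketch, the workaround could be wrapped in a small helper (the function name add_jar is my own invention, not part of Hail or PySpark):

```python
def add_jar(sc, path):
    """Ship a jar to the executors, working around the fact that PySpark's
    SparkContext exposes addFile but not addJar.

    Prefers a public addJar if this SparkContext happens to have one;
    otherwise falls back to the underlying JavaSparkContext via the
    private _jsc handle (a py4j call into the JVM).
    """
    if hasattr(sc, "addJar"):
        sc.addJar(path)       # use the public method if it exists
    else:
        sc._jsc.addJar(path)  # private handle; may break across Spark versions
```

Since _jsc is a private attribute, treat this as a stopgap rather than a supported API.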
Thanks for following up on this. sc._jsc.addJar did the trick! My worker nodes don’t have /var/run/spark. I searched for the jar on the entire filesystem of the worker node and did not find it. Is it recommended to use _jsc?
I don’t know why they don’t expose it in Python! Clearly it’s necessary…