We are getting an exception when calling the hl.import_vcf() function. The hl.init() function completes without issue, which makes me think the hail jar is loading properly. The import_vcf function is the first one we are running that creates a job on the cluster. We are running this on an AWS EMR cluster with hail (master branch), and spark 2.2.1.
Exception from task attempt logs:
org.apache.spark.SparkException: Failed to register classes with Kryo
Caused by: java.lang.ClassNotFoundException: is.hail.kryo.HailKryoRegistrator
Relevant settings in spark-defaults.conf:
I suspect that this problem is related to the hail jar not being properly visible on Spark workers.
hl.init() will fail if the driver isn’t configured correctly, but if the workers are misconfigured, I’d expect a java.lang.ClassNotFoundException from the first operation that loads Hail classes on the worker machines.
I think maybe you need to pass --jars or the appropriate config in the spark-defaults to get the jar to ship correctly to the workers.
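As a sketch, the spark-defaults.conf entries that ship the jar and put it on both classpaths might look like the following (the /home/hadoop path is taken from the jar location mentioned later in this thread; adjust to your layout):

```
spark.jars                      /home/hadoop/hail-all-spark.jar
spark.driver.extraClassPath     ./hail-all-spark.jar
spark.executor.extraClassPath   ./hail-all-spark.jar
```

spark.jars copies the jar into each executor’s working directory, which is what makes the relative ./hail-all-spark.jar classpath entries resolve there.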
I’m sorry you’re running into this issue! Since hl.init() executed successfully, I suspect the Hail jar is in the right location on the driver node. However, import_vcf must actually communicate with the executors (worker nodes). Based on the error message, I suspect hail-all-spark.jar is not in the current working directory of the Spark processes on your executors. If you are using spark-shell, are you also passing the --jars parameter? If you’re not using spark-shell, what command are you using to start interacting with the cluster?
We are interacting with hail using the Apache Toree - Pyspark kernel for Jupyter. The jar is on the namenode of the cluster in /home/hadoop. The following is the first cell of our notebooks:
import hail as hl
My assumption was that sc.addFile was adding the jar to HDFS. This worked fine with 0.1; this is our first attempt with 0.2.
Does sc.addJar have different semantics from sc.addFile? Maybe try that?
I get an error that addJar does not exist
Message: Traceback (most recent call last):
  File "/tmp/kernel-PySpark-583e9b04-451e-4331-9e04-300634d28644/pyspark_runner.py", line 194, in
  File "", line 1, in
AttributeError: 'SparkContext' object has no attribute 'addJar'
Ah, okay. addJar is a method on the Scala SparkContext that was never exposed in PySpark. This syntax looks right…
Sorry for the long delay on my reply, @atebbe
Let’s recall your Spark classpath settings:
These assert that, on both the driver and the executors, the jar is located in the working directory of the Spark process. If you ssh to one of your executors and find the Spark job working directory (try looking in /var/run/spark/work), I suspect you will not find hail-all-spark.jar there. While you’re at it, can you open a terminal in your Jupyter notebook and verify that hail-all-spark.jar is indeed in the working directory of the driver?
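One way to check, once you’ve sshed into a worker, is to search the Spark work directory for the jar (the /var/run/spark/work location is an assumption; on EMR the work directory may live elsewhere, e.g. under /mnt):

```shell
# Look for the jar anywhere under the Spark work directory.
# Suppress permission errors; exit cleanly even if the path is absent.
find /var/run/spark/work -name 'hail-all-spark.jar' 2>/dev/null || true
```

No output means the jar never made it into any job’s working directory on that node.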
This StackOverflow post suggests that addFile is inappropriate for “runtime dependencies”.
So, assuming the jar is indeed missing from the working directory of your executors, we need to figure out how to get it there. Try sc._jsc.addJar instead of sc.addFile. If that fails, Apache Toree suggests using the %AddJar magic invocation to add a jar.
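As a hedged sketch, the workaround could be wrapped in a small helper (the function name add_jar is my own invention, not part of Hail or PySpark):

```python
def add_jar(sc, path):
    """Ship a jar to the executors, working around the fact that PySpark's
    SparkContext exposes addFile but not addJar.

    Prefers a public addJar if this SparkContext happens to have one;
    otherwise falls back to the underlying JavaSparkContext via the
    private _jsc handle (a py4j call into the JVM).
    """
    if hasattr(sc, "addJar"):
        sc.addJar(path)       # use the public method if it exists
    else:
        sc._jsc.addJar(path)  # private handle; may break across Spark versions
```

Since _jsc is a private attribute, treat this as a stopgap rather than a supported API.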
Thanks for following up on this. sc._jsc.addJar did the trick! My worker nodes don’t have /var/run/spark. I searched for the jar on the entire filesystem of the worker node and did not find it. Is it recommended to use _jsc?
I don’t know why they don’t expose it in Python! Clearly it’s necessary…