Using Hail on Spark 2.1.1 Azure HDInsight causes error

Good day!

I am using Spark 2.1.1 on HDInsight in Azure. I have built Hail as described here https://hail.is/docs/stable/getting_started.html for Spark version 2.1.1. However, when I execute the following in ipython:

from hail import *
hc = HailContext()

I get this error:

Py4JJavaError: An error occurred while calling z:is.hail.HailContext.apply.
: java.lang.IllegalArgumentException: requirement failed: This Hail JAR was compiled for Spark 2.1.1,
but the version of Spark available at runtime is 2.1.1.2.6.2.25-1.
at scala.Predef$.require(Predef.scala:224)
at is.hail.HailContext$.configureAndCreateSparkContext(HailContext.scala:40)
at is.hail.HailContext$.apply(HailContext.scala:166)
at is.hail.HailContext.apply(HailContext.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

It looks like the Spark version py4j reports at runtime is the real Spark version, 2.1.1, with the Hadoop distribution version, 2.6.2.25-1, appended, which creates a mismatch between the runtime version and the version the Hail JAR was compiled for.
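For what it's worth, the version string Spark itself reports at runtime (which appears to be what the check above compares against) can be confirmed directly from PySpark. A minimal sketch, assuming a SparkContext can be created on the name node:

import pyspark

sc = pyspark.SparkContext.getOrCreate()
# On this cluster this prints the HDP-suffixed string from the error above,
# i.e. 2.1.1.2.6.2.25-1 rather than plain 2.1.1.
print(sc.version)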

How can I fix this, or is there a workaround?

Regards,
-Yuriy

We’ve discussed relaxing the version check to verify just the major version (X.Y) instead of the full version. For now, you’ll need to compile against the full Spark version, 2.1.1.2.6.2.25-1.
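Something along these lines should do it, assuming the same -Dspark.version Gradle property the getting started instructions use, with the HDP-suffixed version taken from your error message:

./gradlew -Dspark.version=2.1.1.2.6.2.25-1 shadowJar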

Note that 0.1 isn’t really maintained anymore, so you might want to consider switching to 0.2: https://www.hail.is/docs/devel

Thanks for your answer. I did try building against the full version, but it failed with:

FAILURE: Build failed with an exception.

* What went wrong:
Could not resolve all dependencies for configuration ':runtime'.

Could not find org.apache.spark:spark-core_2.11:2.1.1.2.6.2.25-1.
Searched in the following locations:

Required by:
:hail:unspecified

Could not find org.apache.spark:spark-sql_2.11:2.1.1.2.6.2.25-1.
Searched in the following locations:

Required by:
:hail:unspecified

Could not find org.apache.spark:spark-mllib_2.11:2.1.1.2.6.2.25-1.
Searched in the following locations:

Required by:
:hail:unspecified

Could not find org.apache.spark:spark-sql_2.11:2.1.1.2.6.2.25-1.
Searched in the following locations:

Required by:
:hail:unspecified > org.elasticsearch:elasticsearch-spark-20_2.11:5.5.1

Could not find org.apache.spark:spark-core_2.11:2.1.1.2.6.2.25-1.
Searched in the following locations:

Required by:
:hail:unspecified > org.elasticsearch:elasticsearch-spark-20_2.11:5.5.1 > org.apache.spark:spark-streaming_2.11:2.1.0

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Note: I removed the URLs of the locations it searched, due to the 2-link limit for new users…

Regards,
-Yuriy

Huh, interesting. This is a good reason to relax the check a bit.

https://github.com/hail-is/hail/issues/3296

Will try to have this done in the next few days.

Although, hmm. This is a problem with 0.1. You should upgrade to 0.2; we’ll fix the problem there!

I am not sure how easily I can change the Spark version on the cluster, as it comes preset on the Microsoft VMs in the cloud. I will look into it, but most likely the answer from the ops people will be no.

Looks like Azure has a Spark 2.2 image as of a few weeks ago: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes#notes-for-03202018---release-of-spark-22-on-hdinsight-36

Yes, but I read that you guys (Hail) do not support it just yet. I recall SPARK_CLASSPATH was not working (reading your forums, the rumor was that Spark 2.2 dropped the SPARK_CLASSPATH check), so the Hail JAR file could not be found. In fact, I started with Spark 2.2 and had to ask the ops folks to downgrade to 2.1.

Hail 0.2 actually only supports Spark 2.2+ :slight_smile:

This is mostly due to the need to move to Python 3.6; previous versions of Spark were incompatible with Python 3.6.

I guess I was reading the old ‘getting started’ page, which describes the stable Hail 0.1 release on Spark 2.0.2 and 2.1.x.

So, are you really suggesting I start using Hail 0.2 on Spark 2.2 instead?

Yes, I’d really recommend using the 0.2 beta version. It still has occasional interface changes, but its interfaces are much better and it is much more flexible than 0.1 was.

Thanks!

So I built Hail 0.2 from source against Spark 2.2, based on https://hail.is/docs/devel/getting_started.html, and was about to test a sample:

import hail as hl
hl.init(sc)

My question is about the SparkContext sc that gets passed in: will ipython create one for me, as is typically the case with other tools (since it is backed by a cluster and I run this from a name node), or should I create one programmatically beforehand and then pass it in?

As of now, if I follow the instructions, sc does not exist when I try the sample.

Regards,
-Yuriy

Hail will create a Spark context with default parameters if you don’t pass one into hl.init(). If you do pass one in, you’ll need to be sure to set the config parameters mentioned in the getting started page.

We usually let Hail construct the Spark context.
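For concreteness, a minimal sketch of both options; the JAR path below is a placeholder for wherever your build put hail-all-spark.jar, and the spark.jars / extraClassPath / Kryo serializer settings are the config parameters referred to above:

import pyspark
import hail as hl

# Option 1: let Hail construct the SparkContext with its default configuration.
#     hl.init()

# Option 2: construct the SparkContext yourself and pass it in. The Hail JAR
# and serializer settings must then be on the conf before the context is created.
hail_jar = '/usr/local/hail/build/libs/hail-all-spark.jar'  # placeholder path
conf = (pyspark.SparkConf()
        .set('spark.jars', hail_jar)
        .set('spark.driver.extraClassPath', hail_jar)
        .set('spark.executor.extraClassPath', './hail-all-spark.jar')
        .set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
        .set('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'))
sc = pyspark.SparkContext(conf=conf)
hl.init(sc)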

Apparently parts of the Hadoop codebase still use Python 2, so they fail to run under the Python 3 environment (e.g. Anaconda) used by Hail, e.g. the /usr/bin/hdp-select script.

Oh well. Maybe at some point it will be a seamless integration…

In [1]: import hail as hl;

In [2]: hl.init();

File "/usr/bin/hdp-select", line 242
print "ERROR: Invalid package - " + name
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access ‘/usr/hdp//hadoop/lib’: No such file or directory
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.3.2-13/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.3.2-13/spark_llap/spark-llap-assembly-1.0.0.2.6.3.2-13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

TypeError                                 Traceback (most recent call last)
in <module>()
----> 1 hl.init();

in init(sc, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmp_dir, default_reference)

/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py in _typecheck(orig_func, *args, **kwargs)
    488     def _typecheck(orig_func, *args, **kwargs):
    489         args_, kwargs_ = check_all(orig_func, args, kwargs, checkers, is_method=False)
--> 490         return orig_func(*args_, **kwargs_)
    491
    492     return decorator(_typecheck)

/usr/local/hail/build/distributions/hail-python.zip/hail/context.py in init(sc, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmp_dir, default_reference)
    154     """
    155     HailContext(sc, app_name, master, local, log, quiet, append,
--> 156                 min_block_size, branching_factor, tmp_dir, default_reference)
    157
    158 def stop():

in __init__(self, sc, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmp_dir, default_reference)

/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py in _typecheck(orig_func, *args, **kwargs)
    479     def _typecheck(orig_func, *args, **kwargs):
    480         args_, kwargs_ = check_all(orig_func, args, kwargs, checkers, is_method=True)
--> 481         return orig_func(*args_, **kwargs_)
    482
    483     return decorator(_typecheck)

/usr/local/hail/build/distributions/hail-python.zip/hail/context.py in __init__(self, sc, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmp_dir, default_reference)
     51         self._jhc = self._hail.HailContext.apply(
     52             jsc, app_name, joption(master), local, log, True, append,
---> 53             min_block_size, branching_factor, tmp_dir)
     54
     55         self._jsc = self._jhc.sc()

TypeError: 'JavaPackage' object is not callable

Ah… that’s really annoying. I’m not sure what to suggest :frowning:
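For anyone who lands here with the same trace: a py4j TypeError of the form 'JavaPackage' object is not callable generally means the JVM behind the gateway could not load the class being called, i.e. the hail-all-spark.jar is not on the Spark driver's classpath. A minimal diagnostic sketch, assuming an already-created SparkContext sc:

# py4j resolves names it can load as classes to JavaClass objects and leaves
# everything else as JavaPackage placeholders, so this distinguishes
# "JAR visible" from "JAR missing". ('is' is a Python keyword, hence getattr.)
hail_pkg = getattr(sc._jvm, 'is').hail
print(type(hail_pkg.HailContext))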