SparkContext to HailContext on HPC

Hi all,
I’ve been working on getting Hail running on an HPC cluster and I’ve been having some growing pains. After working through the Databricks tutorial, it seemed you have to pass a SparkContext object to the HailContext constructor if you want to use resources that are already allocated (i.e., provide a master URL), like so:

sc = SparkContext(master=url_of_allocated_driver, ...)
hc = HailContext(sc)
# do stuff in Hail

But when I try to do this, even on a local machine, I get errors that basically boil down to the following.
My code:

sc = SparkContext("local", "Simple App")
hc = HailContext(sc)

when run, gives this error:

Traceback (most recent call last):
  File "spark_test2.py", line 12, in <module>
    hc = HailContext(sc)
  File "/scratch/PI/dpwall/computeEnvironments/hail/python/hail/context.py", line 62, in __init__
    parquet_compression, min_block_size, branching_factor, tmp_dir)
  File "/share/sw/free/spark.2.1.0/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/share/sw/free/spark.2.1.0/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o19.apply. Trace:
py4j.Py4JException: Method apply([class org.apache.spark.api.java.JavaSparkContext, class java.lang.String, class scala.None$, class java.lang.String, class java.lang.String, class java.lang.Boolean, class java.lang.Boolean, class java.lang.String, class java.lang.Integer, class java.lang.Integer, class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

Looking through the Hail source code at this point, it calls a Scala version of the HailContext object (I think) and passes it all the variables that are given as arguments to the HailContext constructor. If the ‘sc’ argument is not supplied to the HailContext constructor, it seems like a default object is what ends up as the SparkContext attached to the HailContext. But in that case, how can HailContext know the parameters of the Spark instance (i.e., what the master URL is), given that, at least in the above code, we can’t pass a new SparkContext with this information to the HailContext constructor as the ‘sc’ argument?
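
To make the two cases I’m asking about concrete, here is roughly what I mean. This is just a sketch on my part: the master URL is a placeholder, and I’m guessing that HailContext exposes a ‘master’ keyword based on the scala.None$ slot in the constructor call above.

from pyspark import SparkContext
from hail import HailContext

# Case 1: let Hail build the SparkContext itself and just tell it where the
# master is. (Guessing that HailContext accepts a 'master' keyword; the URL
# below is a placeholder.)
hc = HailContext(master='spark://hpc-master-node:7077')

# Case 2: build the SparkContext myself and hand it to Hail (the path that
# fails for me above).
sc = SparkContext(master='spark://hpc-master-node:7077', appName='hail_job')
hc = HailContext(sc)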

It seems like Hail is able to figure it out somehow when the script is running on the same machine as the master, and I’d be curious about the specifics. But why can’t we pass SparkContext information to the HailContext? For example, let’s say I have an instance of Spark up and running on an HPC cluster and I would like to run a script against it. Do I simply instantiate a HailContext with the master parameter pointing at the master node’s URL, and assume that HailContext will figure out the resources available to it?

Overall, I guess I am curious about the cases in which we would pass a SparkContext as the ‘sc’ argument to HailContext, and why it is not working in this simple test case.

Also, I get the same error message when I try to run

hc = HailContext(sc)

in PySpark, where ‘sc’ is already defined.

Thanks so much for helping to clarify this!

This is a bug; I’ve just PRed a fix:

https://github.com/hail-is/hail/pull/1507

The fix is merged to master if you’d like to try again.

I tried it in both raw Python and PySpark and got a new error. It seems to be a problem with the profile having too small a starting maxPartitionBytes and openCostInBytes. I’m uncertain how to change these parameters even after extensive googling. Any ideas? Thank you!

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/PI/dpwall/computeEnvironments/hail/python/hail/context.py", line 64, in __init__
    parquet_compression, min_block_size, branching_factor, tmp_dir)
  File "/share/sw/free/spark.2.1.0/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/share/sw/free/spark.2.1.0/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o18.apply.
: is.hail.utils.package$FatalException: Found problems with SparkContext configuration:
  Invalid config parameter 'spark.sql.files.openCostInBytes=': too small. Found 0, require at least 50G
  Invalid config parameter 'spark.sql.files.maxPartitionBytes=': too small. Found 0, require at least 50G
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:5)
	at is.hail.utils.package$.fatal(package.scala:20)
	at is.hail.HailContext$.checkSparkConfiguration(HailContext.scala:104)
	at is.hail.HailContext$.apply(HailContext.scala:162)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

These are Spark config options that Hail requires – they’re not really documented, which is a problem. When Hail creates a SparkContext, it configures it properly, but when one is passed in (hc = HailContext(sc)) we have to check.

How is your SparkContext created? Is it created from a pyspark or spark-submit command? You can configure most Spark startup options with a --conf flag, e.g. --conf spark.sql.files.maxPartitionBytes=1000000000
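
If you’re building the context in Python instead, the same settings need to go on a SparkConf before the SparkContext is created. Here is a rough sketch (the values are placeholders, and it assumes PySpark’s SparkConf and SparkContext.getConf), including a quick check of what the driver actually picked up before you hand sc to Hail:

from pyspark import SparkConf, SparkContext

# Set the values Hail checks before the context is created; many spark.*
# settings are fixed once the JVM context exists. (Placeholder values.)
conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('HailTest') \
    .set('spark.sql.files.maxPartitionBytes', '100000000000') \
    .set('spark.sql.files.openCostInBytes', '100000000000')

sc = SparkContext(conf=conf)

# Sanity check: print what the running context actually sees.
for key in ('spark.sql.files.maxPartitionBytes', 'spark.sql.files.openCostInBytes'):
    print('%s = %s' % (key, sc.getConf().get(key)))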

Got it. I searched high and low on the Spark configuration page for those parameters and couldn’t find them. I found them on the pyspark.sql GitHub page, but couldn’t figure out how to pass them into the configuration.

I tried passing the new variables into spark-submit via --conf, which worked fine, but when I try to create a new SparkContext in raw Python the same error persists. This is what I tried:

conf = SparkConf()
conf.set('spark.sql.files.maxPartitionBytes','60000000000')
conf.set('spark.sql.files.openCostInBytes','60000000000')
conf.set('master','local[*]')
conf.set('spark.submit.deployMode', u'client')
conf.set('spark.app.name', u'PyTest')
sc = SparkContext(conf=conf)

which gave the same error as before. Interesting behaviour?

Interesting. I’m able to replicate the error by creating a fresh (unconfigured) context and trying to pass that into Hail, but don’t get an error when I paste in your command line:

In [1]: from pyspark import *

In [2]: from hail import *

In [3]: conf = SparkConf()
   ...: conf.set('spark.sql.files.maxPartitionBytes','60000000000')
   ...: conf.set('spark.sql.files.openCostInBytes','60000000000')
   ...: conf.set('master','local[*]')
   ...: conf.set('spark.submit.deployMode', u'client')
   ...: conf.set('spark.app.name', u'PyTest')
   ...: sc = SparkContext(conf=conf)
   ...:

In [4]: hc = HailContext(sc)
hail: info: SparkUI: http://192.168.1.2:4040

It’s the same error, yes? “Found 0, require at least 50G”?

Yes, that is the same error I have been getting