Hi All,
I’ve been working on getting Hail running on an HPC cluster and I’ve been having some growing pains. After working through the Databricks tutorial, it seemed you have to pass a SparkContext object to the HailContext constructor if you want to use resources that are already allocated (i.e. provide a master URL), like so:
sc = SparkContext(master=url_of_allocated_driver, etc.)
hc = HailContext(sc)
# do stuff in Hail
But when I try to do this, even on a local machine, I get errors that basically boil down to this.
My code:
sc = SparkContext("local", "Simple App")
hc = HailContext(sc)
When run, this gives the following error:
Traceback (most recent call last):
File "spark_test2.py", line 12, in <module>
hc = HailContext(sc)
File "/scratch/PI/dpwall/computeEnvironments/hail/python/hail/context.py", line 62, in __init__
parquet_compression, min_block_size, branching_factor, tmp_dir)
File "/share/sw/free/spark.2.1.0/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/share/sw/free/spark.2.1.0/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o19.apply. Trace:
py4j.Py4JException: Method apply([class org.apache.spark.api.java.JavaSparkContext, class java.lang.String, class scala.None$, class java.lang.String, class java.lang.String, class java.lang.Boolean, class java.lang.Boolean, class java.lang.String, class java.lang.Integer, class java.lang.Integer, class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Looking through the Hail source code at this point, it calls a Scala version of the HailContext object (I think) and passes it all the variables given as arguments to the HailContext constructor. If the 'sc' argument is not provided to the HailContext constructor, it seems like a default object is what ends up as the SparkContext attached to the HailContext. But in that case, how can HailContext know the parameters of the Spark instance (i.e. what the master URL is), given that, at least in the code above, we can't pass a new SparkContext carrying this information to the HailContext constructor as the 'sc' argument?
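To make sure I'm describing that clearly, here is roughly the mental model I have of what the constructor does. The class below is my own paraphrase with made-up names, not the actual Hail source, so please correct me if I've misread it:

from pyspark import SparkContext

class HailContextSketch(object):
    def __init__(self, sc=None, master=None, app_name='Hail'):
        if sc is None:
            # No SparkContext supplied: build a default one from master/app_name
            # (or from whatever spark-submit / spark-defaults.conf provide).
            sc = SparkContext(master or 'local[*]', app_name)
        # Either way, this SparkContext plus the remaining constructor arguments
        # get handed through py4j to the Scala-side HailContext.
        self.sc = sc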
It seems like Hail is able to figure this out somehow when the script is running on the same machine as the master, and I'd be curious about the specifics. But why can't we pass SparkContext information to the HailContext? For example, let's say I have an instance of Spark up and running on an HPC cluster and I would like to run a script against it. Do I simply instantiate a HailContext with the master parameter pointing at the master node's URL, and assume that HailContext will figure out the resources available to it?
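To be concrete, I'm imagining something like this (I'm only guessing at the keyword argument name and URL here, so treat it as a sketch of the intent rather than known-working code):

hc = HailContext(master='spark://my-hpc-master-node:7077')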
Overall, I guess I'm curious in which cases we would pass a SparkContext 'sc' as the HailContext argument, and why it isn't working in this simple test case.
Also, I get the same error message when I try to run
hc = HailContext(sc)
in PySpark, where 'sc' is already defined.
Thanks so much for helping to clarify this!