Connect hail to master spark server on kubernetes


On our cluster I have a Spark server running on Kubernetes, to which I usually connect in PySpark by setting up a SparkContext with master pointing to the corresponding address (like spark://master_address:7077).

It is not clear to me how to install Hail on our Spark server and then connect to it from my Python scripts and notebooks.

From the documentation, I understand we first need to compile Hail on the Spark server using install-on-cluster. Do we need to do any additional configuration after running make install-on-cluster?
Any recommendations on how to do this when the Spark server is managed by Kubernetes?

Once Hail is installed on the Spark server (let’s say the Spark master address is spark://master_address:7077, as above), how can I proceed to connect Hail to this server?
Is it enough to create a PySpark SparkContext like this:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAll([
    ('spark.driver.memory', '16g'),
    ('spark.driver.cores', '4'),
    ('spark.executor.memory', '8g'),
    ('spark.executor.cores', '4'),
    ('spark.driver.extraClassPath', '$HAIL_HOME/hail-all-spark.jar'),
    ('spark.executor.extraClassPath', '$HAIL_HOME/hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
])

sc = SparkContext(master='spark://master_address:7077', appName='hail', conf=conf)
```

and then pass this to hail.init(sc)? Or do I have to do something else?

Thanks for the support!

I suspect the easiest thing to do is to run the install-on-cluster command in the Dockerfile that you use to generate the Docker image for your master pod.
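For illustration, a sketch of what that Dockerfile step might look like. The base image name and the Spark/Scala versions here are placeholders, not a tested recipe; check the Hail installation docs for the flags that match the versions your cluster actually runs:

```dockerfile
# Assumes a base image that already provides Spark, a JDK, and Python 3.
FROM your-spark-base-image

# Build tools needed to compile Hail from source.
RUN apt-get update && apt-get install -y git build-essential

# Compile and install Hail against the cluster's Spark version
# (substitute the Spark and Scala versions your cluster uses).
RUN git clone https://github.com/hail-is/hail.git /hail && \
    cd /hail/hail && \
    make install-on-cluster HAIL_COMPILE_NATIVES=1 SPARK_VERSION=3.1.1 SCALA_VERSION=2.12.13
```

Since the master pod's image then contains both Spark and the Hail jar, nothing extra should be needed on the server side.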

To initialize Hail, I would just use hl.init(master='spark://...'). If you need to pass special Spark configuration, you can use spark_conf=.... You don’t need to set the class paths, the serializer, or the Kryo registrator if you let hl.init create its own SparkContext.
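Concretely, something like the sketch below. The master URL is a placeholder for your cluster's address, and the spark_conf entries are just examples of settings you might pass (this is connection configuration, so it only runs against a live cluster):

```python
import hail as hl

# Hypothetical master address -- substitute your cluster's spark:// URL.
hl.init(
    master='spark://master_address:7077',
    spark_conf={
        'spark.driver.memory': '16g',
        'spark.executor.memory': '8g',
    },
)
```

hl.init then builds the SparkContext itself, with the Hail jar and Kryo settings wired in for you.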

Thanks for the suggestion. I’ve asked our admin to install Hail into the Docker image they use for the Spark server.
On the client side, i.e. on my machine where I want to launch the analysis, I currently have Hail installed in a conda env using pip. Is that OK for running a Hail instance connected to the Spark server via hl.init(master='spark://...') as you suggested?
Or do I also need to build Hail in install-on-cluster mode?


I’ve never done what you’ve described, but, yes, I believe that should work just fine with a pip-installed Hail.