Running Hail with a remote Spark


Hello. Can I tell Hail to use a remote Spark installation? All of the examples I’ve seen require specifying local paths in the configuration, but I’d like my Hail installation to live on a different host. I’m new to Spark and Hail, so please let me know if what I’m trying to do doesn’t make sense.

Thanks for your help.


Hi Joel,
Could we get a bit more information? Usually, if you can run Spark in any given system setup, you can also run Hail.

Most Hail users (at least at the Broad) are running on Google Dataproc at the moment, which manages the deployment from a Hail jar on a bucket. This post has more info: Using Hail on the Google Cloud Platform


Sure. I do want to use Google Dataproc, but I want to manage the Hail deployment myself, if possible. My goal is to run Hail in a Docker container so I can package it with Jupyter and Firecloud. This setup works for me if I run a local Spark inside the Docker container, but I’m looking for a way to submit to Google Dataproc’s Spark, outside of Docker, instead.


Hi Joel,
This is uncharted territory for us, so I’m not totally sure, but I think that Google Dataproc won’t really work here. Dataproc makes it easy to use Spark by installing Spark on every VM and designating a master inside the cluster, then connecting the VMs with Spark’s start-master.sh and start-worker.sh scripts.

I do still think it’s possible to use a FireCloud VM as the driver of an external Spark cluster, but it’ll involve installing Spark on each machine and running the start-worker.sh script, pointing each worker at the driver VM’s IP address.
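To make that concrete, manually wiring up a standalone Spark cluster along these lines would look roughly like the sketch below. This is an illustration, not a tested recipe: it assumes Spark is unpacked at `$SPARK_HOME` on every machine, that `DRIVER_IP` (a placeholder) is the FireCloud VM’s address, and that the workers can reach it on Spark’s default master port, 7077.

```shell
# On the driver VM: start the standalone master
# (it listens on port 7077 by default)
$SPARK_HOME/sbin/start-master.sh

# On each worker VM: start a worker pointed at the driver VM's master.
# DRIVER_IP is a placeholder for the FireCloud VM's address.
$SPARK_HOME/sbin/start-worker.sh spark://DRIVER_IP:7077
```

Note that older Spark releases ship this worker script under the name start-slave.sh rather than start-worker.sh.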

I’m busy this week and traveling the next, but would be happy to meet again to chat about this when I get back.


Thanks Tim!

We’re exploring a few possible routes to accomplish what we’re looking for. I’ll focus on a different alternative for a while, but I would probably be up for a chat when you get back.


Is it sufficient to use a docker container as an execution environment for a script that starts a cluster and waits for the result? A la:

# add the dataproc options you need (e.g. machine type, number of workers)
gcloud dataproc clusters create mycluster \
  --initialization-actions gs://hail-common/

HASH=$(gsutil cat gs://hail-common/latest-hash.txt)

gcloud dataproc jobs submit pyspark \
  --cluster=mycluster \
  --files=gs://hail-common/hail-hail-is-master-all-spark2.0.2-$HASH.jar \
  --py-files=gs://hail-common/pyhail-hail-is-master-$HASH.zip \
  --properties="spark.driver.extraClassPath=./hail-hail-is-master-all-spark2.0.2-$HASH.jar,spark.executor.extraClassPath=./hail-hail-is-master-all-spark2.0.2-$HASH.jar" \
  your_script.py  # placeholder: the pyhail script you want to run
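Since the point is a script that starts a cluster and waits for the result: `gcloud dataproc jobs submit` streams the driver output and blocks until the job finishes, so the tail of such a wrapper script could simply tear the cluster down once the submit command returns (a sketch, using the same cluster name as above):

```shell
# `gcloud dataproc jobs submit` blocks until the job completes,
# so once it returns we can delete the cluster to stop billing.
# --quiet skips the interactive confirmation prompt.
gcloud dataproc clusters delete mycluster --quiet
```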


That’s an interesting idea, danking, assuming I can make Jupyter submit using this command. Worth looking into as a possibility!


How do you use Jupyter notebooks?

If you need an interactive notebook backed by a Dataproc cluster, Liam has a great post on how to set that up. Liam suggests running a notebook server on the master node and connecting to the notebook server over an SSH tunnel from the user’s machine. He links to a repository of scripts that handle all the configuration and tunneling for the user.
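For anyone following along, the tunneling step in that setup boils down to something like the sketch below. The names are illustrative: `mycluster-m` follows Dataproc’s default naming for the master node, the zone is a placeholder, and 8123 stands in for whatever port the notebook server was started on.

```shell
# Open an SSH tunnel from your machine to the notebook server
# running on the Dataproc master node. -L forwards local port 8123
# to the master's localhost:8123; -N means "don't run a remote command".
gcloud compute ssh mycluster-m --zone=us-central1-b -- \
  -L 8123:localhost:8123 -N

# then browse to http://localhost:8123 on your machine
```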


Yes, I was able to follow those instructions to run Jupyter on Dataproc.

But I’m not the end user here. My goal is to package up Jupyter + Hail in a Docker image for FireCloud users.


How will the users connect to the Jupyter instance?

If the Docker image is meant to serve Jupyter on a particular port that the user can navigate to, perhaps you can set up a proxy in the Docker container that connects to the Dataproc master running the actual Jupyter instance.
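One way to sketch that proxy idea is an SSH port forward run inside the container, so the container’s published port just relays traffic to the notebook server on the master node. Everything here is a placeholder (`MASTER_IP`, the user, and both port numbers), and this assumes the container has SSH credentials for the master:

```shell
# Inside the container: forward the container's port 8000 to the
# Jupyter server on the Dataproc master (listening there on 8123).
# -g allows connections to the forwarded port from outside the
# container, so Docker's published port reaches it; -N runs no command.
ssh -N -g -L 8000:localhost:8123 user@MASTER_IP
```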

If it would help, I’d be happy to chat in person. I’m around the next two weeks in 75A-9101.


Thanks for the proxy suggestion, Dan. We don’t have a specific plan for user interaction yet, so we’ll make sure to explore that idea.

My current plan is to proceed step by step up the stack. Once I have Spark working the way I like, I’ll probably have Hail (and then Jupyter) questions for you.