Hail with Jupyter Notebooks
Start up a cluster
To start a Google Dataproc cluster with Jupyter Notebook installed and configured for use with Hail, use the gs://hail-common/init_notebook.py initialization script while starting the cluster.
This script installs an Anaconda Python 2.7 distribution and some supplementary Python modules on the master node, modifies some default Spark configuration settings to allow the latest Hail jar to be distributed to worker nodes as needed, and creates a Hail kernel for Jupyter with the appropriate environment variables set.
The initialization script, as well as a small suite of helper scripts for working with Hail on Google Dataproc, can be seen in this repository: https://github.com/Nealelab/cloud-tools.
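As a rough sketch of the step above, the cluster can be created with the gcloud command-line tool by passing the script through its --initialization-actions flag. The cluster name mycluster and the helper functions below are illustrative assumptions, not part of cloud-tools, and a real deployment will usually need further flags (region, machine types, and so on) that are not covered here.

```python
import subprocess

# Path to the initialization script, as given in the text above.
INIT_SCRIPT = "gs://hail-common/init_notebook.py"


def build_create_command(cluster_name, init_script=INIT_SCRIPT):
    """Return the gcloud argument list that creates a Dataproc cluster
    with the notebook initialization script attached.

    Hypothetical helper for illustration only.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--initialization-actions", init_script,
    ]


def start_cluster(cluster_name):
    """Run the command. Requires an installed, authenticated gcloud."""
    subprocess.check_call(build_create_command(cluster_name))


if __name__ == "__main__":
    # Print the command instead of running it, so no cluster is created by accident.
    print(" ".join(build_create_command("mycluster")))
```

Building the argument list separately from running it makes it easy to inspect the command before any resources are created.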
Connect to Jupyter Notebook
After a cluster has been started with the notebook initialization script, you'll need to connect to the Jupyter notebook running on the master node. To do this, open an SSH tunnel to the master node and configure Google Chrome to use it as a SOCKS proxy for accessing the cluster, as described in Google's documentation.
The connect_cluster.py script in the cloud-tools repository referenced above does this for you. To use the script, download a copy to your local machine and run it:
$ ./connect_cluster.py --name mycluster
The script will use port 10000 on your local machine by default to open the SSH tunnel, but you can change this to port p by adding --port p to the script invocation.
Note: The connect script assumes you have Google Chrome installed on your machine in the default macOS location: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome.
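For readers who prefer to set up the connection by hand, the following sketch builds the two commands involved: an SSH tunnel that acts as a SOCKS proxy, and a Chrome instance that sends its traffic through that proxy. It assumes that the master node is named after the cluster with a -m suffix (the usual Dataproc convention) and that gcloud is installed. The zone in the example is only a placeholder and the function names are hypothetical. This is not necessarily how connect_cluster.py is implemented.

```python
# Chrome location assumed by the connect script (see the note above).
CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def tunnel_command(cluster_name, zone, port=10000):
    """Argument list for an SSH tunnel to the master node.

    -D opens a local SOCKS proxy on `port`; -N runs no remote command.
    Assumes the master node is called "<cluster_name>-m".
    """
    return [
        "gcloud", "compute", "ssh", cluster_name + "-m",
        "--zone", zone,
        "--", "-D", str(port), "-N",
    ]


def chrome_command(cluster_name, port=10000):
    """Argument list for a Chrome instance that routes traffic via the tunnel.

    A separate --user-data-dir keeps the proxy settings out of the
    everyday browser profile.
    """
    return [
        CHROME,
        "--proxy-server=socks5://localhost:{}".format(port),
        "--user-data-dir=/tmp/{}".format(cluster_name),
    ]


if __name__ == "__main__":
    # "us-central1-b" is a placeholder zone; substitute your cluster's zone.
    print(tunnel_command("mycluster", "us-central1-b"))
    print(chrome_command("mycluster"))
```

Each list can be handed to subprocess.Popen: start the tunnel first and leave it running, then launch the browser.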
Use Hail in a notebook
After opening an SSH tunnel and a properly configured Google Chrome browser using connect_cluster.py or your own methods, navigate to localhost:8123 (the notebook initialization script starts the Jupyter notebook server on port 8123 of the master node by default).
You should see the Google Storage home directory of the project your cluster was launched in, with all of the project's buckets listed.
Select the bucket you'd like to work in, and you should see all of the files and directories in that bucket. You can either resume working on an existing .ipynb notebook file in the bucket, or create a new Hail notebook by selecting Hail from the New notebook drop-down in the upper-right corner.
From the notebook, you can use Hail the same way that you would in a complete job script:
import hail
hc = hail.HailContext()
When a command is run that invokes a Spark job, you'll see a progress bar displayed below the notebook code cell until the job is complete. You can also get a more in-depth look at the progress of running Spark jobs by opening a new tab in your proxy-configured browser and navigating to the Spark Web UI at
You can scale the number of worker nodes in your cluster up or down while the Jupyter notebook is running by using the Google Cloud Console.
To read or write files stored in a Google bucket outside of Hail-specific commands, use Hail's hadoop_read() and hadoop_write() helper functions. For example, to read a file from Google Storage into a pandas DataFrame:
import hail
import pandas as pd
hc = hail.HailContext()
with hail.hadoop_read('gs://mybucket/mydata.tsv') as f:
    df = pd.read_csv(f, sep='\t')
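Writing works along the same lines. The sketch below wraps hail.hadoop_write() in a small hypothetical helper, save_lines, that writes text lines to a path. It assumes hadoop_write() returns a file-like object usable in a with statement, as hadoop_read() does in the example above. The opener parameter is an illustrative addition so that the helper can be tried against a local file outside the cluster.

```python
def save_lines(path, lines, opener=None):
    """Write each string in `lines` to `path`, one per line.

    By default the file is opened with hail.hadoop_write, so `path`
    can be a gs:// URL. `opener` is a hypothetical hook for supplying
    another open function, for example when trying this locally.
    """
    if opener is None:
        import hail  # available inside the Hail notebook kernel
        opener = hail.hadoop_write
    with opener(path) as f:
        for line in lines:
            f.write(line + "\n")


# In a Hail notebook (bucket and file names are placeholders):
# save_lines('gs://mybucket/notes.tsv', ['col1\tcol2', 'a\t1'])
```

Importing hail inside the function keeps the helper importable on machines where Hail is not installed, at the cost of a slightly unusual import placement.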
When you save your notebooks using either File -> Save and Checkpoint or command + s, they'll be saved automatically to the bucket you're working in.