Using Hail with Jupyter Notebooks on Google Cloud

Out of date warning (May 11, 2018):

This post is extremely dated and no longer reflects how Hail is used in 2018. The cloudtools repository README is more accurate and is updated somewhat regularly.

Hail with Jupyter Notebooks

Start up a cluster

To start a Google Dataproc cluster with Jupyter Notebook installed and configured for use with Hail, use the gs://hail-common/ initialization script while starting the cluster.

This script installs an Anaconda Python 2.7 distribution and some supplementary Python modules on the master node, modifies some default Spark configuration settings to allow the latest Hail jar to be distributed to worker nodes as needed, and creates a Hail kernel for Jupyter with the appropriate environment variables set.
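A Jupyter kernel is just a small kernelspec file that tells Jupyter how to launch an interpreter. As an illustration only (the actual paths and zip names the init script uses are not shown in this post, so everything below is a hypothetical sketch), a "Hail" kernel.json might look like:

```json
{
  "display_name": "Hail",
  "language": "python",
  "argv": ["python", "-m", "ipykernel", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/usr/lib/spark",
    "PYTHONPATH": "/usr/lib/spark/python:/path/to/py4j-src.zip:/path/to/hail-python.zip"
  }
}
```

The `display_name`, `language`, `argv`, and `env` keys are the standard Jupyter kernelspec fields; the environment-variable values here are placeholders, not the ones the init script actually writes.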

The initialization script, along with a small suite of helper scripts for working with Hail on Google Dataproc, can be found in this repository:

Connect to Jupyter Notebook

After a cluster has been started with the notebook initialization script, you’ll need to connect to the Jupyter notebook server running on the master node. To do this, open an SSH tunnel to the master node and configure Google Chrome to access the cluster through a SOCKS proxy, as described in Google's Dataproc documentation.

The script in the cloud-tools repository referenced above does this for you. To use the script, download a copy to your local machine and run it:

$ ./ --name mycluster

The script will use port 10000 on your local machine by default to open the SSH tunnel, but you can change this to port p by adding --port p to the script invocation.

Note: The connect script assumes Google Chrome is installed on your machine in the default location: /Applications/Google Chrome.

Use Hail in a notebook

After opening an SSH tunnel and a properly configured Google Chrome browser, either with the connect script or by your own methods, navigate to localhost:8123 (the notebook initialization script starts the Jupyter notebook server on port 8123 of the master node by default).

You should see the Google Storage home directory of the project your cluster was launched in, with all of the project’s buckets listed.

Select the bucket you’d like to work in, and you should see all of the files and directories in that bucket. You can either resume working on an existing .ipynb notebook file in the bucket, or create a new Hail notebook by selecting Hail from the New notebook drop-down in the upper-right corner.

From the notebook, you can use Hail the same way that you would in a complete job script:

import hail
hc = hail.HailContext()

When a command is run that invokes a Spark job, you’ll see a progress bar displayed below the notebook code cell until the job is complete. You can also get a more in-depth look at the progress of running Spark jobs by opening a new tab in your proxy-configured browser and navigating to the Spark Web UI at localhost:4040.

You can scale the number of worker nodes in your cluster up or down while the Jupyter notebook is running by using the Google Cloud Console.

To read or write files stored in a Google bucket outside of Hail-specific commands, use Hail’s hadoop_read() and hadoop_write() helper functions. For example, to read in a file from Google storage to a pandas dataframe:

import hail
import pandas as pd

hc = hail.HailContext()

with hail.hadoop_read('gs://mybucket/mydata.tsv') as f:
    df = pd.read_csv(f, sep='\t')

When you save your notebooks using either File -> Save and Checkpoint or command + s, they’ll be saved automatically to the bucket you’re working in.