To use Hail on the Google cloud, you will need:
- A Google account linked to a project with billing capabilities. Only for Broad users: you need to ask IT (firstname.lastname@example.org) to create the project and give the cost object that should be link to that account.
- Permissions to run Dataproc on that cluster (see Permissions)
- Enable the Dataproc API. Go to API Manager within your project and enable Dataproc API.
- Sufficient quotas for the cluster you want to start.
- Correct permissions to read (and probably write) on the data you will use (see Permissions)
- Download and install gsutil (https://cloud.google.com/storage/docs/gsutil). These are a set of tools to interact with the google cloud from the command line.
The Google storage system
The Google storage system will likely hold most of your data. Data on the Google cloud is stored in buckets. Your files will be stored as objects into the bucket in a flat manner (there are no directories). You can however have
/ in your object names and they will be interpreted as they were directories. To view or create buckets, you can go to your project webUI and click on Storage in the menu. Buckets are associated with a storage class and all data stored in that bucket will be in that storage class. Different storage classes have very different prices (see https://cloud.google.com/storage/docs/storage-classes). For data analytics applications, like Hail, the
regional storage class is likely the right choice.
You can then navigate your files and edit permissions (see below). You can also use
gsutil to create or view the content of buckets, e.g. to list the files in mybucket, use
gsutil ls gs://mybucket/. Google storage urls start with
gs:// and most of the gsutil command are named after their unix counterpart. Note that a useful option available for most commands in
-m, which runs the command in parallel.
Permissions on the Google cloud are complex and way beyond the scope of this guide. This section just assembles a few notes to get users going but is very far from complete and may omit information that might be important to your project, so consult Google documentation. That said, in the use case of running Hail you will need permissions to run and bill Dataproc on the projects and your service account (see below) will need permissions to read the data you want to access and write where you want to write using Hail.
Project level permissions
Each user is assigned one or more roles that grant the user permissions on resources (e.g. compute, storage, etc.) in the project. If you are a project owner, you can edit roles for other users in the IAM & Admin section of your project webUI (see also: https://cloud.google.com/iam/docs/quickstart).
Bucket level permissions
Permissions and default object permissions (i.e., those given to an object created in the bucket by default) can be set for each bucket. Reader, Writer and Owner can be set. These permissions can either be set on the webUI by clicking the
... on the left side of the bucket name
gsutil acl set (documentation here). Note that by default project Readers, Writers, Owners have Reader, Owner, Owner access to the buckets in the project buckets respectively.
Object level permissions
Object-level access can also be set as reader, writer or owner and changed just as with the bucket. You can also make a file public by selecting
Because there are no directories (see The Google storage system), it can be tedious to change permissions using the webUI. Fortunately, gsutil can change the permissions as if there were directories using
gsutil -m acl ch -r. For example if you want to grant Reader permissions to email@example.com to all files starting with
gs://mybucket/myvirtualdir/, you can wirte
gsutil -m acl ch -r firstname.lastname@example.org:R gs://mybucket/myvirtualdir. Note the
-m option, which allows the operation to run in parallel.
The virtual machines used with Dataproc (and other Google compute services) do not run with your user credentials; instead they run using credentials for a service account. Every project has a default service account, which will be used when running Dataproc. This and other service account addresses can be found in the IAM & Admin menu in your project.
In order to process data with Hail using Dataproc, the service account address needs to have the proper credentials to read/write data where necessary. Caution: service accounts are NOT personal and thus can be used by multiple users in the project (depending on their roles and permissions), so granting a service account access to data can potentially result in other people from that project having access to your data.
Dataproc cluster configuration
Cluster architecture in a nutshell
In a dataproc cluster, there are 3 classes of machines that will be used:
- Master node (n=1): a single machine that assigns and synchronizes tasks on the workers and processes some results.
- Worker nodes (n>=2): these machines are dedicated workers that are reserved for your task and will process the data. Worker nodes are about 5x more expensive than preemptible nodes, so usually it makes sense to minimize the number of regular worker nodes and with most of the workforce consisting of preemptible worker nodes. Worker nodes can mount local SSDs on which the Hadoop file system will be mounted. This provides fast storage accessible from all nodes and can be useful when producing temp files or in situations where you need fast IO.
Preemptible nodes: these machines usually constitute the main workforce on your cluster. They will automatically be of the same machine type as the regular workers, but (1) their CPUs aren’t reserved for you (you will still get the vast majority of them), (2) they cannot mount local SSDs (but have access to those mounted on the regular worker nodes), and (3) they will not be assigned more than 100G of disk (even if the worker nodes have more) unless you use the additional flag
Machine types and pricing
There are several machine types available with varying number of cores (CPUs) and memory per core. All prices and details can be found here: https://cloud.google.com/compute/pricing. In a nutshell, you can choose machines with 1, 2, 4, 8, 16 or 32 cores and 900M (highcpu), 3.75G (standard) or 6.5G (highmem) RAM per core. The price and performance scale approximately linearly with the number of CPUs, but 8 or 16 CPU machines may provide the best price/performance tradeoff. We recommend using highmem machines as many operations will benefit from additional memory, but standard often suffices. The hard disk size can also be configured. Note that all machine types are charged for a minimum of 10 minutes and then in increments of 1 minute.
Note: depending on the size of your data and the operations you are performing, this might not be the ideal configuration for your job.
- Master: n1-highmem-8, 100GB primary disk space. This should provide ample RAM and cores in case some results need to be processed locally. Because it’s only one machine, the cost of having a generous buffer of resources is not high.
Worker nodes: 2 x n1-highmem-8, 75GB primary disk space. Using the minimum of 2 workers lowers cost. 75GB primary disk space is fine unless you’re triggering a shuffle operation (e.g.
repartition), in which case you might need more depending on the overall size of your data.
Preemptible nodes: These will have the same specs (minus optional SSDs and larger than 100GB local disk) as the worker nodes. The number of preemptible workers depends on how large is your job and how fast you need the results. A good rule-of-thumb is to get no more than
nPartitions / (4 * x), where x is the number of cores per worker, preemptible workers: this means that a single core will process on average 4 partitions, which smooths out differences in the time it takes to process each partition. The absolute maximum number of cores you should use is
nPartitions, but this is a very inefficient use of resources since each stage will take as long as the worst-case partition; if you want to use more cores consider repartitioning to more partitions first, though beware of making partitions so small that processing is dominated by per-partition task startup overhead. In general, so long as there are at least a few partitions per core and each partition takes a few core-minute to process, then using more cores will complete the job proportionally faster at only a slightly higher cost and is probably worth it! You can count the number of partitions in the vds using
gsutil -m ls -r gs://mybucket/my.vds | grep -c parquet$or with sparkinfo. See also further discussion on partitioning below.
Important note about cluster size:
Clusters can be resized while they are running and your jobs will dynamically allocate the additional CPUs to partitions that have yet to be processed. Similarly you can also remove nodes and Hail will downsize appropriately (note that if you remove cores while in use, the data they were processing will be re-processed from scratch which can potentially lead to instability). A good strategy to maximize cluster use is the following:
- Start a minimal cluster with 2 workers nodes and a few preemptible nodes (0-10)
- Start your job and make sure that it is running properly (no typos, no errors in logic, etc.)
- Once it’s running smoothly, resize the cluster with the appropriate amount of nodes (see section Resizing a cluster)
- Once the job is done, resize the cluster to 2 workers and 0 preemptible nodes
- Check that everything went well and download the log and other files from HDFS or the master file system if desired.
- Delete the cluster.
Starting a cluster
Starting a cluster can either be done within a web browser or using
gsutil. Creating the cluster using
gsutil allows for passing more options than when using the browser. In particular, you can pass Spark and Hadoop properties to configure your cluster (see below). A cluster runs in a certain geographical zone. The zone you will pick depends on the location of your data (the data need to be accessible in that region, see The Google storage system). Finally you need to name your cluster. You will need to provide your cluster name when starting a job.
Below is a suggested configuration for cluster mycluster in project myproject (see also Machine types and Recommended configuration for a detailed description). Important: You need to set the version to 1.1 using the
--image-version 1.1 option.
gcloud dataproc clusters create mycluster \ --zone us-central1-b \ --master-machine-type n1-highmem-8 \ --master-boot-disk-size 100 \ --num-workers 2 \ --worker-machine-type n1-highmem-8 \ --worker-boot-disk-size 75 \ --num-worker-local-ssds 1 \ --num-preemptible-workers 4 \ --image-version 1.1 \ --project myproject \ --properties "spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=45g,spark:spark.driver.maxResultSize=30g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1" \ --initialization-actions gs://hail-common/hail-init.sh
Initiatlization actions are scripts that are run on the machines of a cluster when they start. They can for example be used to load data locally on the machines or install software/packages. Hail requires the python
decoratorpackage to run its python interface and thus the initialization action in
gs://gnomad-lfran/hail/hail-init.shthat download and install this package is required. When running VEP another initialization action is required (see Running VEP). Note that Initialization actions can also be specified when starting a cluster from the web interface.
You can set Spark, Hadoop and Yarn configuration by using the
--propertiesoption. Below are descriptions of recommended properties to use (for more detail on all the Spark configuration options, see https://spark.apache.org/docs/1.6.1/configuration.html).
spark:spark.executor.extraJavaOptions=-Xss4M=> these increase the Java stack size, addressing the issue discussed in this post.
spark.driver.memory=45g=> This allocates memory for Spark on the master node. Should not be more than the memory of the machine. So if you’re using a different type of machine than n1-highmem-8 you need to modify this value.
spark.driver.maxResultSize=30g=> Maximum size of results. Shouldn’t be larger than spark.driver.memory.
spark.kryoserializer.buffer.max=1g=> This increases the maximum buffer size when serializing objects
hdfs:dfs.replication=1=> This sets the Hadoop number of replications. Hadoop is a distributed file system that runs on the regular worker nodes but is accessible to all nodes (master, regular workers and preemptible workers). Data in Hadoop can be stored multiple times to increase performance (due to data locality) and reliability. In most cases, when running Hail on the Google cloud data is read/written to the Google Storage and not to HDFS. Unless you specifically want to use HDFS for you analyses, only temp data from Hail will be written there and thus a small replication factor will allow you to use smaller work disks (so cheaper).
Resizing a cluster
Browser: select the cluster name in Dataproc. Then go to the
Configuration tab and click on
Edit to change the number of workers and/or preemptible workers. Click on
Save to apply the changes.
Terminal: use the
gcloud dataproc clusters update command. For example, to change the cluster mycluster to 5 workers and 200 preemptible workers:
gcloud dataproc clusters update --num-workers 5 --num-preemptible-workers 200 mycluster.
Deleting a cluster
gcloud dataproc clusters delete command. For example, to delete mycluster:
gcloud dataproc clusters delete mycluster.
Logging onto the nodes of a cluster (SSH)
You can log onto any of the nodes of your cluster, which may be useful to look at node health and logs, or to interact with HDFS or the master file system.
Browser: select the cluster name in Dataproc. Then go to the
VM Instances tab, click on the node you want to log in and finally click on
Terminal: To log onto node mynode use
gcloud compute ssh mynode. The master is always named after your cluster name postfixed with ‘-m’. For example, if you cluster is mycluster, the master will be mycluster-m. You can also copy files (equivalent to
gcloud compute copy-files <source> <destination>, where either the source or the destination can be a node.
You can check the relevant quotas in IAM and Admin.
Relevant numbers are:
- CPUs => the maximum number of dedicated (i.e., master and worker) cores you can ask for
- Preemptible CPUS => how many preemptible cores you can ask for
If these limits are too low for the work you need to do, there is a button to request a quota increase on the right of each item. The procedure to increase the quota can take a couple of days.
Running a Hail job
Finding the Hail files on the cloud
Hail jars are continually deployed on the Google cloud in
gs://hail-common/. Each release is assigned a hash value, and the hash value for the latest hail release is always written to
gs://hail-common/builds/0.1/latest-hash-spark-2.0.2.txt, which can be read using
gsutil cat gs://hail-common/builds/0.1/latest-hash-spark-2.0.2.txt.
Given a hash value
$HASH, the corresponding Hail files are:
- The Hail jar:
- The Hail python lib:
Submitting a Hail job
To submit a Hail job to a Dataproc cluster, you need a python script using hail, for example the following Hail script will count variants and samples:
from hail import * hc = HailContext() hc.read('gs://mybucket/myvds.vds').count()
Then, use the
gcloud dataproc job submit pyspark command.
For example to run the script above named myhailscript.py using cluster mycluster, the jar _ gs://hail-common/builds/0.1/jars/hail-0.1-$HASH-Spark-2.0.2.jar _ and the libs gs://hail-common/builds/0.1/python/hail-0.1-$HASH.zip:
gcloud dataproc jobs submit pyspark \ --cluster=mycluster \ --files=gs://hail-common/builds/0.1/jars/hail-0.1-$HASH-Spark-2.0.2.jar \ --py-files=gs://hail-common/builds/0.1/python/hail-0.1-$HASH.zip \ --properties="spark.driver.extraClassPath=./hail-0.1-$HASH-Spark-2.0.2.jar,spark.executor.extraClassPath=./hail-0.1-$HASH-Spark-2.0.2.jar" \ myhailscript.py
A simple hail submission script that submits job using the latest version of Hail with minimal effort (
usage: pyhail-submit <cluster> <py-files> ) can be downloaded at
In addition, a newer /more complete python script to start hail jobs can be downloaded at
pyhail --help usage: pyhail [-h] [--script SCRIPT] [--inline INLINE] --cluster CLUSTER [--jar JAR] [--pyhail PYHAIL] [--add_scripts ADD_SCRIPTS] optional arguments: -h, --help show this help message and exit --script SCRIPT, --input SCRIPT, -i SCRIPT Script to run --inline INLINE Inline script to run --cluster CLUSTER Which cluster to run on --jar JAR Jar file to use --pyhail PYHAIL Pyhail zip file to use --add_scripts ADD_SCRIPTS Comma-separated list of additional python scripts to add.
Important usage notes
With this script you can either submit a hail python script (using
--script) or write hail python commands directly on the command line (using
--inline). When using
--inline, a variable
HailContextis accessible and the hail log is written to
/hail.log(see Log file for more details on how to retrieve it from the master).
Inline job example:
`pyhail --inline --cluster mycluster “print(hc.read(‘gs://mybucket/my.vds’).variant_schema)”
While this script will automatically fetch the HASH for the latest version of Hail and run with that
pyhail, you can also specify which HASH to use using the environment variable
$HAIL_HASH_LOCATION, or specify the exact
If you want to pass additional python files, you can either pass them using
--add_scripts, and using the environment variable
$HAIL_SCRIPTS(using both adds all scripts).
Using an interactive Jupyter notebook
Importing data into Hail and partitions
When importing data into Hail (e.g. from a VCF), the data will be stored in many small files called partitions. You can count the number of vds partitions with
gsutil -m ls -r gs://mybucket/my.vds | grep -c parquet$
or sparkinfo. When you run Hail on your data part of the time will be spent in I/O (that is reading the data in and writing data out) and part of the time will be spent computing on the data. The optimal partition size so that most of the time is spent computing and not in I/O varies depending on the system architecture and the complexity of the computations being run. The default size for a partition in Hail is 128Mb, which is optimal for Hadoop file system but not for the cloud. Indeed the cloud I/O is very different from Hadoop and will provide best performance for partitions of size of ~1Gb. But there is a catch: the number of partitions also determines the number of cores that can compute on your data simultaneously. This means that if you have a 10GB dataset and import it into Hail as 10 partitions of 1GB, you can only process your data with at most 10 cores in parallel. Depending on how complex the compute tasks are and size of your data you will therefore have to tailor the partition size (and number). Generally speaking, you want to have enough partitions so that you can compute with high parallelism, but not so many that per-partition task startup overhead starts to dominate the computation. A good place to start would be at least 1,000 partitions for large-ish data with genotypes (say 10GB - 1TB) but no more than 20,000 partitions (unless your data is huge).
Setting the minimum size of partitions can be done on import using the global
-b Hail option. e.g. to get 1GB partitions you would use
hail -b 1024 importvcf ...
When running Hail on the Google cloud, the log file (default
hail.log) is not written on your local machine but on the master node of your cluster. By default it will be put in an automatically generated temporary directory and it will be difficult to find. For this reason, it is highly recommended to specify the desired location of the
hail.log on the master node by using the
log option when creating the
HailContext object in your python script). You can then either log on the master node to view and/or copy the log, or use
gcloud compute copy-files to transfer the log to your local machine (see Logging onto the nodes of a cluster (SSH)).
If you are running VEP from Hail as part of your pipeline, you need to add the
gs://hail-common/vep/vep/vep81-init.sh for VEP v81) initialization script when starting your cluster.
For example, using the configuration described in Starting a cluster:
gcloud dataproc clusters create mycluster \ --zone us-central1-b \ --master-machine-type n1-highmem-8 \ --master-boot-disk-size 100 \ --num-workers 2 \ --worker-machine-type n1-highmem-8 \ --worker-boot-disk-size 75 \ --num-worker-local-ssds 1 \ --num-preemptible-workers 4 \ --image-version 1.1 \ --project myproject \ --properties "spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=45g,spark:spark.driver.maxResultSize=30g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1" --initialization-actions gs://hail-common/hail-init.sh,gs://hail-common/vep/vep/vep85-init.sh
You can then run VEP using
vep('/vep/vep-gcloud.properties'). Note that adding this initialization script will increase substantially the start time of your nodes.
Using the hadoop file system
The hadoop file system (HDFS) is a distributed file system accessible by all nodes in a cluster. It will run only on the regular worker nodes but be accessible from the preemptible nodes too (so if you use it, you might want to have a larger worker / preemptible node ratio). This file system will hold temp files written by Hail (e.g. when exporting worker data that needs to be concatenated into a single file) and you can also read/write from it. Because I/O on small files on local disk is orders of magnitude faster than from Google storage, it might be useful in situations where you have lots of small partitions. To read/write to HDFS from Hail, just prefix your in/output with
hdfs:/ (e.g. writing a file to
/hail/myfile.txt on HDFS would be
To copy data from/to HDFS you should use the hadoop
distcp command. This command will copy data in parallel over N threads (specified using the
-m option). You can either log onto the master node and run
hadoop distcp -m N <source> <destination> (where source and/or destination can be pointing to a file or location on HDFS, locally or the Google storage (using
gs://), or you can submit a hadoop distcp job to your cluster (here mycluster) using the following command:
gcloud dataproc jobs submit hadoop \ --cluster mycluster \ --jar file:///usr/lib/hadoop-mapreduce/hadoop-distcp.jar -m N <source> <destination>
The cluster HDFS will only be accessible while the cluster is running (all data there will be deleted once the cluster is deleted). So a typical scenario where you read/write HDFS would look like
- Start cluster
- Copy data from Google storage to HDFS
- Run Hail and write to HDFS
- Copy data from HDFS to Google storage
- Delete cluster
Monitoring your jobs
When you submit a job to the cluster, you will get a job ID. In the example below,
47c44997-1d3e-4045-a4ad-9eb14f64c742 is the job ID.
Job [47c44997-1d3e-4045-a4ad-9eb14f64c742] submitted. Waiting for job output...
Important: The Hail output that you see in your terminal after submitting a job might be terminated with a credential error. But don’t worry, the job is still running and you can check it both in the browser or with the terminal (see below).
Browser: go to Dataproc > Jobs and click on the job you want to check.
Terminal: use the
gcloud dataproc jobs command to list and monitor your jobs:
gcloud dataproc jobs listwill give you the list of jobs run in that project (pipe to
headif you just want to see the last ones)
gcloud dataproc jobs wait jobidwill give you the output of job jobid
Killing a job
If you want to kill a submitted job, you must use the Google platform;
Ctrl-c will not kill the job (although you will not see progress any more on your terminal) nor will the job stop if you get disconnected. You can either use the webUI or gsutil to kill a job:
Browser: Go to Dataproc > jobs, select the job and press
Terminal: use the
gcloud dataproc kill jobid command to kill job jobid. See Monitoring your jobs to get the job ID.
Monitoring and troubleshooting jobs and clusters
To facilitate troubleshooting Hail jobs and workers, access web interfaces for Spark and Hadoop following these steps (full documentation here):
gcloud compute ssh --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" my-masterwhere my-master is the name of your master. Typically clustername-m.
- Start Google chrome (or another browser, with similar configuration) with the following options (default Chrome path for MacOS:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome):
<Google Chrome executable path> \ --proxy-server="socks5://localhost:1080" \ --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \ --user-data-dir=/tmp
- Open the page
http://my-master:8088/, where my-master is the name of your master.
You can then navigate the different interfaces and get information about your cluster environment, the state of your jobs, the state of your workers, etc.
Another useful command is
gcloud dataproc clusters diagnose which collects useful logs and information about your cluster and writes th a tar archive in a temporary object on the Google storage.