Using Hail on the Google Cloud Platform

This Post is Old and No Longer Works

Please install hail and follow the Hail on the Cloud instructions.





Prerequisites

To use Hail on the Google cloud, you will need:

  1. A Google account linked to a project with billing enabled. Broad users only: ask IT (help@broadinstitute.org) to create the project and provide the cost object that should be linked to that account.
  2. Permissions to run Dataproc in that project (see Permissions).
  3. The Dataproc API enabled: go to API Manager within your project and enable the Dataproc API.
  4. Sufficient quotas for the cluster you want to start.
  5. Correct permissions to read (and probably write) the data you will use (see Permissions).
  6. gsutil downloaded and installed (https://cloud.google.com/storage/docs/gsutil): a set of command-line tools for interacting with Google Cloud storage.

The Google storage system

The Google storage system will likely hold most of your data. Data on the Google cloud is stored in buckets. Your files are stored as objects in a bucket in a flat namespace (there are no directories); you can, however, use / in your object names and they will be interpreted as if they were directories. To view or create buckets, you can go to your project webUI and click on Storage in the menu. Buckets are associated with a storage class, and all data stored in a bucket will be in that storage class. Different storage classes have very different prices (see https://cloud.google.com/storage/docs/storage-classes). For data analytics applications like Hail, the regional storage class is likely the right choice.

[screenshot]

You can then navigate your files and edit permissions (see below). You can also use gsutil to create buckets or view their contents; e.g., to list the files in mybucket, use gsutil ls gs://mybucket/. Google Storage URLs start with gs://, and most gsutil commands are named after their unix counterparts. Note that a useful option available for most gsutil commands is -m, which runs the command in parallel.
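For example (the bucket and object names are placeholders):

# list the objects in a bucket
gsutil ls gs://mybucket/
# copy a local directory to the bucket in parallel
gsutil -m cp -r mydata gs://mybucket/mydata/
# show the total size of everything under a prefix
gsutil du -sh gs://mybucket/mydata/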

Permissions

Permissions on the Google cloud are complex and largely beyond the scope of this guide. This section assembles a few notes to get users going, but it is far from complete and may omit information that is important to your project, so consult the Google documentation. That said, to run Hail you will need permissions to run and bill Dataproc on the project, and your service account (see below) will need permissions to read the data you want to access and to write wherever you want to write using Hail.

Project level permissions

Each user is assigned one or more roles that grant the user permissions on resources (e.g. compute, storage, etc.) in the project. If you are a project owner, you can edit roles for other users in the IAM & Admin section of your project webUI (see also: https://cloud.google.com/iam/docs/quickstart).

[screenshot]

Bucket level permissions

Permissions and default object permissions (i.e., those given by default to objects created in the bucket) can be set for each bucket; each can be set to Reader, Writer, or Owner. These permissions can be set either in the webUI, by clicking the ... on the left side of the bucket name

[screenshot]

or using gsutil acl set (documentation here). Note that by default, project Readers, Writers, and Owners have Reader, Owner, and Owner access to the buckets in the project, respectively.

Object level permissions

Object-level access can also be set to Reader, Writer, or Owner and changed just as with the bucket. You can also make a file public by selecting Public link.

[screenshot]

Because there are no directories (see The Google storage system), it can be tedious to change permissions using the webUI. Fortunately, gsutil can change permissions as if there were directories using gsutil -m acl ch -r. For example, to grant john.doe@gmail.com Reader permission on all objects under gs://mybucket/myvirtualdir/, you can run gsutil -m acl ch -r -u john.doe@gmail.com:R gs://mybucket/myvirtualdir. Note the -m option, which runs the operation in parallel (see also the sketch below for setting default object permissions).
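A minimal sketch combining object and default-object permissions (the email address and bucket name are placeholders):

# grant Reader on all existing objects under the virtual directory
gsutil -m acl ch -r -u john.doe@gmail.com:R gs://mybucket/myvirtualdir
# also grant Reader by default on new objects created in the bucket
gsutil defacl ch -u john.doe@gmail.com:R gs://mybucket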

Service accounts

The virtual machines used with Dataproc (and other Google compute services) do not run with your user credentials; instead they run using credentials for a service account. Every project has a default service account, which will be used when running Dataproc. This and other service account addresses can be found in the IAM & Admin menu in your project.

[screenshot]

In order to process data with Hail using Dataproc, the service account needs to have the proper permissions to read/write data where necessary. Caution: service accounts are NOT personal and can thus be used by multiple users in the project (depending on their roles and permissions), so granting a service account access to data can potentially give other people in that project access to your data.

Dataproc cluster configuration

Cluster architecture in a nutshell

A Dataproc cluster uses three classes of machines:

  1. Master node (n=1): a single machine that assigns and synchronizes tasks on the workers and processes some results.
  2. Worker nodes (n>=2): these machines are dedicated workers that are reserved for your task and will process the data. Regular worker nodes are about 5x more expensive than preemptible nodes, so it usually makes sense to minimize the number of regular workers and have most of the workforce consist of preemptible workers. Worker nodes can mount local SSDs, on which the Hadoop file system is placed. This provides fast storage accessible from all nodes and can be useful when producing temp files or in situations where you need fast I/O.
  3. Preemptible nodes: these machines usually constitute the main workforce of your cluster. They will automatically be of the same machine type as the regular workers, but (1) their CPUs aren't reserved for you (you will still get the vast majority of them), (2) they cannot mount local SSDs (but have access to those mounted on the regular worker nodes), and (3) they will not be assigned more than 100GB of disk (even if the worker nodes have more) unless you use the additional flag --preemptible-worker-boot-disk-size.

Machine types and pricing

There are several machine types available, with varying numbers of cores (CPUs) and memory per core. All prices and details can be found here: https://cloud.google.com/compute/pricing. In a nutshell, you can choose machines with 1, 2, 4, 8, 16 or 32 cores and 900MB (highcpu), 3.75GB (standard) or 6.5GB (highmem) of RAM per core. The price and performance scale approximately linearly with the number of CPUs, but 8- or 16-CPU machines may provide the best price/performance tradeoff. We recommend using highmem machines, as many operations benefit from additional memory, but standard often suffices. The hard disk size can also be configured. Note that all machine types are charged for a minimum of 10 minutes and then in 1-minute increments.

Recommended configuration

Note: depending on the size of your data and the operations you are performing, this might not be the ideal configuration for your job.

  • Master: n1-highmem-8, 100GB primary disk space. This should provide ample RAM and cores in case some results need to be processed locally. Because it's only one machine, the cost of having a generous buffer of resources is not high.
  • Worker nodes: 2 x n1-highmem-8, 75GB primary disk space. Using the minimum of 2 workers lowers cost. 75GB primary disk space is fine unless you're triggering a shuffle operation (e.g. repartition), in which case you might need more depending on the overall size of your data.
  • Preemptible nodes: These will have the same specs (minus optional SSDs and local disks larger than 100GB) as the worker nodes. The number of preemptible workers depends on how large your job is and how fast you need the results. A good rule of thumb is to use no more than nPartitions / (4 * x) preemptible workers, where x is the number of cores per worker: this way a single core processes on average at least 4 partitions, which smooths out differences in the time it takes to process each partition. The absolute maximum number of cores you should use is nPartitions, but this is a very inefficient use of resources since each stage will take as long as the worst-case partition; if you want to use more cores, consider repartitioning to more partitions first, though beware of making partitions so small that processing is dominated by per-partition task startup overhead. In general, as long as there are at least a few partitions per core and each partition takes a few core-minutes to process, using more cores will complete the job proportionally faster at only a slightly higher cost and is probably worth it! You can count the number of partitions in the vds using gsutil -m ls -r gs://mybucket/my.vds | grep -c parquet$ or with sparkinfo (see the worked example after this list). See also the further discussion on partitioning below.
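As a worked example (a sketch; the bucket name is a placeholder and the partition count is hypothetical), the rule of thumb can be computed directly from a VDS stored in Google Storage:

# count the partitions in the VDS (suppose this prints 10000)
N_PARTITIONS=$(gsutil -m ls -r gs://mybucket/my.vds | grep -c 'parquet$')
# n1-highmem-8 workers have 8 cores each
CORES_PER_WORKER=8
# suggested cap on preemptible workers: nPartitions / (4 * cores per worker) = 10000 / 32 = 312
echo $(( N_PARTITIONS / (4 * CORES_PER_WORKER) ))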

Important note about cluster size:
Clusters can be resized while they are running, and your jobs will dynamically allocate the additional CPUs to partitions that have yet to be processed. Similarly, you can remove nodes and Hail will scale down appropriately (note that if you remove nodes that are in use, the partitions they were processing will be re-processed from scratch, which can lead to instability). A good strategy to maximize cluster use is the following:

  1. Start a minimal cluster with 2 worker nodes and a few preemptible nodes (0-10)
  2. Start your job and make sure that it is running properly (no typos, no errors in logic, etc.)
  3. Once it's running smoothly, resize the cluster with the appropriate number of nodes (see section Resizing a cluster)
  4. Once the job is done, resize the cluster to 2 workers and 0 preemptible nodes
  5. Check that everything went well and download the log and other files from HDFS or the master file system if desired.
  6. Delete the cluster.

Starting a cluster

Starting a cluster can be done either from a web browser or using gcloud. Creating the cluster with gcloud allows passing more options than the browser does. In particular, you can pass Spark and Hadoop properties to configure your cluster (see below). A cluster runs in a specific geographical zone; the zone you pick depends on the location of your data (the data needs to be accessible in that region, see The Google storage system). Finally, you need to name your cluster; you will need to provide the cluster name when submitting a job.

Below is a suggested configuration for cluster mycluster in project myproject (see also Machine types and Recommended configuration for a detailed description). Important: you need to set the Dataproc image version to 1.1 using the --image-version 1.1 option.

gcloud dataproc clusters create mycluster \
  --zone us-central1-b \
  --master-machine-type n1-highmem-8 \
  --master-boot-disk-size 100 \
  --num-workers 2 \
  --worker-machine-type n1-highmem-8 \
  --worker-boot-disk-size 75 \
  --num-worker-local-ssds 1 \
  --num-preemptible-workers 4 \
  --image-version 1.1 \
  --project myproject \
  --properties "spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=45g,spark:spark.driver.maxResultSize=30g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1" \
  --initialization-actions gs://hail-common/hail-init.sh

Important notes:

  • Initialization actions
    Initialization actions are scripts that are run on the machines of a cluster when they start. They can, for example, be used to load data locally onto the machines or to install software/packages. Hail requires the python decorator package for its python interface, so the initialization action gs://hail-common/hail-init.sh, which downloads and installs this package, is required. When running VEP, another initialization action is required (see Running VEP). Note that initialization actions can also be specified when starting a cluster from the web interface.

  • Properties
    You can set Spark, Hadoop and Yarn configuration by using the --properties option. Below are descriptions of recommended properties to use (for more detail on all the Spark configuration options, see https://spark.apache.org/docs/1.6.1/configuration.html).

  • spark:spark.driver.extraJavaOptions=-Xss4M and spark:spark.executor.extraJavaOptions=-Xss4M => these increase the Java stack size, addressing the issue discussed in this post.

  • spark.driver.memory=45g => This allocates memory for Spark on the master node. It should not exceed the memory of the machine, so if you're using a machine type other than n1-highmem-8 you need to adjust this value.

  • spark.driver.maxResultSize=30g => Maximum size of results. Shouldn't be larger than spark.driver.memory.

  • spark.kryoserializer.buffer.max=1g => This increases the maximum buffer size when serializing objects

  • hdfs:dfs.replication=1 => This sets the Hadoop replication factor. Hadoop is a distributed file system that runs on the regular worker nodes but is accessible to all nodes (master, regular workers and preemptible workers). Data in Hadoop can be stored multiple times to increase performance (due to data locality) and reliability. In most cases when running Hail on the Google cloud, data is read/written to Google Storage and not to HDFS. Unless you specifically want to use HDFS for your analyses, only temp data from Hail will be written there, and thus a small replication factor will allow you to use smaller (and therefore cheaper) work disks.

Resizing a cluster

Browser: select the cluster name in Dataproc. Then go to the Configuration tab and click on Edit to change the number of workers and/or preemptible workers. Click on Save to apply the changes.

[screenshot]

Terminal: use the gcloud dataproc clusters update command. For example, to change the cluster mycluster to 5 workers and 200 preemptible workers: gcloud dataproc clusters update --num-workers 5 --num-preemptible-workers 200 mycluster.
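For example, combined with the strategy described above (cluster name as in the earlier examples), you would scale up once the job is running smoothly and scale back down when it is done:

# scale up for the main compute
gcloud dataproc clusters update mycluster --num-preemptible-workers 200
# scale back down once the job is done
gcloud dataproc clusters update mycluster --num-workers 2 --num-preemptible-workers 0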

Deleting a cluster

Browser:

[screenshot]

Terminal: use the gcloud dataproc clusters delete command. For example, to delete mycluster: gcloud dataproc clusters delete mycluster.

Logging onto the nodes of a cluster (SSH)

You can log onto any of the nodes of your cluster, which may be useful to look at node health and logs, or to interact with HDFS or the master file system.

Browser: select the cluster name in Dataproc, go to the VM Instances tab, click on the node you want to log in to, and then click on SSH.

[screenshot]

Terminal: To log onto node mynode, use gcloud compute ssh mynode. The master is always named after your cluster name, postfixed with '-m'. For example, if your cluster is mycluster, the master will be mycluster-m. You can also copy files (equivalent to scp) using gcloud compute copy-files <source> <destination>, where either the source or the destination can be a node.
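For example (assuming the cluster from the previous sections, and a hypothetical log location /home/hail/hail.log on the master):

# open an SSH session on the master node
gcloud compute ssh mycluster-m
# copy the Hail log from the master to the current local directory
gcloud compute copy-files mycluster-m:/home/hail/hail.log .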

Quotas

You can check the relevant quotas in IAM and Admin.

[screenshot]

Relevant numbers are:

  • CPUs => the maximum number of dedicated (i.e., master and worker) cores you can ask for
  • Preemptible CPUs => how many preemptible cores you can ask for

If these limits are too low for the work you need to do, there is a button to request a quota increase on the right of each item. The procedure to increase the quota can take a couple of days.

Running a Hail job

Finding the Hail files on the cloud

Hail jars are continually deployed on the Google cloud in gs://hail-common/. Each release is assigned a hash value, and the hash value for the latest hail release is always written to gs://hail-common/builds/0.1/latest-hash-spark-2.0.2.txt, which can be read using gsutil cat gs://hail-common/builds/0.1/latest-hash-spark-2.0.2.txt.
Given a hash value $HASH (see the snippet after this list for fetching it), the corresponding Hail files are:

  1. The Hail jar: gs://hail-common/builds/0.1/jars/hail-0.1-$HASH-Spark-2.0.2.jar
  2. The Hail python lib: gs://hail-common/builds/0.1/python/hail-0.1-$HASH.zip
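For example, a small sketch that fetches the latest hash and constructs these two paths (the shell variable names are just for illustration):

HASH=$(gsutil cat gs://hail-common/builds/0.1/latest-hash-spark-2.0.2.txt)
JAR=gs://hail-common/builds/0.1/jars/hail-0.1-$HASH-Spark-2.0.2.jar
PYFILES=gs://hail-common/builds/0.1/python/hail-0.1-$HASH.zip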

Submitting a Hail job

To submit a Hail job to a Dataproc cluster, you need a python script that uses Hail. For example, the following script counts variants and samples:

from hail import *
hc = HailContext()
hc.read('gs://mybucket/myvds.vds').count()

Then use the gcloud dataproc jobs submit pyspark command.
For example, to run the script above, named myhailscript.py, on cluster mycluster with the jar gs://hail-common/builds/0.1/jars/hail-0.1-$HASH-Spark-2.0.2.jar and the python lib gs://hail-common/builds/0.1/python/hail-0.1-$HASH.zip:

gcloud dataproc jobs submit pyspark \
  --cluster=mycluster \
  --files=gs://hail-common/builds/0.1/jars/hail-0.1-$HASH-Spark-2.0.2.jar \
  --py-files=gs://hail-common/builds/0.1/python/hail-0.1-$HASH.zip \
  --properties="spark.driver.extraClassPath=./hail-0.1-$HASH-Spark-2.0.2.jar,spark.executor.extraClassPath=./hail-0.1-$HASH-Spark-2.0.2.jar" \
  myhailscript.py

A simple hail submission script that submits jobs using the latest version of Hail with minimal effort (usage: pyhail-submit <cluster> <py-files>) can be downloaded at gs://hail-common/pyhail-submit.

In addition, a newer, more complete python script to start hail jobs can be downloaded at gs://hail-common/pyhail.py. Usage:

pyhail --help
usage: pyhail [-h] [--script SCRIPT] [--inline INLINE] --cluster CLUSTER
              [--jar JAR] [--pyhail PYHAIL] [--add_scripts ADD_SCRIPTS]

optional arguments:
  -h, --help            show this help message and exit
  --script SCRIPT, --input SCRIPT, -i SCRIPT Script to run
  --inline INLINE       Inline script to run
  --cluster CLUSTER     Which cluster to run on
  --jar JAR             Jar file to use
  --pyhail PYHAIL       Pyhail zip file to use
  --add_scripts ADD_SCRIPTS Comma-separated list of additional python scripts to add.

Important usage notes

  • With this script you can either submit a hail python script (using --script) or write hail python commands directly on the command line (using --inline). When using --inline, a variable hc of type HailContext is accessible and the hail log is written to /hail.log (see Log file for more details on how to retrieve it from the master).
    Inline job example:
    pyhail --cluster mycluster --inline "print(hc.read('gs://mybucket/my.vds').variant_schema)"

  • While this script will automatically fetch the HASH for the latest version of Hail and run with that jar and pyhail, you can also specify which HASH to use using the environment variable $HAIL_HASH_LOCATION, or specify the exact jar and pyhail files using --jar and --pyhail respectively.

  • If you want to pass additional python files, you can pass them using --add_scripts or the environment variable $HAIL_SCRIPTS (using both adds all scripts).

Using an interactive Jupyter notebook

Rather than submitting a static Python script, you can use Hail on the cloud interactively through a Jupyter notebook. See this forum post to learn more.

Importing data into Hail and partitions

When importing data into Hail (e.g. from a VCF), the data will be stored in many small files called partitions. You can count the number of vds partitions with

gsutil -m ls -r gs://mybucket/my.vds | grep -c parquet$

or sparkinfo. When you run Hail on your data, part of the time will be spent on I/O (reading the data in and writing data out) and part of the time will be spent computing on the data. The optimal partition size, such that most of the time is spent computing rather than on I/O, varies depending on the system architecture and the complexity of the computations being run. The default partition size in Hail is 128MB, which is optimal for the Hadoop file system but not for the cloud: cloud I/O behaves very differently from HDFS and performs best with partitions of ~1GB. But there is a catch: the number of partitions also determines the number of cores that can compute on your data simultaneously. This means that if you have a 10GB dataset and import it into Hail as 10 partitions of 1GB, you can only process your data with at most 10 cores in parallel. Depending on how complex the compute tasks are and how large your data is, you will therefore have to tailor the partition size (and number). Generally speaking, you want enough partitions to compute with high parallelism, but not so many that per-partition task startup overhead starts to dominate the computation. A good place to start is at least 1,000 partitions for large-ish data with genotypes (say 10GB - 1TB) but no more than 20,000 partitions (unless your data is huge).
Setting the minimum partition size can be done at import time using the global -b Hail option; e.g., to get 1GB partitions you would use hail -b 1024 importvcf ... (a sketch of the python-interface equivalent is shown below).
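A rough equivalent in the python interface (a sketch, assuming HailContext exposes a min_block_size parameter in MB; the bucket paths are placeholders):

from hail import *
# request ~1GB minimum partition size when importing
hc = HailContext(min_block_size=1024)
hc.import_vcf('gs://mybucket/my.vcf.bgz').write('gs://mybucket/my.vds')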

Log file

When running Hail on the Google cloud, the log file (hail.log by default) is not written to your local machine but to the master node of your cluster. By default it will be put in an automatically generated temporary directory and will be difficult to find. For this reason, it is highly recommended to specify the desired location of hail.log on the master node using the log option when creating the HailContext object in your python script. You can then either log onto the master node to view and/or copy the log, or use gcloud compute copy-files to transfer the log to your local machine (see Logging onto the nodes of a cluster (SSH)).
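For example (a sketch; the path is just a hypothetical, easy-to-find location on the master):

from hail import *
# write the Hail log to a fixed location on the master node
hc = HailContext(log='/home/hail/hail.log')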

Running VEP

If you are running VEP from Hail as part of your pipeline, you need to add the gs://hail-common/vep/vep/vep85-init.sh (or gs://hail-common/vep/vep/vep81-init.sh for VEP v81) initialization script when starting your cluster.
For example, using the configuration described in Starting a cluster:

gcloud dataproc clusters create mycluster \
  --zone us-central1-b \
  --master-machine-type n1-highmem-8 \
  --master-boot-disk-size 100 \
  --num-workers 2 \
  --worker-machine-type n1-highmem-8 \
  --worker-boot-disk-size 75 \
  --num-worker-local-ssds 1 \
  --num-preemptible-workers 4 \
  --image-version 1.1 \
  --project myproject \
  --properties "spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=45g,spark:spark.driver.maxResultSize=30g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1" \
  --initialization-actions gs://hail-common/hail-init.sh,gs://hail-common/vep/vep/vep85-init.sh

You can then run VEP using vep('/vep/vep-gcloud.properties'). Note that adding this initialization script will substantially increase the startup time of your nodes.
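A minimal sketch of a pipeline step using it (the bucket paths are placeholders):

from hail import *
hc = HailContext()
# annotate the dataset with VEP and write the annotated VDS back to Google Storage
hc.read('gs://mybucket/my.vds').vep('/vep/vep-gcloud.properties').write('gs://mybucket/my.vep.vds')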

Using the hadoop file system

The Hadoop file system (HDFS) is a distributed file system accessible by all nodes in a cluster. It runs only on the regular worker nodes but is accessible from the preemptible nodes too (so if you use it, you might want a larger ratio of regular to preemptible nodes). This file system holds temp files written by Hail (e.g. when exporting data from workers that needs to be concatenated into a single file), and you can also read/write from it directly. Because I/O on small files is orders of magnitude faster on local disk than on Google Storage, it can be useful in situations where you have lots of small partitions. To read/write to HDFS from Hail, just prefix your in/output with hdfs:/ (e.g. writing a file to /hail/myfile.txt on HDFS would be hdfs:/hail/myfile.txt).
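For example (a sketch; paths are placeholders), writing an intermediate dataset to HDFS and reading it back later in the same pipeline:

from hail import *
hc = HailContext()
# write a temporary copy of the dataset to fast cluster-local storage
hc.read('gs://mybucket/my.vds').write('hdfs:/hail/tmp.vds')
# read it back in a later step
vds = hc.read('hdfs:/hail/tmp.vds')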

To copy data from/to HDFS, you should use the hadoop distcp command, which copies data in parallel using up to N map tasks (specified with the -m option). You can either log onto the master node and run hadoop distcp -m N <source> <destination> (where the source and/or destination can point to a file or location on HDFS, on the local file system, or on Google Storage using gs://), or you can submit a hadoop distcp job to your cluster (here mycluster) using the following command:

gcloud dataproc jobs submit hadoop \
  --cluster mycluster \
  --jar file:///usr/lib/hadoop-mapreduce/hadoop-distcp.jar -m N <source> <destination>

Important:
The cluster's HDFS is only accessible while the cluster is running (all data there will be deleted once the cluster is deleted), so a typical scenario where you read/write HDFS looks like this:

  1. Start cluster
  2. Copy data from Google storage to HDFS
  3. Run Hail and write to HDFS
  4. Copy data from HDFS to Google storage
  5. Delete cluster

Monitoring your jobs

When you submit a job to the cluster, you will get a job ID. In the example below, 47c44997-1d3e-4045-a4ad-9eb14f64c742 is the job ID.

Job [47c44997-1d3e-4045-a4ad-9eb14f64c742] submitted.
Waiting for job output...

Important: The Hail output that you see in your terminal after submitting a job might be terminated with a credential error. But don't worry, the job is still running and you can check it either in the browser or from the terminal (see below).

Browser: go to Dataproc > Jobs and click on the job you want to check.

[screenshot]

Terminal: use the gcloud dataproc jobs command to list and monitor your jobs:

  • gcloud dataproc jobs list will give you the list of jobs run in that project (pipe to head if you just want to see the last ones)
  • gcloud dataproc jobs wait jobid will give you the output of job jobid (see the example below)
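For example, using the job ID from the example above:

# list jobs run in the project (pipe to head to see only a few)
gcloud dataproc jobs list | head
# stream the output of a specific job
gcloud dataproc jobs wait 47c44997-1d3e-4045-a4ad-9eb14f64c742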

Killing a job

If you want to kill a submitted job, you must use the Google platform; Ctrl-C will not kill the job (although you will no longer see progress in your terminal), nor will the job stop if you get disconnected. You can use either the webUI or gcloud to kill a job:

Browser: Go to Dataproc > jobs, select the job and press STOP.

[screenshot]

Terminal: use the gcloud dataproc jobs kill jobid command to kill job jobid. See Monitoring your jobs to get the job ID.

Monitoring and troubleshooting jobs and clusters

To facilitate troubleshooting Hail jobs and workers, access web interfaces for Spark and Hadoop following these steps (full documentation here):

  1. Run gcloud compute ssh --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" my-master where my-master is the name of your master. Typically clustername-m.
  2. Start Google chrome (or another browser, with similar configuration) with the following options (default Chrome path for MacOS: /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome):
<Google Chrome executable path> \
  --proxy-server="socks5://localhost:1080" \
  --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
  --user-data-dir=/tmp
  3. Open the page http://my-master:8088/, where my-master is the name of your master.

You can then navigate the different interfaces and get information about your cluster environment, the state of your jobs, the state of your workers, etc.

Another useful command is gcloud dataproc clusters diagnose, which collects useful logs and information about your cluster and writes them to a tar archive in a temporary object on Google Storage.


Hey guys,
Thanks for taking the time to let everyone know how to set up on the cloud, much appreciated!!

Two points:

  1. My personal gmail account can't access gs://hail-common/hail-origin-master-all-last.jar or gs://gnomad-broad/tools/vep/vep85-init.sh. Can you please change the permissions so that others outside the Broad Institute can also access them?

  2. For this tutorial, you need to be explicit about the Dataproc image version. The default version is 1.1, which corresponds to Spark 2.0.1, and the current Hail version does not work with it. With image version 1.0 some commands work for me, but splitmulti does not. Image version 0.2 corresponds to Spark 1.5.2, the version we use locally, but I'm not sure how long they will keep supporting it.

https://cloud.google.com/dataproc/docs/concepts/dataproc-versions

Thanks!

Another thank you for the directions!

I also had trouble accessing gs://hail-common/hail-origin-master-all-last.jar as a non-Broad affiliate, got 403 Forbidden.

I spoke to Tim on Gitter and we found that hail-origin-master-all-last.jar is one of a few files in the hail-common bucket which aren't public; he steered me to another pre-built shared copy that I could access, no problem:

gs://hail-common/hail-hail-is-master-all-spark2.0.2-b4659fa.jar

The latest hash to construct the tail end of this filename is apparently at:

https://storage.googleapis.com/hail-common/latest-hash.txt.

I also encountered some run issues depending on Dataproc cluster version. Attempting to run on 1.0 I get:

java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

I believe this is caused by a difference between the Scala version used to build Hail and what is available on Google Dataproc. I haven't tested extensively yet, but Tim told me that image version 1.1 is now required; I found that the preview image (as of 12/9/2016) also works (cursorily).

The primary difference between dataproc images 1.0 and 1.1 is the version of spark: 1.0 uses Spark 1.6.2, while Dataproc 1.1 includes Spark 2.0.2. We have some code that relies on Spark interfaces that changed in 2.0, so we would not expect a jar compiled for Spark 1.6.2 to work on Dataproc 1.1, or a jar compiled for Spark 2.0.2 (the default now) to work on Dataproc 1.0.

Thanks for investigating this! We'll keep you updated as we think about better ways to keep a "latest" jar.

Useful write-up - I am an independent consultant and Google Cloud Developer Expert, and would be happy to answer any questions about GCP services as they relate to running Hail on GCP.

Just a quick line to mention that I've updated this post to reflect:

  1. The change to the python interface
  2. The move to Spark 2 (image version 1.1)
  3. The best way to find the latest Hail jar
  4. Move of the VEP resources to gs://hail-common/vep/

And it was time for another update!

  • Added the required initialization script for decorator
  • Logging now reflects the python interface
  • New much better python hail submit script
  • Advice on storage class
  • General small changes / fixes

@lfrancioli, can you change the recommended zone to us-central1-b? The prebuilt JAR that we provide in Google Storage requires a Haswell or Broadwell CPU to run ibd. These processors are available in all of us-west, us-east, and asia-northwest, as well as europe-west1-d, us-central1-b, and us-central1-c [1].

[1] https://cloud.google.com/compute/docs/regions-zones/regions-zones

Could you add numpy to the dependencies listed here as well? I know that everyone is likely to have it anyway, but I ran into this problem when constructing a bare-bones Docker image.

@Joel_Thibault, as far as I know, vanilla Hail does not require the numpy package. All the matrices that Hail works with are distributed Spark matrices and we don't yet support easy interaction with local python matrices. Do you recall what you executed that required numpy?

I believe the tutorial requires numpy for some of the plotting but, AFAICT, running the tutorial does not require installing numpy on the Dataproc worker nodes.

In case it helps, I have also run into a numpy-related error when I tried to import Hail interactively on a Broad VM.

After upgrading to 53e9d33 I saw this immediately after running from hail import * in the pyspark shell. I did not see it on the version I was using earlier, fcb6cf1.

Thanks for the feedback @Joel_Thibault and @kaan, we're looking into this on the backend. I'll try to eliminate this dependency and if I fail, I'll ensure it's noted as a requirement.

@Joel_Thibault & @kaan, we were inadvertently pulling numpy via a Spark dependency that was unnecessarily module-level. We should have a fix into master in the very near future. Thanks for notifying us!

EDIT (2017-06-07 16:50 EST): Fix landed just now: https://github.com/hail-is/hail/pull/1906#pullrequestreview-42730318


These are the only two hashes currently available (2017-08-28, 3:44am EST).
gs://hail-common/builds/0.1/latest-hash-spark-2.0.2.txt
gs://hail-common/builds/0.1/latest-hash-spark-2.1.0.txt

Yeah, that's right. These "latest-hash" files contain the hash of the most recent version built, and the jars and python files can be found in a subdirectory:

$ gsutil ls gs://hail-common/builds/0.1
gs://hail-common/builds/0.1/latest-hash-spark-2.0.2.txt
gs://hail-common/builds/0.1/latest-hash-spark-2.1.0.txt
gs://hail-common/builds/0.1/jars/
gs://hail-common/builds/0.1/python/

Hey Tim, hypotheses is pointing out that the url given in the article isn't the right one: there's a missing hyphen.

Ah, missed that. Will edit the post now.

As an aside, I think we need a better system to show when info in posts like this is out of date.