Hail on Databricks with Spark Cluster

I am new to Hail. I have been trying to work through the Hail tutorials on Databricks using a Spark cluster. However, when I try to import Hail with from hail import *, I get the error ImportError: No module named hail. What am I doing wrong?

Hi,
Some background that will help:

Hail currently has two versions, 0.1 (the stable version, >1 year old) and 0.2 (a beta version, MUCH better!). We recommend all new users start with 0.2.

There’s a tutorial on Databricks, but it’s from 0.1 days. We haven’t yet updated it to 0.2, though that’s something we should do soon. Is it possible for you to run locally for now (on a laptop or local server)? There’s a prepackaged distribution available for download here.
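
To sketch what the local setup looks like (the archive and directory names here are placeholders, not the real ones; use whatever the download actually unpacks to):

unzip Hail-0.2-latest.zip            # hypothetical archive name
export HAIL_HOME=$(pwd)/hail         # the conda commands further down use $HAIL_HOME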

Thank you! Is it necessary to have Anaconda for Python 3? I have it for Python 2.

Hail 0.1 supported Python 2.7 (and NOT Python 3!). Hail 0.2 supports only Python 3.6 and later. Sorry to be so confusing and inconsistent!

You can use any version of Anaconda, though, because Anaconda 2 can install Python 3 into an environment and vice versa.

If you’re doing this:

conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
source activate hail

You should be fine!
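
Once the environment is active, a quick sanity check is to start Python and initialize Hail (hl.init() is the 0.2 entry point; it spins up a local Spark context):

import hail as hl
hl.init()  # if this runs without error, the install is working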

Sorry for all the questions but the tutorial you sent says I am supposed to use bash commands to set the path. Is it possible to use Windows or the Anaconda prompt?

Agh, we’ve tried to set it up on Windows in the past and failed.

Let’s try to make it work on Databricks. I’ve emailed one of our contacts at Databricks and asked him to respond here to answer a few questions:

  1. Can Databricks run Python 3.6? Hail 0.2 only supports Python 3.6 and later, due to important changes to keyword arguments (see the toy example after this list).
  2. If so, does Databricks have a Spark 2.2.0 / 2.2.1 deployment working with Py 3.6?
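
For context on point 1: Python 3.6 guarantees that **kwargs preserves the order in which keywords were passed (PEP 468), and Hail 0.2’s APIs lean on that ordering, for example in methods like annotate that take new fields as keyword arguments and keep them in order. A toy illustration:

def show_kwargs(**kwargs):
    # In Python 3.6+, kwargs preserves the order keywords were passed in (PEP 468);
    # earlier Pythons made no such guarantee.
    for name, value in kwargs.items():
        print(name, value)

show_kwargs(first=1, second=2, third=3)  # always prints first, second, third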

hi there!

Databricks does support Python 3.6 via Anaconda, and it should work with Spark 2.2.0/2.2.1 as Tim specified. Below are the steps to set it up. Please feel free to reach out to us if you run into any issues (yong_at_databricks).

Step 1
Download the Anaconda Python distribution from https://www.continuum.io/downloads:

%sh curl https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh -o /dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh

Step 2
You will want to make sure all of the packages that Databricks and PySpark include by default are replicated in Anaconda, so create a list of the current Python packages and save it to a DBFS location:

%sh  /databricks/python/bin/pip freeze > /tmp/python_packages.txt
%fs cp file:/tmp/python_packages.txt /tmp/python_packages.txt
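
If you want to confirm that the list actually landed in DBFS before moving on, dbutils can print the start of the file:

print(dbutils.fs.head("/tmp/python_packages.txt"))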

Step 3
Use an init script to change the default Python distribution.

dbutils.fs.put("dbfs:/databricks/init/testpython3.6/install_conda1.sh", """#!/bin/bash
sudo bash /dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh -b -p /anaconda3
mv /databricks/python /databricks/python_old
ln -s /anaconda3 /databricks/python
/databricks/python/bin/pip install -r /dbfs/tmp/python_packages.txt
""")

Now, after the cluster restarts, verify that Python was upgraded and is available from a notebook.

Check with

%python
import sys  # sys must be imported before it can be referenced
print(sys.version)

Are these Windows Command Prompt calls?

The things in the last post like

%sh curl ...

are designed to be entered into a Databricks notebook, I believe. The % indicates an “IPython magic”, which does something special in a Python notebook setting (%sh runs shell commands, %fs interacts with the Databricks file system, etc.).
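
For example, the %fs copy from Step 2 could equivalently be written in plain Python using dbutils, the utility object Databricks exposes in every notebook:

# Same effect as: %fs cp file:/tmp/python_packages.txt /tmp/python_packages.txt
dbutils.fs.cp("file:/tmp/python_packages.txt", "dbfs:/tmp/python_packages.txt")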


I ran the code, restarted the cluster, and then checked to see if Python was upgraded, and I got an error:

name 'sys' is not defined

Is this because it is Anaconda for Linux and I am running Windows?

Also, I have been trying to use the hail-deployment notebook from the Hail tutorial, and it is asking me to “Download Hail’s Java and Python libraries built with the latest stable version of Hail for Spark 2.1.1 and Scala 2.11.” I cannot find these files to download: when I click on the links, I am sent to a Google Doc that asks me to request permission to access it. I did so using the following emails: nede1@student.pvamu.edu and roaringlemurs@gmail.com.

I’ve been on vacation, sorry to leave this hanging!

I think the code @yong posted is meant to be run inside a Databricks notebook, not locally on your laptop. Could you try that? The name 'sys' is not defined error just means that cell needs import sys before print(sys.version). Also, which Hail deployment notebook are you looking at? Can you post the link?