Hail on Databricks with Spark Cluster

nnekaede · May 29, 2018, 5:49pm

I am new to Hail. I have been trying to use the Hail tutorials on Databricks using the Spark cluster. However, when I try to import hail with import hail as *, I get the error ImportError: No module named hail. What am I doing wrong?

tpoterba · May 29, 2018, 6:02pm

Hi,
Some background that will help:

Hail currently has two versions, 0.1 (the stable version, >1 year old) and 0.2 (a beta version, MUCH better!). We recommend all new users start with 0.2.

There’s a tutorial on Databricks, but it’s from 0.1 days. We haven’t yet updated it to 0.2, though that’s something we should do soon. Is it possible for you to run locally for now (on a laptop or local server)? There’s a prepackaged distribution available for download here.
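As a quick diagnostic for the ImportError above: "No module named hail" means the interpreter running the notebook cannot find a hail package on its import path. A minimal sketch, assuming a Python 3 interpreter (the helper name `module_available` is made up for illustration):

```python
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module called `name` is on the import path."""
    return importlib.util.find_spec(name) is not None


# Show which interpreter is running, then whether it can see hail.
print(sys.executable)
print(module_available("hail"))
```

If this prints False, the package is not installed for that interpreter, so the fix is on the installation side rather than in the import statement.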

nnekaede · May 29, 2018, 6:33pm

Thank you! Is it necessary to have Anaconda for Python 3? I currently have Anaconda for Python 2.

tpoterba · May 29, 2018, 6:34pm

Hail 0.1 supported Python 2.7 (and NOT Python 3!). Hail 0.2 supports only Python 3.6 and later. Sorry to be so confusing and inconsistent!

You can use any version of Anaconda, though, because Anaconda 2 can install Python 3 into an environment and vice versa.

If you’re doing this:

conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
source activate hail

You should be fine!

nnekaede · May 29, 2018, 6:46pm

Sorry for all the questions but the tutorial you sent says I am supposed to use bash commands to set the path. Is it possible to use Windows or the Anaconda prompt?

tpoterba · May 29, 2018, 7:03pm

Agh, we’ve tried to set it up on Windows in the past and failed.

Let’s try to make it work on Databricks. I’ve emailed one of our contacts at Databricks and asked him to respond here to answer a few questions –

Can Databricks run Python 3.6? Hail 0.2 only supports Python 3.6 and later, due to important changes to keyword arguments.
If so, does Databricks have a Spark 2.2.0 / 2.2.1 deployment working with Python 3.6?

yong · May 30, 2018, 6:26pm

hi there!

Databricks does support Python 3.6 via Anaconda. It should work with Spark 2.2.0/2.2.1 as Tim specified. Below are the steps to set it up. Please feel free to reach out to us if you run into any issues (yong_at_databricks)

Step 1
Download the Anaconda Python distribution from https://www.continuum.io/downloads:

%sh curl -o /dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh

Step 2
You will want all of the packages that Databricks includes by default for PySpark to be replicated in Anaconda, so create a list of the installed Python packages and save it to a DBFS location:

%sh  /databricks/python/bin/pip freeze > /tmp/python_packages.txt
%fs cp file:/tmp/python_packages.txt /tmp/python_packages.txt
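Once the new Python is in place, the saved list can be compared against a fresh `pip freeze` to see whether anything was dropped. A minimal sketch, assuming both lists are available as text; the helper names and the sample package pins are illustrative, not taken from a real cluster:

```python
def parse_freeze(text):
    """Parse `pip freeze` output into a {package_name: version} dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with '=='.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def missing_packages(expected, installed):
    """Return sorted names present in `expected` but absent from `installed`."""
    return sorted(set(expected) - set(installed))


# Illustrative inputs only: the saved list vs. the list after the switch.
saved_list = "numpy==1.14.0\npandas==0.22.0\n# a comment line\n"
current_list = "numpy==1.14.0\n"
print(missing_packages(parse_freeze(saved_list), parse_freeze(current_list)))
```

In practice the two strings would be read from the saved python_packages.txt and from the output of pip freeze in the new environment.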

Step 3
Use an init script to change the default Python distribution.

dbutils.fs.put("dbfs:/databricks/init/testpython3.6/install_conda1.sh", """#!/bin/bash
sudo bash /dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh -b -p /anaconda3
mv /databricks/python /databricks/python_old
ln -s /anaconda3 /databricks/python
/databricks/python/bin/pip install -r /dbfs/tmp/python_packages.txt
""")

After the cluster restarts, verify from a notebook that Python was upgraded and is available.

Check with

%python
print(sys.version)
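Note that `print(sys.version)` only works once `sys` has been imported; without the import, Python raises a NameError. A minimal sketch of the same check with the import included, plus a hypothetical helper `meets_minimum` that tests for the 3.6 minimum Hail 0.2 needs (in a Databricks notebook the cell would start with %python as above):

```python
import sys  # required before sys.version can be referenced

# Print the full version string of the interpreter running this cell.
print(sys.version)


def meets_minimum(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


print(meets_minimum())
```

This runs the same way on any operating system, since it only inspects the interpreter that executes the cell.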

nnekaede · May 31, 2018, 6:57am

Are these Windows Command Prompt calls?

tpoterba · May 31, 2018, 12:30pm

The things in the last post like

%sh curl ...

are designed to be entered into a Databricks notebook, I believe. The % indicates an “IPython magic”, which does something special in a Python notebook setting (%sh runs shell commands, %fs interacts with the Databricks file system, etc).

nnekaede · June 1, 2018, 4:15pm

I ran the code in hail, restarted the cluster, and then checked to see if Python was upgraded, and I got an error:

name 'sys' is not defined

Is this because it is Anaconda for Linux and I am running Windows?

nnekaede · June 1, 2018, 4:23pm

Also, I have been trying to use the hail-deployment notebook from the Hail tutorial, and it is asking me to download Hail’s Java and Python libraries built with the latest stable version of Hail for Spark 2.1.1 and Scala 2.11. I cannot find these files to download. When I click on the links, they send me to a Google Doc that asks me to request permission to access it. I did so using the following emails: nede1@student.pvamu.edu and roaringlemurs@gmail.com.

tpoterba · June 18, 2018, 1:25pm

I’ve been on vacation, sorry to leave this hanging!

I think the code @yong posted is meant to be run inside a Databricks notebook, not locally on your laptop. Could you try that? Also, which is the hail deployment notebook you’re looking at – can you post the link?
