Running Hail on AWS EMR

Hi,

I apologize if this is a dumb question, but I’m a Hail newbie. I am trying to run Hail on Amazon EMR. I have set up a small EMR cluster using emr-5.13.0, which has Spark 2.3.0. Following some scripts built by Amazon, I logged into the master node and ran the following:

HAIL_VERSION="0.1"
SPARK_VERSION="2.3.0"

/usr/bin/sudo pip install decorator

sudo yum update -y
sudo yum install g++ cmake git -y

git clone https://github.com/broadinstitute/hail.git
cd hail/
git checkout $HAIL_VERSION
./gradlew -Dspark.version=$SPARK_VERSION shadowJar archiveZip

cp $PWD/build/distributions/hail-python.zip $HOME
cp $PWD/build/libs/hail-all-spark.jar $HOME

echo "" >> $HOME/.bashrc
echo "export PYTHONPATH=${PYTHONPATH}:$HOME/hail-python.zip" >> $HOME/.bashrc

Afterwards, I tried to start pyspark using the following command:

pyspark --jars hail-all-spark.jar --py-files hail-python.zip

When I try to import Hail, I get the following error:

sc.addFile('/home/hadoop/hail-all-spark.jar')
sc.addPyFile('/home/hadoop/hail-python.zip')
from hail import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/tmp/spark-7970940d-c351-48d6-be82-1c2cb4647b24/userFiles-7fa2e70a-b1b5-48a3-bcf8-fba119200a8a/hail-python.zip/hail/__init__.py", line 1, in <module>

#

  File "/mnt/tmp/spark-7970940d-c351-48d6-be82-1c2cb4647b24/userFiles-7fa2e70a-b1b5-48a3-bcf8-fba119200a8a/hail-python.zip/hail/expr.py", line 3, in <module>
  File "/mnt/tmp/spark-7970940d-c351-48d6-be82-1c2cb4647b24/userFiles-7fa2e70a-b1b5-48a3-bcf8-fba119200a8a/hail-python.zip/hail/representation/__init__.py", line 1, in <module>
#
  File "/mnt/tmp/spark-7970940d-c351-48d6-be82-1c2cb4647b24/userFiles-7fa2e70a-b1b5-48a3-bcf8-fba119200a8a/hail-python.zip/hail/representation/variant.py", line 2, in <module>
  File "/mnt/tmp/spark-7970940d-c351-48d6-be82-1c2cb4647b24/userFiles-7fa2e70a-b1b5-48a3-bcf8-fba119200a8a/hail-python.zip/hail/typecheck/__init__.py", line 1, in <module>
#
  File "/mnt/tmp/spark-7970940d-c351-48d6-be82-1c2cb4647b24/userFiles-7fa2e70a-b1b5-48a3-bcf8-fba119200a8a/hail-python.zip/hail/typecheck/check.py", line 1, in <module>
ImportError: cannot import name getargspec

hc = HailContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'HailContext' is not defined

Could someone explain why this is happening?

Just made a post about this:

However, I don’t think Hail 0.1 is compatible with Spark 2.3.0 anyway. You should use the 0.2 beta version if you’re just getting started!
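
For anyone landing here with the same ImportError: the traceback ends with hail/typecheck/check.py failing to import getargspec. One guess (not confirmed in this thread) is that the name is expected from the decorator package that the setup script installs with pip, and that the installed release does not provide it. Here is a minimal Python sketch to check that from the master node; module_exposes is a hypothetical helper name, not part of Hail or decorator.

```python
import importlib


def module_exposes(module_name, attr_name):
    """Return True if module_name can be imported and has attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing from this interpreter.
        return False
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # Assumption: Hail 0.1's check.py takes getargspec from `decorator`.
    print("decorator exposes getargspec:",
          module_exposes("decorator", "getargspec"))
```

If this prints False, the installed decorator release is a likely suspect, although the Hail 0.1 / Spark 2.3.0 mismatch mentioned above would still apply.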

Thanks so much. Yes, I just tried it, but I’m realizing that pyspark on EMR uses Python 2.7, whereas you guys use Python 3.6. Do you know how one would change that?

Here’s something that might help: https://medium.com/@datitran/quickstart-pyspark-with-anaconda-on-aws-660252b88c9a

This is great, thanks so much!

If you use EMR release 5.12.1, you can install Python 3.6 as a bootstrap action on the cluster.
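
In case it helps, installing Python 3.6 is only half of it: pyspark also has to be told to use it. Spark reads the PYSPARK_PYTHON environment variable for this, and EMR lets you set it through a spark-env configuration at cluster creation. Below is a rough sketch that builds that configuration as JSON. The interpreter path /usr/bin/python3 is an assumption (check where your bootstrap action actually puts it), and spark_env_python_config is just a hypothetical helper name.

```python
import json


def spark_env_python_config(python_path="/usr/bin/python3"):
    """Build an EMR configuration list that points PySpark at python_path."""
    return [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    # Assumed interpreter location; adjust for your cluster.
                    "Properties": {"PYSPARK_PYTHON": python_path},
                }
            ],
        }
    ]


if __name__ == "__main__":
    # Print JSON that can be supplied as the cluster's configuration.
    print(json.dumps(spark_env_python_config(), indent=2))
```

The printed JSON can be supplied as the software configuration when you create the cluster, alongside the bootstrap action that installs Python 3.6.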