Hail 0.2 on EMR?

Hi, has anyone managed to get Hail 0.2 working on EMR? I’ve found bits and pieces but no straightforward guide…

Thanks,
Oron

I know that there are people running on EMR, but it’s something we should support in a more first-class way. Our team is busy with a hackathon today but could probably spare some time to think about this later in the week.

Here’s another thread where we’re trying to get things working on EMR:

Thanks, any help would be greatly appreciated…

Yeah, having a problem as well. Based on @tpoterba's previous suggestion I installed Anaconda with Python 3 on all of the nodes and then built a cluster using EMR 5.10.0 (Spark 2.2.0). When I try to use the pyspark command line, I cannot even get the hl.init(sc) command to work.

[hadoop@ip-172-30-1-4 ~]$ pyspark --jars hail-all-spark.jar --py-files hail-python.zip
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/05/03 23:18:29 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/05/03 23:18:40 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 3.6.5 (default, Mar 29 2018 18:21:58)
SparkSession available as 'spark'.

import hail as hl
hl.init(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'hail' has no attribute 'init'

Are you using an 0.1 jar/python zip? init was added in 0.2.

Moreover, --jars is insufficient because it does not set the spark class path.

@chirag_lakhani , you’ll need to start spark like this:

pyspark --jars hail-all-spark.jar \
  --py-files hail-python.zip \
  --conf spark.driver.extraClassPath=./hail-all-spark.jar \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.sql.files.openCostInBytes=1099511627776 \
  --conf spark.sql.files.maxPartitionBytes=1099511627776 \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator

I see that these options are not clearly included in the Getting Started page for non-Cloudera clusters. I will update this page today.

@tpoterba I am pretty sure I am using 0.2

Here is the shell script I used

HAIL_VERSION="0.2"
SPARK_VERSION="2.2.0"

/usr/bin/sudo pip install decorator

sudo yum update -y
sudo yum install gcc-c++ cmake git -y

git clone https://github.com/broadinstitute/hail.git
cd hail/
git checkout $HAIL_VERSION
./gradlew -Dspark.version=$SPARK_VERSION shadowJar archiveZip

cp $PWD/build/distributions/hail-python.zip $HOME
cp $PWD/build/libs/hail-all-spark.jar $HOME

echo "" >> $HOME/.bashrc
echo "export PYTHONPATH=${PYTHONPATH}:$HOME/hail-python.zip" >> $HOME/.bashrc
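For what it's worth, the PYTHONPATH trick works because Python can import packages directly out of a zip archive (zipimport). A minimal sketch; the `demo_mod` package here is a hypothetical stand-in for the real hail package inside hail-python.zip:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip containing a one-file package, the way hail-python.zip
# bundles the hail package.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "demo-python.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_mod/__init__.py", "VERSION = '0.2'\n")

# Putting the zip itself on sys.path is what PYTHONPATH=...:hail-python.zip
# does for the interpreter at startup.
sys.path.insert(0, zip_path)
import demo_mod

print(demo_mod.VERSION)  # -> 0.2
```

This is also why the zip has to be visible at the same path on every node that runs Python code: each interpreter resolves the import from its own filesystem.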

@danking I can try this and let you know what happens, thanks!

Hi chirag_lakhani,

I would like to know if you finally managed to install Hail 0.2 on AWS. If yes, what are the steps? If not what is the problem?

A big thank-you !

Hi,

We are struggling to get the CloudFormation template working for Hail 0.2 on EMR:

Does anyone have any success with this?

Qaiser

We are struggling with the same issue. The CloudFormation script works with Hail v0.1 and Spark 2.1.0, but the cluster shuts down after about 24 hours. The Hail v0.1 examples work while the cluster is up.

Trying to build Hail v0.2 from the master branch, but having issues with that as well.

Using the master branch from https://github.com/broadinstitute/hail.git results in

$ ./gradlew -Dspark.version=$SPARK_VERSION shadowJar archiveZip
a23032101373

FAILURE: Build failed with an exception.

* Where:
Build file '/home/hadoop/hail/build.gradle' line: 43

* What went wrong:
A problem occurred evaluating root project 'hail'.
> Hail does not support Spark version . Hail team recommends version 2.2.0.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 3.055 secs

while the Spark version is 2.2.0:

$ $SPARK_HOME/bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
                        
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_171
Branch HEAD
Compiled by user ec2-user on 2017-11-16T09:36:32Z
Revision d73c901b4228f4e75d3a527ec2318ce7376036cb
Url git@aws157git.com:/pkg/Aws157BigTop
Type --help for more information.

We are using AWS EMR 5.10.0 (see EMR releases here), which might be the culprit.

I was able to build the .jar file by symlinking the missing header directory. jni.h lives in the installed Java 8 directory, but not in /etc/alternatives/jre/include, which is where gradle looks for it.

Using the AWS EMR 5.10.0 version, the following command (ln -s takes the existing target first, then the link to create)

sudo ln -s /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.37.amzn1.x86_64/include /etc/alternatives/jre/include

should help. The build will then run and create the .jar and the .zip, with warnings.
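The exact JVM directory name varies by AMI, so it's worth confirming where jni.h actually lives before linking anything; the hunt is just a recursive file search. A sketch, demonstrated on a throwaway tree since the real paths differ per image:

```python
import os
import tempfile

def find_file(name, root):
    """Recursively search `root` for files called `name`."""
    return [os.path.join(d, name)
            for d, _subdirs, files in os.walk(root)
            if name in files]

# On the cluster you would call find_file("jni.h", "/usr/lib/jvm");
# here we fake a tiny JVM-style tree so the sketch is self-contained.
root = tempfile.mkdtemp()
include_dir = os.path.join(root, "java-1.8.0", "include")
os.makedirs(include_dir)
open(os.path.join(include_dir, "jni.h"), "w").close()

hits = find_file("jni.h", root)
print(hits)  # the directory containing the hit is what the symlink should target
```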

Next issue is that Hail 0.2 now needs python3 :stuck_out_tongue:

<rant>
Excuse me, Mr. SIMPLE ERROR MESSAGE. It is python3.6 that you need, because EMR comes with python3.4, and that is NOT enough for Hail. This is conveniently missing from the “slightly” updated documentation!
</rant>

It actually requires v3.6 of Python!

I bet you enjoy THIS live-tweeted misery more than the Apple keynote. Now that I know (well, I hope I know) which version of Python I need, I am hunting for the yum command to install it.

Check your release of Linux with

cat /etc/os-release

Cross your fingers, and type

sudo yum install python36 -y
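Once that's installed, a quick sanity check from whichever interpreter you'll point PYSPARK_PYTHON at saves a confusing failure later. A minimal sketch:

```python
import sys

# EMR 5.10's stock python3 is 3.4; Hail 0.2 wants 3.6+, so fail fast
# if this interpreter is too old to be used as PYSPARK_PYTHON.
if sys.version_info < (3, 6):
    raise RuntimeError("Hail 0.2 needs Python >= 3.6, found %d.%d"
                       % sys.version_info[:2])
print("Python OK: %d.%d" % sys.version_info[:2])
```

Run it as `python3 check.py` so you are testing the same binary that Spark will launch, not whatever `python` resolves to.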

AWS EMR requires that you distribute the .jar and .zip files for Hail to the other slave nodes. Might be obvious to everybody, but I am different.

Here is the script to test the setup (based on the hail.is v0.2 tutorial)

import hail as hl
import hail.expr.aggregators as agg
hl.init()
hl.utils.get_1kg('data/')
ds = hl.read_matrix_table('data/1kg.mt')
ds.rows().select().show(5)
ds.s.show(5)
ds.entry.take(5)
table = (hl.import_table('data/1kg_annotations.txt', impute=True)
         .key_by('Sample'))
table.describe()
table.show(width=100)
print(ds.col.dtype)
hl.stop()

and here is the command to actually run it

export SPARK_HOME=/usr/lib/spark
# Yes, I did build it under the $HOME for hadoop, so what?
export HAIL_HOME=/home/hadoop/hail
export PYTHONPATH="$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip"
# Do not forget to switch to Python v3.6
export PYSPARK_PYTHON=python3

# Note: Had to drop the .jar and .zip on /home/hadoop on the slave nodes!!!
spark-submit \
--jars /home/hadoop/hail-all-spark.jar \
--conf='spark.driver.extraClassPath=./hail-all-spark.jar' \
--conf='spark.executor.extraClassPath=./hail-all-spark.jar' \
--files /home/hadoop/hail-python.zip \
test.py

@gaborkorodi-hms – which version of Spark are you using? If you're using 2.3.0+, then we can probably make this process much easier by using HTTPS for --jars / --py-files

We are running AWS EMR 5.10.0 at the moment, with Spark 2.2.0, handrolling Hail devel-xxxx, Python 3.6, and Jupyter Notebook 0.9.

We can switch to AWS EMR 5.14.X, which has Spark 2.3.0 and JupyterHub 0.8, and handroll the rest. Please do let me know if you have an easier way. A CloudFormation script, perhaps?

Someone else pointed out that we’ll need to deploy jars compiled for 2.3.0 in order to make this possible. I’ll bring this up at our weekly checkin tomorrow.