Issue building Hail 0.2 on EMR

We are using the AWS Hail build script s3://aws-bigdata-blog/artifacts/hail-on-emr/build_hail.sh (parameters: EMR 5.12.1, Spark 2.2.0) to build Hail 0.2. It runs as one of several bootstrap actions in our CloudFormation template.

We last updated our stack with a commit from 5/30 and wanted to get the latest changes before posting on another issue.

When updating the stack with commit 9b8f5ef on 6/15, the build failed. We narrowed it down and found that commit 172da95 on 6/4 builds successfully but commit b12fb45 on 6/5 fails.

The part of the build script that fails is:

curl -L -O https://storage.googleapis.com/hail-common/libsimdpp-2.0-rc2.tar.gz
tar -xzf libsimdpp-2.0-rc2.tar.gz
g++ -O3 -march=native -std=c++11 -Ilibsimdpp-2.0-rc2 -Wall -Werror -fPIC -ggdb -fno-strict-aliasing -I…/resources/include -I/etc/alternatives/jre/include -I/etc/alternatives/jre/include/linux -c -o ibs.o ibs.cpp
g++ -O3 -march=native -std=c++11 -Ilibsimdpp-2.0-rc2 -Wall -Werror -fPIC -ggdb -fno-strict-aliasing -I…/resources/include -I/etc/alternatives/jre/include -I/etc/alternatives/jre/include/linux -c -o davies.o davies.cpp
touch headers
g++ -O3 -march=native -std=c++11 -Ilibsimdpp-2.0-rc2 -Wall -Werror -fPIC -ggdb -fno-strict-aliasing -I…/resources/include -I/etc/alternatives/jre/include -I/etc/alternatives/jre/include/linux -c -o NativeCodeSuite.o NativeCodeSuite.cpp
g++ -O3 -march=native -std=c++11 -Ilibsimdpp-2.0-rc2 -Wall -Werror -fPIC -ggdb -fno-strict-aliasing -I…/resources/include -I/etc/alternatives/jre/include -I/etc/alternatives/jre/include/linux -c -o NativeLongFunc.o NativeLongFunc.cpp
:nativeLib FAILED

BUILD FAILED

Did something change in the build process on June 5th? To date, the referenced script has served us well for both Hail 0.1 and earlier 0.2 releases.

Sorry, this got caught in the Discourse spam autofilter for some reason. I'll look into the settings that triggered that.

You shouldn’t really need to build Hail yourself! We deploy a jar and zip for each commit to our public Google bucket. Spark has built-in functionality for adding jars and zips from arbitrary URIs, and it supports https as of Spark 2.3.0.

I’m not totally sure how EMR works, but if you have access to a spark-submit-like API, then you can use --jars and --py-files with our jar/zip URLs:

HASH=$(curl https://storage.googleapis.com/hail-common/builds/devel/latest-hash-spark-2.2.0.txt)
spark-submit \
--jars https://storage.googleapis.com/hail-common/builds/devel/jars/hail-devel-$HASH-Spark-2.2.0.jar \
--py-files https://storage.googleapis.com/hail-common/builds/devel/python/hail-devel-$HASH.zip

This may not be helpful depending on the EMR API.
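Since https support for --jars and --py-files only arrived in Spark 2.3.0, one option on a Spark 2.2.0 cluster might be to download the two artifacts first and hand spark-submit local paths. A rough sketch, assuming the bucket layout shown above; the helper names (hail_jar_url, hail_zip_url, fetch_hail) and the output file names are made up for illustration, and I haven't tested this on EMR:

```shell
# Sketch only: helper names and output file names are hypothetical.
BASE_URL="https://storage.googleapis.com/hail-common/builds/devel"

# Build the jar URL for a given commit hash ($1) and Spark version ($2).
hail_jar_url() {
  printf '%s/jars/hail-devel-%s-Spark-%s.jar\n' "$BASE_URL" "$1" "$2"
}

# Build the Python zip URL for a given commit hash ($1).
hail_zip_url() {
  printf '%s/python/hail-devel-%s.zip\n' "$BASE_URL" "$1"
}

# Download the latest jar and zip for a Spark version ($1) into a
# directory ($2, default: current directory). The body runs in a
# subshell, so its variables do not leak into the caller.
fetch_hail() (
  spark_version="$1"
  dest="${2:-.}"
  # -f makes curl fail on HTTP errors instead of saving an error page.
  hash=$(curl -fsSL "$BASE_URL/latest-hash-spark-$spark_version.txt") || exit 1
  curl -fsSL -o "$dest/hail.jar" "$(hail_jar_url "$hash" "$spark_version")" || exit 1
  curl -fsSL -o "$dest/hail.zip" "$(hail_zip_url "$hash")" || exit 1
)
```

After sourcing this, something like `fetch_hail 2.2.0 /some/dir` followed by `spark-submit --jars /some/dir/hail.jar --py-files /some/dir/hail.zip your_script.py` would be the idea, where /some/dir and your_script.py are placeholders. Whether this fits into a bootstrap action depends on the EMR setup.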

Does Hail 0.2 support Spark 2.3.0? There was a comment by Dan King on Gitter a few days ago indicating that 2.3.0 was not supported.

We’ll discuss the implications of moving to a non-build deployment process. Earlier on, building on the actual cluster seemed likely to introduce fewer complications around syncing dependencies, and it was the approach recommended by the AWS team.

Ah, Dan’s right that we probably shouldn’t use our 2.2.0 jars for 2.3.0. I do think it’s possible to compile Hail against Spark 2.3.0 though.

We haven’t been in touch with the team that wrote that blog post, so we’re not super familiar with the specific AWS needs and restrictions.