Hail 0.2 on glue

I am facing the error below when trying to import Hail on Glue (Spark 3, Python 2, Glue version 1.0). Any help is much appreciated:

21/06/29 21:47:11 ERROR ApplicationMaster: User application exited with status 1

import sys, os
from pyspark import SparkContext, SparkConf
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from hail import *
#import hail as hl

conf = SparkConf()
conf.set('spark.app.name', u'Running Hail on Glue')
conf.set('spark.sql.files.maxPartitionBytes', '1099511627776')
conf.set('spark.sql.files.openCostInBytes', '1099511627776')
conf.set('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator')
conf.set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')

sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")
sc.getConf().getAll()

glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
#job.init(args['JOB_NAME'], args)
hc = HailContext(sc)
#hl.init(sc)

print("Hello World!!!")

When I change the import statements as shown below, I see this error:

Py4JError: An error occurred while calling z:is.hail.HailContext.apply. Trace:

#from hail import *
import hail as hl

#hc = HailContext(sc)
hl.init(sc)

I also want to add that when using Python 3 I see this error:

ModuleNotFoundError: No module named 'SocketServer'

What version of Hail are you using? There’s no mention of “SocketServer” anywhere in the code base right now.
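
For context, SocketServer is the Python 2 name of a standard-library module that Python 3 renamed to socketserver, so this error usually points at Python 2-era code being imported under Python 3. A minimal sketch of the rename (not taken from Hail's code):

```python
# Python 3 renamed the standard-library module SocketServer to socketserver.
try:
    import socketserver  # Python 3 name
except ImportError:
    import SocketServer as socketserver  # Python 2 name

# The same classes are available under either name.
print(socketserver.TCPServer.__name__)
```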

I am using Hail 0.2

What’s the full version? pip show hail?
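
If running pip inside the job environment is awkward, the same information can be read from Python. This is only a sketch: the helper name is hypothetical, and on Python older than 3.8 it assumes the importlib_metadata backport is installed.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport; assumed installed on older Pythons


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints None when hail is not installed in the interpreter running this code.
print("hail version:", installed_version("hail"))
```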

I am trying to rebuild (steps below) on an EC2 instance running Amazon Linux and am seeing this error:

FAILURE: Build failed with an exception.

* Where:
Build file '/home/ec2-user/hail/hail/build.gradle' line: 194

* What went wrong:
A problem occurred evaluating root project 'hail'.

assert(scalaMajorVersion == "2.11")
       |                 |
       '2.12'            false

sudo yum install -y g++ cmake git
sudo yum install -y lz4
sudo yum install -y lz4-devel
git clone https://github.com/hail-is/hail.git

cd hail/hail && git fetch && git checkout

sudo yum groupinstall 'Development Tools'
sudo yum install java-1.8.0
sudo alternatives --config java
sudo yum search java | grep openjdk
sudo yum install java-1.8.0-openjdk-headless.x86_64
sudo yum install java-1.8.0-openjdk-devel.x86_64
sudo update-alternatives --config java
sudo update-alternatives --config javac

make install HAIL_COMPILE_NATIVES=1 SPARK_VERSION=2.4.4

Any help, please? I am stuck getting Hail 0.2 working on Glue.

Try this:

make install HAIL_COMPILE_NATIVES=1 SPARK_VERSION=2.4.4 SCALA_VERSION=2.11.12
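
The assertion in the Gradle output compares the Scala major version with "2.11", and the build had resolved "2.12". A rough sketch of that comparison in Python, assuming the major version is simply the first two components of SCALA_VERSION (the helper name is hypothetical; the real check lives in build.gradle):

```python
def scala_major_version(scala_version):
    """Return the 'major.minor' prefix of a Scala version string."""
    parts = scala_version.split(".")
    if len(parts) < 2:
        raise ValueError("expected a version like 2.11.12, got %r" % scala_version)
    return ".".join(parts[:2])


# 2.11.12 satisfies the comparison that failed in the build output above.
print(scala_major_version("2.11.12") == "2.11")
# A 2.12.x version does not (the patch number here is only an example).
print(scala_major_version("2.12.13") == "2.11")
```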

Thanks, I was able to build on EC2 (Amazon Linux). But when I try to use these files on AWS Glue I see this error:

import hail as hl
ModuleNotFoundError: No module named 'hail'

[ec2-user@ip-172-31-50-72 ~]$ pip show hail
Name: hail
Version: 0.2.74
Summary: Scalable library for exploring and analyzing genomic data.
Home-page: https://hail.is
Author: Hail Team
Author-email: hail@broadinstitute.org
License: UNKNOWN
Location: /home/ec2-user/.local/lib/python3.7/site-packages
Requires: aiohttp, humanize, janus, pandas, asyncinit, decorator, aiohttp-session, pyspark, google-cloud-storage, tabulate, python-json-logger, scipy, bokeh, tqdm, dill, botocore, gcsfs, boto3, PyJWT, nest-asyncio, Deprecated, fsspec, requests, numpy, parsimonious, hurry.filesize

I got past the previous error but currently stuck at the one below. Please help if you can:

  File "/tmp/test_08_02", line 8, in <module>
    import hail as hl
  File "/tmp/hail-python.zip/hail/__init__.py", line 44, in <module>
    from .table import Table, GroupedTable, asc, desc  # noqa: E402
  File "/tmp/hail-python.zip/hail/table.py", line 7, in <module>
    from hail.expr.expressions import Expression, StructExpression,
  File "/tmp/hail-python.zip/hail/expr/__init__.py", line 1, in <module>
    from .types import dtype, HailType, hail_type, is_container, is_compound,
  File "/tmp/hail-python.zip/hail/expr/types.py", line 10, in <module>
    from hail import genetics
  File "/tmp/hail-python.zip/hail/genetics/__init__.py", line 1, in <module>
    from .call import Call
  File "/tmp/hail-python.zip/hail/genetics/call.py", line 2, in <module>
    from hail.utils import FatalError
  File "/tmp/hail-python.zip/hail/utils/__init__.py", line 8, in <module>
    from .tutorial import get_1kg, get_hgdp, get_movie_lens
  File "/tmp/hail-python.zip/hail/utils/tutorial.py", line 7, in <module>
    from hailtop.utils import sync_retry_transient_errors
  File "/tmp/hail-python.zip/hailtop/utils/__init__.py", line 2, in <module>
    from .utils import (
  File "/tmp/hail-python.zip/hailtop/utils/utils.py", line 19, in <module>
    import google.auth.exceptions
ModuleNotFoundError: No module named 'google'

pip install --upgrade --target=/home/ec2-user/hail/hail/python/ google

The import still fails with the same traceback as above, ending in:

  File "/tmp/hail-python.zip/hailtop/utils/utils.py", line 19, in <module>
    import google.auth.exceptions
ModuleNotFoundError: No module named 'google'

pip install --upgrade google-auth google-auth-httplib2 google-api-python-client

Again the import fails with the same traceback, ending in:

  File "/tmp/hail-python.zip/hailtop/utils/utils.py", line 19, in <module>
    import google.auth.exceptions
ModuleNotFoundError: No module named 'google'
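
The traceback fails on "import google.auth.exceptions" inside the interpreter that runs the job, which may not be the interpreter pip installed into on the EC2 host. A small sketch for checking, from inside the job, which modules that interpreter can actually find. The module list is an assumption based on the traceback and the Requires line above; note that distribution names (google-auth) differ from import names (google.auth).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found on sys.path."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. 'google') is itself missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Import names here are assumptions, not a complete list of Hail's needs.
print(missing_modules(["google.auth", "aiohttp", "pandas", "numpy"]))
```

Running this at the top of the Glue script, before "import hail", shows whether the pip-installed packages are visible to the job at all.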

Just bumping up this thread as I’m trying to achieve the same thing. Is there a recommended way to run Hail in AWS Glue? I tried the following:

  • I pulled down the main branch from GitHub and ran ./gradlew to bundle up the JAR, then added "hail-all-spark.jar" under "Python library path". This gave me a "ModuleNotFoundError: No module named 'hail'" error.
  • I set "--additional-python-modules" to "hail==0.2.78", but this failed with "ImportError: cannot import name 'Markup' from 'jinja2' (/home/spark/.local/lib/python3.7/site-packages/jinja2/__init__.py)". I then added "jinja2==3.0.3", but that gave me a "TypeError: 'JavaPackage' object is not callable" error.
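
The 'JavaPackage' object is not callable error typically appears when the Python package imports fine but the matching JAR is not on the JVM classpath, so the two attempts above may each have supplied only half of what Hail needs. One untested direction is to pass both as Glue job parameters. The sketch below only builds the argument dictionary: --additional-python-modules and --extra-jars are Glue job parameters, while the helper name and the S3 path are hypothetical, and nothing in this thread confirms the combination works.

```python
def glue_default_arguments(python_modules, extra_jars):
    """Build a Glue DefaultArguments-style dict from module pins and JAR paths."""
    return {
        "--additional-python-modules": ",".join(python_modules),
        "--extra-jars": ",".join(extra_jars),
    }


args = glue_default_arguments(
    ["hail==0.2.78", "jinja2==3.0.3"],
    ["s3://my-bucket/jars/hail-all-spark.jar"],  # hypothetical bucket and key
)
print(args)
```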

What is the recommended approach for adding the Hail library to AWS Glue? And similarly for EMR?

Hmm. There was an AWS EMR supported solution, but it appears AWS has dropped support: "Partner Solution not available" (Amazon Web Services).

Do you happen to know if AWS Glue with Hail was ever supported?

From the documentation, it seems that Glue supports adding Python libraries by using a JAR:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as pandas (Python data analysis) library, are not yet supported.

Ref: See #6 in Providing your own custom scripts - AWS Glue

However, I don’t think this would work, because Hail is exposed as a Python library but is built with Scala, Spark, and C++. So it’s not a pure Python library.

The other alternative I found is to install the Python modules with pip through --additional-python-modules (ref: Using Python libraries with AWS Glue - AWS Glue).

Is there any way to find a complete list of dependency libraries that Hail requires? Perhaps I can try including all the libraries needed to see if the installation will be successful.
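
One way to get at least the direct Python dependencies is to read them from the installed package's metadata. This is a sketch with hypothetical helper names; it lists only what the hail distribution itself declares (not transitive dependencies, and not the JAR or native parts), and on Python older than 3.8 it assumes the importlib_metadata backport is installed.

```python
import re

try:
    from importlib import metadata  # Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport; assumed installed on older Pythons


def requirement_name(requirement):
    """Extract the bare distribution name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError("not a requirement string: %r" % requirement)
    return match.group(1)


def declared_dependencies(dist_name):
    """Return the sorted dependency names a distribution declares, or [] if absent."""
    try:
        requirements = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []
    return sorted({requirement_name(r) for r in requirements})


# Prints an empty list when hail is not installed in this interpreter.
print(declared_dependencies("hail"))
```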

We never supported AWS Glue. I don’t know if Amazon / AWS Glue ever supported Hail.

You’re correct. Hail is not a pure Python library. It needs a lot of special configuration. We describe how to install Hail on an arbitrary Spark cluster in the installation documentation.
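
As a rough illustration of what that configuration involves, the sketch below assembles the Spark properties I would expect a pip-installed Hail to need: the bundled JAR on the driver and executor classpaths, plus the Kryo settings already used in the script at the top of this thread. The JAR location (backend/hail-all-spark.jar inside the installed package) and the helper name are assumptions; the installation documentation is authoritative, and whether Glue lets you set all of these is untested.

```python
import os


def hail_spark_conf(hail_home):
    """Return Spark properties pointing at the Hail JAR under hail_home."""
    jar = os.path.join(hail_home, "backend", "hail-all-spark.jar")
    return {
        "spark.jars": jar,
        "spark.driver.extraClassPath": jar,
        # Executors see the shipped JAR by file name in their working directory.
        "spark.executor.extraClassPath": "./hail-all-spark.jar",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
    }


# Package location taken from the `pip show hail` output earlier in the thread.
conf = hail_spark_conf("/home/ec2-user/.local/lib/python3.7/site-packages/hail")
for key, value in sorted(conf.items()):
    print(key, "=", value)
```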