How to install Hail on a Spark cluster

Hello Everyone,

I am new to this domain and am trying to set up Hail for the first time.

I would like to install Hail on a Spark cluster.

So, referring to the documentation, I executed the command below:

make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5

But I got the error message below:

make: *** No rule to make target ‘install-on-cluster’. Stop.

However, I am able to successfully install Hail locally using pip.

Can you guide me on what could cause the make install-on-cluster command to fail, and how I can resolve it?

Is there any step-by-step tutorial for beginners?

Hey @Aks,

Sorry you ran into this; our documentation is a little unclear. You need to run that command from the hail folder inside the hail repository:

git clone https://github.com/hail-is/hail.git
cd hail
cd hail
make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5
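If you are unsure which values to pass for SPARK_VERSION and SCALA_VERSION, one way to check (a general Spark command, nothing Hail-specific) is to ask the cluster itself; the values above should match what it reports:

spark-submit --version   # prints the Spark version and the Scala version it was built with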

Hi @danking,

Thanks for the response; I appreciate your help. Just trying to understand: once I download Hail from the git repository, how is the connection established between Hail and the Spark installation on my server?

Do we have to make any changes?

Thanks a ton.

You can verify Hail is installed correctly by executing this:

python3 -c 'import hail as hl; hl.balding_nichols_model(3, 1000, 1000).show()'

You should see a matrix of random genotypes. You should also verify that your Spark cluster has a corresponding Spark job. If you do not see any new jobs in your Spark cluster's history viewer, then you probably accidentally installed pyspark via pip, which installs a version of Spark not intended for Spark clusters.
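One way to check whether a pip-installed pyspark is shadowing your cluster's Spark (a plain pip/Python check, nothing Hail-specific):

pip3 show pyspark                                      # if this prints package details, pyspark came from pip
python3 -c 'import pyspark; print(pyspark.__file__)'   # shows which pyspark installation is actually imported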

You should not need to make any changes after running make install-on-cluster ...


Hi @danking,

Yes, it's working. I am able to see the matrix table. However, one quick question:

I am used to working with Jupyter notebooks and Python, so I wanted to use PySpark. Typing the command below from the documentation,

(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')

I got the Hail directory path, which is as shown below:

/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail

Later, when I run the following commands in a Jupyter notebook,

from pathlib import Path
import pyspark
import hail as hl

hail_home = Path('/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail')
hail_jars = hail_home/'build'/'libs'/'hail-all-spark.jar'

conf = pyspark.SparkConf().setAll([
    ('spark.jars', str(hail_jars)),
    ('spark.driver.extraClassPath', str(hail_jars)),
    ('spark.executor.extraClassPath', './hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
    ('spark.driver.memory', '80g'),
    ('spark.executor.memory', '80g'),
    ('spark.local.dir', '/tmp,/data/volume03/spark')
])

sc = pyspark.SparkContext('local[*]', 'Hail', conf=conf)
hl.init(sc)

I get an error as shown below

TypeError: 'JavaPackage' object is not callable

Is anything wrong with my hail_home path?

I realize that I don't have the build folder under hail_home, which is causing the issue when resolving the Java package.

But the command in the doc gives the path below for hail_home:

/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail

Post update:

I see it's under the backend folder. May I check why the path is different? Have I installed it in an incorrect location, since the documentation mentions the other location?

So I updated hail_jars = hail_home/'backend'/'hail-all-spark.jar'

Now it works, I guess.
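For reference, instead of guessing the sub-folder I could also have located the jar directly (a sketch; the exact layout may differ between Hail versions):

find "$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')" -name hail-all-spark.jar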

I think there's a misunderstanding here. If you are using pyspark or spark-submit, then you need to specify special Spark configuration variables. If you use python or a Jupyter notebook, you don't need to specify anything.
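For example, if you do launch pyspark yourself, the configuration would look roughly like this (a sketch reusing the settings from your snippet above; adjust the jar path to wherever hail-all-spark.jar actually lives in your installation):

HAIL_JAR="$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')/backend/hail-all-spark.jar"
pyspark \
  --jars "$HAIL_JAR" \
  --conf spark.driver.extraClassPath="$HAIL_JAR" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator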

I'm also confused about your use of local[*]. If you have a Spark cluster, you should not use that: it means you want to run in local mode and not use the cluster at all.

If you have a Spark cluster and you want to use a Jupyter Notebook, do this:

git clone https://github.com/hail-is/hail.git
cd hail
cd hail
make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5
python3 -m pip install jupyter
jupyter notebook --no-browser

Now, open a browser and connect to the Jupyter instance on your leader node: http://name-of-spark-leader-node:8888. There's no need to configure the Spark conf and no need to set up Python jars. Just open a Python kernel and execute this:

import hail as hl
hl.init()
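Once hl.init() succeeds in a cell, the same sanity check from earlier should work and should show up as a job on the cluster:

hl.balding_nichols_model(3, 1000, 1000).show()   # prints a small matrix of random genotypes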

Hi @danking

A quick question: though I followed the above instructions as is, I guess we need to install pyspark to be able to execute the sample code below.

python3 -c 'import hail as hl; hl.balding_nichols_model(3, 1000, 1000).show()'

As you can see below, this is the error message that I received after executing the above Hail installation verification command:

(bio) abcd@server1:~/hail/hail$ python3 -c 'import hail as hl; hl.balding_nichols_model(3, 1000, 1000).show()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/hail/__init__.py", line 31, in <module>
    from .table import Table, GroupedTable, asc, desc  # noqa: E402
  File "/home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/hail/table.py", line 4, in <module>
    import pyspark
ModuleNotFoundError: No module named 'pyspark'

In the prompt, bio indicates the Python virtual environment.

However, following the error message, if I install pyspark using pip install pyspark, I am able to see the matrix output.

Can I kindly request you to correct me if I am wrong?

You probably need to change your PYTHONPATH to include the path to the Spark Python files, as shown here: https://www.tutorialspoint.com/pyspark/pyspark_environment_setup.htm


You can also create a Python REPL or submit a script using the pyspark executable, which should be on your path.
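Roughly like this, assuming $SPARK_HOME/bin is on your PATH (the script name is just a placeholder):

pyspark                            # interactive Python REPL with Spark and py4j already on the path
spark-submit your_hail_script.py   # or submit a script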

Hi @tpoterba, I appreciate your help.

I keyed in the settings below in my .bashrc file, following the tutorial link that you shared:

export SPARK_HOME=/usr/spark
export PATH=$PATH:/usr/spark/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH

Later, I also ran source .bashrc.

After logging off and logging back into the server, I still get the same error:

ModuleNotFoundError: No module named 'pyspark'

Later, when I navigate to the /usr/spark folder and execute the command below

./bin/pyspark

I am able to see the output:

Python 3.7.2 (default, Sep 14 2020, 18:01:09)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
20/09/14 20:07:21 WARN Utils: Your hostname, test resolves to a loopback address: 127.0.1.1; using xxx.xx.xx.xxx instead (on interface enp14s0)
20/09/14 20:07:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/09/14 20:07:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.7.2 (default, Sep 14 2020 18:01:09)
SparkSession available as 'spark'.

Can you let me know how I can make Hail use the pyspark that ships with the Spark folder, rather than installing pyspark myself? I followed the tutorial as is, but I am not sure what the mistake is.

From a terminal, can you run the following and paste the output:

$ echo $SPARK_HOME
$ echo $PYTHONPATH
$ which python3
$ python3 -c "import pyspark"

Hi @tpoterba,

Thanks. I managed to resolve a few errors, but I still have a few unresolved issues.

$ echo $SPARK_HOME - /usr/spark
$ echo $PYTHONPATH - /usr/spark/python:/usr/spark/python/lib/py4j-0.10.7-src.zip:
$ which python3 - /home/abc/.pyenv/shims/python3  # this is the Python from my virtual environment (which is set as the global/system version)
$ python3 -c "import pyspark" - no errors and cursor moved to next line

So when I tried executing balding_nichols_model in my terminal using hail_script.py, I was able to get the output.

So, later I launched jupyter notebook using the following command.

jupyter notebook - This command launches Jupyter Notebook and prints a URL where it can be accessed.

So, I port-forward that and open it on my local laptop. When I execute

import hail as hl

I get an error: ImportError: No module named hail

I also have an env variable set in my .bashrc file: HAIL_HOME=/usr/hail/hail

In the Jupyter notebook, when I issue os.getenv('HAIL_HOME'), I get the same path above as output.

May I know what else I should be doing to make it all work in my Jupyter notebook?

This means pyspark is properly installed with python3.

You shouldn't need to set HAIL_HOME; make install-on-cluster will install a distribution using pip. I think the most likely issue is that the Jupyter notebook you're running is using a different Python kernel than the one you've listed above, so it doesn't have any of the libraries you expect.
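A quick way to confirm which interpreter the notebook kernel is using (plain Python, nothing Hail-specific) is to run this in a cell and compare it with the which python3 output from your terminal:

import sys
print(sys.executable)   # should match the path reported by `which python3`
print(sys.path)         # should include the Spark python/ and py4j entries from PYTHONPATH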