How to install Hail on a Spark cluster

Hello Everyone,

I am new to this domain and am trying to set up Hail for the first time.

I would like to install Hail on a Spark cluster.

So, referring to the documentation, I executed the command below:

make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5

But I got the error message below:

make: *** No rule to make target ‘install-on-cluster’. Stop.

However, I am able to successfully install Hail locally using pip.

Can you guide me on what could cause the make install-on-cluster command to fail, and how I can resolve it?

Is there any step-by-step tutorial for beginners?

Hey @Aks,

Sorry you ran into this; our documentation is a little unclear. You need to run that command from the hail folder inside the hail repository:

git clone https://github.com/hail-is/hail.git
cd hail
cd hail
make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5
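If you are unsure which values to pass for SPARK_VERSION and SCALA_VERSION, one way to check (a general Spark command, nothing Hail-specific) is to ask the cluster itself; the values above should match what it reports:

spark-submit --version   # prints the Spark version and the Scala version it was built with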

Hi @danking,

Thanks for the response; I appreciate your help. Just trying to understand: once I download Hail from the git repository, how is the connection established between Hail and the Spark installation on my server?

Do we have to make any changes?

Thanks a ton.

You can verify Hail is installed correctly by executing this:

python3 -c 'import hail as hl; hl.balding_nichols_model(3, 1000, 1000).show()'

You should see a matrix of random genotypes. You should also verify that your Spark cluster has a corresponding Spark job. If you do not see any new jobs in your Spark cluster's history viewer, then you probably accidentally installed pyspark via pip, which installs a version of Spark not intended for Spark clusters.
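One way to check whether a pip-installed pyspark is shadowing your cluster's Spark (a plain pip/Python check, nothing Hail-specific):

pip3 show pyspark                                      # if this prints package details, pyspark came from pip
python3 -c 'import pyspark; print(pyspark.__file__)'   # shows which pyspark installation is actually imported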

You should not need to make any changes after running make install-on-cluster ...


Hi @danking,

Yes, it's working. I am able to see the matrix table. However, one quick question:

I am used to working with Jupyter notebooks and Python, so I wanted to use PySpark. Typing the command below from the documentation,

(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')

I got the Hail directory path, which is as shown below:

/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail

Later, when I run the following commands in a Jupyter notebook,

from pathlib import Path
import pyspark
import hail as hl

hail_home = Path('/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail')
hail_jars = hail_home/'build'/'libs'/'hail-all-spark.jar'

conf = pyspark.SparkConf().setAll([
    ('spark.jars', str(hail_jars)),
    ('spark.driver.extraClassPath', str(hail_jars)),
    ('spark.executor.extraClassPath', './hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
    ('spark.driver.memory', '80g'),
    ('spark.executor.memory', '80g'),
    ('spark.local.dir', '/tmp,/data/volume03/spark')
])

sc = pyspark.SparkContext('local[*]', 'Hail', conf=conf)
hl.init(sc)

I get an error as shown below

TypeError: 'JavaPackage' object is not callable

Is anything wrong with my hail_home path?

I realize that I don't have the build folder under hail_home, which is causing the issue when resolving the Java package.

But the command in the doc gives the path below for hail_home:

/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail

Post update:

I see it's under the backend folder. May I check why the path is different? Have I installed it in an incorrect location, since the documentation mentions the other location?

So I updated hail_jars = hail_home/'backend'/'hail-all-spark.jar'

Now it works, I guess.
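For reference, instead of guessing the sub-folder I could also have located the jar directly (a sketch; the exact layout may differ between Hail versions):

find "$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')" -name hail-all-spark.jar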

I think there's a misunderstanding here. If you are using pyspark or spark-submit, then you need to specify special Spark configuration variables. If you use python or a Jupyter notebook, you don't need to specify anything.
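For example, if you do launch pyspark yourself, the configuration would look roughly like this (a sketch reusing the settings from your snippet above; adjust the jar path to wherever hail-all-spark.jar actually lives in your installation):

HAIL_JAR="$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')/backend/hail-all-spark.jar"
pyspark \
  --jars "$HAIL_JAR" \
  --conf spark.driver.extraClassPath="$HAIL_JAR" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator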

I'm also confused about your use of local[*]. If you have a Spark cluster, you should not use that: it means you want to run in local mode and not use the cluster at all.

If you have a Spark cluster and you want to use a Jupyter Notebook, do this:

git clone https://github.com/hail-is/hail.git
cd hail
cd hail
make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5
python3 -m pip install jupyter
jupyter notebook --no-browser

Now, open a browser and connect to the Jupyter instance on your leader node: http://name-of-spark-leader-node:8888. There's no need to configure the Spark conf and no need to set up Python jars. Just open a Python kernel and execute this:

import hail as hl
hl.init()
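Once hl.init() succeeds in a cell, the same sanity check from earlier should work and should show up as a job on the cluster:

hl.balding_nichols_model(3, 1000, 1000).show()   # prints a small matrix of random genotypes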

Hi @danking

A quick question: though I followed the above instructions as is, I guess we need to install pyspark to be able to execute the sample code below.

python3 -c 'import hail as hl; hl.balding_nichols_model(3, 1000, 1000).show()'

As you can see below, this is the error message that I received after executing the above Hail installation verification command:

(bio) abcd@server1:~/hail/hail$ python3 -c 'import hail as hl; hl.balding_nichols_model(3, 1000, 1000).show()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/hail/__init__.py", line 31, in <module>
    from .table import Table, GroupedTable, asc, desc  # noqa: E402
  File "/home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/hail/table.py", line 4, in <module>
    import pyspark
ModuleNotFoundError: No module named 'pyspark'

In the prompt, bio indicates the Python virtual environment.

However, following the error message, if I install pyspark using pip install pyspark, I am able to see the matrix output.

Can I kindly request you to correct me if I am wrong?

You probably need to change your PYTHONPATH to include the path to the Spark Python files, as shown here: https://www.tutorialspoint.com/pyspark/pyspark_environment_setup.htm


You can also create a Python REPL or submit a script using the pyspark executable, which should be on your path.
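Roughly like this, assuming $SPARK_HOME/bin is on your PATH (the script name is just a placeholder):

pyspark                            # interactive Python REPL with Spark and py4j already on the path
spark-submit your_hail_script.py   # or submit a script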

Hi @tpoterba, I appreciate your help.

I keyed in the settings below in my .bashrc file, following the tutorial link that you shared:

export SPARK_HOME=/usr/spark
export PATH=$PATH:/usr/spark/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH

Later, I also ran source .bashrc.

After logging off and logging back into the server, I still get the same error:

ModuleNotFoundError: No module named 'pyspark'

Later, when I navigate to the /usr/spark folder and execute the command below

./bin/pyspark

I am able to see the output:

Python 3.7.2 (default, Sep 14 2020, 18:01:09)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
20/09/14 20:07:21 WARN Utils: Your hostname, test resolves to a loopback address: 127.0.1.1; using xxx.xx.xx.xxx instead (on interface enp14s0)
20/09/14 20:07:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/09/14 20:07:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.7.2 (default, Sep 14 2020 18:01:09)
SparkSession available as 'spark'.

Can you let me know how I can make Hail use the pyspark that ships with the Spark folder, rather than installing pyspark myself? I followed the tutorial as is, but I am not sure what the mistake is.

From a terminal, can you run the following and paste the output:

$ echo $SPARK_HOME
$ echo $PYTHONPATH
$ which python3
$ python3 -c "import pyspark"

Hi @tpoterba,

Thanks. I managed to resolve a few errors, but I still have a few unresolved issues.

$ echo $SPARK_HOME - /usr/spark
$ echo $PYTHONPATH - /usr/spark/python:/usr/spark/python/lib/py4j-0.10.7-src.zip:
$ which python3 - /home/abc/.pyenv/shims/python3  # this is the Python from my virtual environment (which is set as the global/system version)
$ python3 -c "import pyspark" - no errors and cursor moved to next line

So when I tried executing balding_nichols_model in my terminal using hail_script.py, I was able to get the output.

So, later I launched jupyter notebook using the following command.

jupyter notebook - This command launches Jupyter Notebook and prints a URL where it can be accessed.

So, I port-forward that and open it on my local laptop. When I execute

import hail as hl

I get an error: ImportError: No module named hail

I also have an env variable set in my .bashrc file: HAIL_HOME=/usr/hail/hail

In the Jupyter notebook, when I issue os.getenv('HAIL_HOME'), I get the same path above as output.

May I know what else I should be doing to make it all work in my Jupyter notebook?

This means pyspark is properly installed with python3.

You shouldn't need to set HAIL_HOME; make install-on-cluster will install a distribution using pip. I think the most likely issue is that the Jupyter notebook you're running is using a different Python kernel than the one you've listed above, so it doesn't have any of the libraries you expect.
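A quick way to confirm which interpreter the notebook kernel is using (plain Python, nothing Hail-specific) is to run this in a cell and compare it with the which python3 output from your terminal:

import sys
print(sys.executable)   # should match the path reported by `which python3`
print(sys.path)         # should include the Spark python/ and py4j entries from PYTHONPATH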