How to install Hail on a Spark cluster

Hello Everyone,

I am new to this domain and am trying to set up Hail for the first time.

I would like to install Hail on a Spark cluster.

So, referring to the documentation, I executed the command below:

make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5

But I got the error message below:

make: *** No rule to make target ‘install-on-cluster’. Stop.

However, I am able to successfully install Hail locally using pip.

Can you guide me on what could be causing the make install-on-cluster command to fail and how I can resolve it?

Is there a step-by-step tutorial for beginners?

Hey @Aks,

Sorry you ran into this; our documentation is a little unclear. You need to run that command from the hail folder inside the hail repository.

git clone https://github.com/hail-is/hail.git
cd hail
cd hail
make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5

Hi @danking,

Thanks for the response; I appreciate your help. Just trying to understand: once I download Hail from the git repository, how is the connection established between Hail and the Spark installed on my server?

Do we have to make any changes?

Thanks a ton.

You can verify Hail is installed correctly by executing this:

python3 -c 'import hail as hl; hl.balding_nichols_model(3, 1000, 1000).show()'

You should see a matrix of random genotypes. You should also verify that your Spark cluster has a corresponding Spark job. If you do not see any new jobs in your Spark cluster's history viewer, then you probably accidentally installed pyspark via pip, which installs a version of Spark not intended for Spark clusters.
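
If you want to double-check which Spark your Python environment actually picks up, here is a quick sketch (the path and version printed are illustrative and will differ on your machine):

import pyspark

# If this prints a path under site-packages, pyspark was likely installed via pip
# and may shadow your cluster's Spark installation.
print(pyspark.__file__)
print(pyspark.__version__)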

You should not need to make any changes after running make install-on-cluster ...


Hi @danking,

Yes, it's working; I am able to see the matrix table. However, one quick question:

I am used to working with Jupyter notebooks and Python, so I wanted to use PySpark. So, typing the command below from the instructions in the documentation,

(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')

I got the Hail directory path, which is as shown below:

/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail

Later, when I run the following commands in a Jupyter notebook,

from pathlib import Path
import pyspark
import hail as hl

hail_home = Path('/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail')
hail_jars = hail_home/'build'/'libs'/'hail-all-spark.jar'

conf = pyspark.SparkConf().setAll([
    ('spark.jars', str(hail_jars)),
    ('spark.driver.extraClassPath', str(hail_jars)),
    ('spark.executor.extraClassPath', './hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
    ('spark.driver.memory', '80g'),
    ('spark.executor.memory', '80g'),
    ('spark.local.dir', '/tmp,/data/volume03/spark')
])

sc = pyspark.SparkContext('local[*]', 'Hail', conf=conf)
hl.init(sc)

I get an error as shown below

TypeError: 'JavaPackage' object is not callable

Is anything wrong with my hail_home path?

I realize that I don't have the build folder under hail_home, which is causing the issue with locating the Java package.

But the command in the docs gives the below path for hail_home:

/home/abcd/.pyenv/versions/3.7.2/envs/bio/lib/python3.7/site-packages/hail

Update:

I see it's under the backend folder. May I check why the path is different? Have I installed it in an incorrect location, since the documentation mentions a different location?

So I updated hail_jars = hail_home/'backend'/'hail-all-spark.jar'

Now it works, I guess.
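
For reference, here is a small sketch that finds the jar without hard-coding the build or backend subfolder (assuming Hail is importable in the same environment):

from pathlib import Path
import hail

# Search the installed hail package for the jar instead of guessing the subfolder.
hail_home = Path(hail.__file__).parent
print(list(hail_home.rglob('hail-all-spark.jar')))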

I think there's a misunderstanding here. If you are using pyspark or spark-submit, then you need to specify special Spark configuration variables. If you use python or a Jupyter notebook, you don't need to specify anything.

I'm also confused by your use of local[*]. If you have a Spark cluster, you should not use that; it means you want to run in local mode and not use the cluster at all.
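
For illustration only (you should not need any of this after install-on-cluster), the difference looks roughly like this; spark://spark-leader:7077 is a hypothetical standalone-cluster master URL, and on YARN it would just be 'yarn':

import pyspark
import hail as hl

conf = pyspark.SparkConf()  # plus the spark.jars / extraClassPath settings shown above

# 'local[*]' runs everything on the driver machine and ignores the cluster:
# sc = pyspark.SparkContext(master='local[*]', appName='Hail', conf=conf)

# To actually use the cluster, point at its master instead:
sc = pyspark.SparkContext(master='spark://spark-leader:7077', appName='Hail', conf=conf)
hl.init(sc)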

If you have a Spark cluster and you want to use a Jupyter Notebook, do this:

git clone https://github.com/hail-is/hail.git
cd hail
cd hail
make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5
python3 -m pip install jupyter
jupyter notebook --no-browser

Now, open a browser and connect to the Jupyter instance on your leader node: http://name-of-spark-leader-node:8888. There's no need to configure a Spark conf and no need to set up jars in Python. Just open a Python kernel and execute this:

import hail as hl
hl.init()
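
As a sanity check, you can then re-run the same command from earlier in this thread; a corresponding Spark job should appear in your cluster's history viewer:

hl.balding_nichols_model(3, 1000, 1000).show()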