I admit that my setup is a bit messy… but I do not have total control over the stack.
Let me explain.
- I create a cluster with Spark installed on it
- I compile and install Hail on the cluster
- I configure Spark to work with Hail:
spark.jars: /opt/hail/hail-all-spark.jar
spark.driver.extraClassPath: /opt/hail/hail-all-spark.jar:....
spark.executor.extraClassPath: /opt/hail/hail-all-spark.jar:...
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator: is.hail.kryo.HailKryoRegistrator
PYSPARK_PYTHON: python3
PYTHONPATH: $PYTHONPATH:/opt/hail/python
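For context, the way I then start Hail on top of this config looks roughly like this (a minimal sketch, not my exact code; hl.init(sc=...) is standard Hail 0.2 usage, and getOrCreate reuses whatever SparkContext the session already has):

from pyspark import SparkContext
import hail as hl

# reuse the kernel's SparkContext if one exists, so Hail picks up
# the jar and Kryo settings configured above
sc = SparkContext.getOrCreate()
hl.init(sc=sc)
print(hl.version())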
With this config, I am able to run PySpark and Hail on my cluster.
The notebook actually runs in a different instance that communicates with my cluster. I am not fully sure of the situation there…
From the notebook, on the python3 kernel, I get:
import sys
print(sys.executable)
# /opt/conda/bin/python
print(sys.path)
# ['/home/notebook/work', '/opt/conda/lib/python37.zip', '/opt/conda/lib/python3.7', '/opt/conda/lib/python3.7/lib-dynload', '', '/opt/conda/lib/python3.7/site-packages', '/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg', '/opt/conda/lib/python3.7/site-packages/IPython/extensions', '/home/notebook/.ipython']
import os
os.popen("python --version").read()
# Python 3.7.3
os.popen("which python").read()
# /opt/conda/bin/python
os.popen("which pip3").read()
# /opt/conda/bin/pip3
os.popen("python -m pip show hail").read()
#
os.popen("which ipython").read()
#
os.popen("which jupyter").read()
# /opt/conda/bin/jupyter
os.popen("which pyspark").read()
#
If I read it right, this kernel uses a local install of Python 3 where there is no Hail (that makes sense, as I am in a container).
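A quicker way to confirm that than shelling out would be something like this (a minimal sketch, using only the standard library):

import importlib.util
# None means hail is not importable from this kernel's sys.path
print(importlib.util.find_spec("hail"))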
From the same notebook, on the pyspark kernel:
import sys
print(sys.executable)
# /mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/bin/python
print(sys.path)
# ['/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/spark-121ab6c3-bf4f-44b3-bc65-cdff82d02e3f/userFiles-5af21f5a-4a9d-4437-ad70-ef8280b216c4', '', '/opt/hail/python', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/local/lib64/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/local/lib/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib/python3.6', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6', '/usr/lib/python3.6', '/usr/lib64/python3.6/dist-packages', '/usr/lib/python3.6/dist-packages', '/usr/lib64/python3.6/lib-dynload']
import os
os.popen("python --version").read()
#
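The empty result above is probably an stdout/stderr issue rather than a missing binary: if /usr/bin/python is Python 2, it prints its version to stderr, which os.popen does not capture (an assumption on my side; redirecting stderr would confirm it):

# redirect stderr into stdout; Python 2 writes --version to stderr
os.popen("python --version 2>&1").read()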
os.popen("which python").read()
# /usr/bin/python
os.popen("which pip").read()
# /usr/local/bin/pip
os.popen("python -m pip show hail").read()
#
os.popen("python3 --version").read()
# Python 3.6.8
os.popen("which python3").read()
# /usr/bin/python3
os.popen("which pip3").read()
# /usr/local/bin/pip3
os.popen("python3 -m pip show hail").read()
#
os.popen("which iphyton").read()
#
os.popen("which jupyter").read()
#
os.popen("which pyspark").read()
#
Here I am more confused, but I am able to use Spark, so my guess is that this kernel uses the python3 that is on my cluster through Spark (the YARN application…).
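To check that guess from the pyspark kernel itself, something like this should work (a sketch that only relies on the sys.path output above):

import sys
# /opt/hail/python comes from the PYTHONPATH I set on the cluster, so seeing it
# here means the kernel really is running inside the YARN application
print("/opt/hail/python" in sys.path)
import hail as hl
# if the guess is right, this should point into /opt/hail/python, not a pip install
print(hl.__file__)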