Bokeh not loading on pyspark kernel

Hi,
I use Hail through a Jupyter notebook with the PySpark kernel.
I am able to load VCFs, apply some filters, and so on.
However, I have an issue with Bokeh and visualization.

If I run the python3 kernel, I am able to load Bokeh and plot a graph,
but if I run the pyspark kernel, Bokeh does not load and the plot does not appear…

Any suggestions for the Spark config?

Are there any messages printed in your Jupyter notebook? My first guess is a PYTHONPATH issue. This should be unrelated to the Spark configuration, because everything here happens in Python/Jupyter on the leader node.

Generally, we do not recommend using the pyspark kernel. Do you have a particular reason for using it?

I am a bit puzzled…
I use Hail on a Spark cluster so that I can distribute my load across multiple CPUs.
If I understand correctly, the pyspark kernel is aware of the cluster and executes scripts in a distributed manner, whereas the python3 kernel executes scripts only on the master node…

On the pyspark kernel, I am able to init Hail:

# Import and launch Hail
import hail as hl
hl.init(sc=spark.sparkContext)

but I get a warning when I call output_notebook():

from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

/usr/local/lib/python3.6/site-packages/IPython/paths.py:68: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
  " using a temp directory.".format(parent))

On the python3 kernel, I am able to load Bokeh:

from bokeh.io import push_notebook, show, output_notebook
output_notebook()

but Hail is not found…

import hail as hl

ModuleNotFoundError: No module named 'hail'

This is an issue of multiple Python installations. Let’s approach it in two pieces:

PySpark vs Python

pyspark is a Python library. It is installed when you install Spark [1]. If you want to know where the pyspark library lives, run this:

python3 -c 'import pyspark; print(pyspark.__file__)'

The pyspark Python library includes an executable also called pyspark. If you want to know where that lives, you can run:

which pyspark

This should be a bash script. You can take a look at it with:

cat $(which pyspark)

This script tries to find a Spark installation and then execute spark-submit. This is one way of interacting with a Spark cluster. You can also interact with a Spark cluster directly from Python. Instead of using spark-submit, Python itself opens a network connection to the Spark leader process and sends jobs over that network connection.

To be very concrete, this:

pyspark

has roughly the same effect as:

# python3
Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> sc = pyspark.SparkContext(master='spark://NAME_OF_LEADER_NODE:7077')

The hail equivalent of the above is:

# python3
Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> hl.init(master='spark://NAME_OF_LEADER_NODE:7077')

Hail Installation

I’m confident that you have multiple versions of python that do not know how to find one another’s packages. What is the output of:

which pip
which pip3
which python3
which python
which ipython
which jupyter
pip show hail
pip3 show hail
python3 -m pip show hail
python -m pip show hail
ipython -m pip show hail

Inside a Jupyter Notebook with a pyspark kernel and again inside a Jupyter Notebook with a python3 kernel, please execute:

import sys
print(sys.executable)
print(sys.path)

I’m happy to help you debug further, but I suspect the issue will become clear when you see all these paths. You might also find this GitHub issue helpful.
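If it helps, here is a small stdlib-only snippet (the package names are just the ones from this thread) that you can paste into each kernel to see which packages that kernel's interpreter can actually import, and from where:

```python
# Report the interpreter running this kernel and, for each package of
# interest, the file it would be imported from (or that it is missing).
import importlib.util
import sys

print("interpreter:", sys.executable)
for name in ("hail", "bokeh", "pyspark"):
    spec = importlib.util.find_spec(name)
    print(name, "->", spec.origin if spec else "NOT FOUND")
```

Comparing the output of this in the two kernels should make any split-installation problem obvious.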


[1] You can also pip install pyspark, but that version of pyspark is not a full Spark installation; see Python Packaging on the PySpark GitHub.

I admit that my setup is a bit messy… but I do not have total control over the stack.

Let me give you some explanation.

  • I create a cluster with Spark installed on it
  • I compile and install Hail on the cluster
  • I configure Spark to work with Hail:
spark.jars: /opt/hail/hail-all-spark.jar
spark.driver.extraClassPath: /opt/hail/hail-all-spark.jar:....
spark.executor.extraClassPath: /opt/hail/hail-all-spark.jar:...
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator: is.hail.kryo.HailKryoRegistrator
PYSPARK_PYTHON: python3
PYTHONPATH: $PYTHONPATH:/opt/hail/python 

With this config, I am able to run PySpark and Hail on my cluster.

The notebook actually runs on a different instance that communicates with my cluster. I am not fully sure of the situation there…

From the notebook, on the python3 kernel, I get:

import sys
print(sys.executable)
// /opt/conda/bin/python
print(sys.path)
// ['/home/notebook/work', '/opt/conda/lib/python37.zip', '/opt/conda/lib/python3.7', '/opt/conda/lib/python3.7/lib-dynload', '', '/opt/conda/lib/python3.7/site-packages', '/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg', '/opt/conda/lib/python3.7/site-packages/IPython/extensions', '/home/notebook/.ipython']
import os
os.popen("python --version").read()
// Python 3.7.3
os.popen("which python").read()
// /opt/conda/bin/python
os.popen("which pip3").read()
// /opt/conda/bin/pip3
os.popen("python -m pip show hail").read()
//
os.popen("which iphyton").read()
//
os.popen("which jupyter").read()
// /opt/conda/bin/jupyter
os.popen("which pyspark").read()
//

If I read it right, this kernel uses a local install of Python 3 where there is no Hail (that makes sense, as I am in a container).
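One way to test that hypothesis would be something like the following sketch, assuming (and this may well not hold, since the notebook runs in a separate container) that the Hail checkout from the cluster config, /opt/hail/python, is visible from the notebook instance:

```python
# Hedged sketch: extend sys.path so the python3 kernel can find the Hail
# checkout from the cluster config. This only helps if that path actually
# exists on the notebook instance.
import sys

HAIL_PYTHON_DIR = "/opt/hail/python"  # value taken from PYTHONPATH in the cluster config
if HAIL_PYTHON_DIR not in sys.path:
    sys.path.insert(0, HAIL_PYTHON_DIR)
```

If `import hail` still fails after this, the directory is simply not mounted in the notebook container.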

From the same notebook, on the pyspark kernel:

import sys
print(sys.executable)
# /mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/bin/python
print(sys.path)
# ['/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/spark-121ab6c3-bf4f-44b3-bc65-cdff82d02e3f/userFiles-5af21f5a-4a9d-4437-ad70-ef8280b216c4', '', '/opt/hail/python', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/local/lib64/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/local/lib/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib/python3.6', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6/site-packages', 
'/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6', '/usr/lib/python3.6', '/usr/lib64/python3.6/dist-packages', '/usr/lib/python3.6/dist-packages', '/usr/lib64/python3.6/lib-dynload']
import os
os.popen("python --version").read()
#
os.popen("which python").read()
# /usr/bin/python
os.popen("which pip").read()
# /usr/local/bin/pip
os.popen("python -m pip show hail").read()
#
os.popen("python3 --version").read()
# Python 3.6.8
os.popen("which python3").read()
# /usr/bin/python3
os.popen("which pip3").read()
# /usr/local/bin/pip3
os.popen("python3 -m pip show hail").read()
#
os.popen("which iphyton").read()
#
os.popen("which jupyter").read()
#
os.popen("which pyspark").read()
#

Here I am more confused, but I am able to use Spark, so my guess is that this kernel uses the python3 that is on my cluster, through Spark (the YARN application…).

The easiest thing will be to get the pyspark kernel working. Do you start Jupyter yourself, or is it a service? The problem is that the user as which Jupyter executes does not have write permissions to the directory from which Jupyter was started. You’ll need to change the permissions, change the user, or change the directory. You can determine the user of the Jupyter process by inspecting the output of ps aux.
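As a quick, stdlib-only sanity check ('/home' is the path from the UserWarning earlier in the thread), you can ask Python directly whether the notebook's user can write to that directory:

```python
# Check whether the current process may write to the directory named in
# the IPython UserWarning ('/home' in this thread).
import os

start_dir = "/home"
print(start_dir, "writable:", os.access(start_dir, os.W_OK))
```

If this prints False from inside the pyspark kernel, that confirms the permissions diagnosis.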

In my current setup the notebook runs as a service. I was in contact with support, and I solved the permission issue.

Now, when I load Bokeh I do not get an error message, but I still cannot see the Bokeh logo or the output (plot). Is the problem that the Bokeh code is executed on the cluster node, so the notebook does not receive the response?

I think this is a JavaScript issue, which is not super easy to debug. Fortunately, I know that the Terra folks have solved a similar problem in the last several months. I’ve reached out to them and will let you know when I have more info!
