Bokeh not loading on pyspark kernel

Hi,
I use Hail through a Jupyter notebook and the PySpark kernel.
I am able to load VCFs, apply some filters and such.
I have an issue with bokeh and the visualization.

If I run the python3 kernel, I am able to load bokeh and plot a graph,
but if I run the pyspark kernel, bokeh does not load and the plot does not appear…

Any suggestion for the Spark config?

Are there any messages printed to your Jupyter Notebook? My first guess is a PYTHONPATH issue. This should be unrelated to the Spark configuration because everything is happening in Python/Jupyter on the leader node.
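
A quick way to check, from inside each kernel (just a diagnostic sketch), is to print which interpreter the kernel runs and whether it can import bokeh and hail:

import sys
print(sys.executable)  # which Python interpreter this kernel is running

# report whether bokeh and hail are importable, and from where
for name in ("bokeh", "hail"):
    try:
        mod = __import__(name)
        print(name, "->", mod.__file__)
    except ImportError as e:
        print(name, "not importable:", e)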

Generally, we do not recommend using the pyspark kernel. Do you have a particular reason for using the pyspark kernel?

I am a bit puzzled…
I use Hail on a Spark cluster so that I can distribute my load across multiple CPUs.
If I understand well, the pyspark kernel is aware of the cluster and executes scripts in a distributed manner, whereas the python3 kernel executes scripts only on the master node…

On a pyspark kernel, I am able to init Hail:

# Import and launch Hail
import hail as hl
hl.init(sc=spark.sparkContext)

but I get a warning when calling output_notebook():

from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

/usr/local/lib/python3.6/site-packages/IPython/paths.py:68: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
  " using a temp directory.".format(parent))

On the python3 kernel, I am able to load bokeh:

from bokeh.io import push_notebook, show, output_notebook
output_notebook()

but Hail is not found…

import hail as hl

ModuleNotFoundError: No module named 'hail'

This is an issue of multiple python installations. Let’s approach this in two pieces:

PySpark vs Python

pyspark is a Python library. It is installed when you install Spark [1]. If you want to know where the pyspark library lives, run this:

python3 -c 'import pyspark; print(pyspark.__file__)'

The pyspark Python library includes an executable also called pyspark. If you want to know where that lives, you can run:

which pyspark

This should be a bash script. You can take a look at it with:

cat $(which pyspark)

This script tries to find a Spark installation and then execute spark-submit. This is one way of interacting with a Spark cluster. You can also interact with a Spark cluster directly from Python. Instead of using spark-submit, Python itself opens a network connection to the Spark leader process and sends jobs over that network connection.

To be very concrete, this:

pyspark

has roughly the same effect as:

# python3
Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> sc = pyspark.SparkContext(master='spark://NAME_OF_LEADER_NODE:7077')

The hail equivalent of the above is:

# python3
Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> hl.init(master='spark://NAME_OF_LEADER_NODE:7077')

Hail Installation

I’m confident that you have multiple versions of python that do not know how to find one another’s packages. What is the output of:

which pip
which pip3
which python3
which python
which ipython
which jupyter
pip show hail
pip3 show hail
python3 -m pip show hail
python -m pip show hail
ipython -m pip show hail

Inside a Jupyter Notebook with a pyspark kernel and again inside a Jupyter Notebook with a python3 kernel, please execute:

import sys
print(sys.executable)
print(sys.path)

I’m happy to help you debug further, but I suspect the issue will become clear when you see all these paths. You might also find this GitHub issue helpful.


[1] You can also pip install pyspark, but this version of pyspark is not a full Spark installation; see "Python Packaging" on the PySpark GitHub.

I admit that my setup is a bit messy… but I do not have total control over the stack.

Let me give you some explanation.

  • I create a cluster with Spark installed on it
  • I compile and install Hail on the cluster
  • I configure Spark to work with Hail (see the spark-submit sketch after this list):
spark.jars: /opt/hail/hail-all-spark.jar
spark.driver.extraClassPath: /opt/hail/hail-all-spark.jar:....
spark.executor.extraClassPath: /opt/hail/hail-all-spark.jar:...
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator: is.hail.kryo.HailKryoRegistrator
PYSPARK_PYTHON: python3
PYTHONPATH: $PYTHONPATH:/opt/hail/python 
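
For reference, a rough spark-submit equivalent of the spark.* settings above would be (a sketch only, assuming the jar sits at the same path on every node; my_hail_script.py is a placeholder):

spark-submit \
  --jars /opt/hail/hail-all-spark.jar \
  --conf spark.driver.extraClassPath=/opt/hail/hail-all-spark.jar \
  --conf spark.executor.extraClassPath=/opt/hail/hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  my_hail_script.py

In my actual setup these live in the cluster configuration instead of being passed on the command line.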

With this config, I am able to run PySpark and Hail on my cluster.

The notebook is actually running in a different instance that communicates with my cluster. I am not fully sure of the situation there…

From the notebook, on the python3 kernel, I get:

import sys
print(sys.executable)
// /opt/conda/bin/python
print(sys.path)
// ['/home/notebook/work', '/opt/conda/lib/python37.zip', '/opt/conda/lib/python3.7', '/opt/conda/lib/python3.7/lib-dynload', '', '/opt/conda/lib/python3.7/site-packages', '/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg', '/opt/conda/lib/python3.7/site-packages/IPython/extensions', '/home/notebook/.ipython']
import os
os.popen("python --version").read()
// Python 3.7.3
os.popen("which python").read()
// /opt/conda/bin/python
os.popen("which pip3").read()
// /opt/conda/bin/pip3
os.popen("python -m pip show hail").read()
//
os.popen("which iphyton").read()
//
os.popen("which jupyter").read()
// /opt/conda/bin/jupyter
os.popen("which pyspark").read()
//

If I read it right, this kernel uses a local install of python3 where there is no Hail (that makes sense as I am in a container).
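
As a side note, I suppose I could make Hail importable from this kernel by installing the PyPI wheel into the conda environment it uses. I have not tried this, and it assumes the container has network access and a wheel compatible with my Spark version:

# install Hail into the interpreter that the python3 kernel reports as sys.executable
/opt/conda/bin/python -m pip install hail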

From the same notebook, on the pyspark kernel:

import sys
print(sys.executable)
# /mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/bin/python
print(sys.path)
# ['/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/spark-121ab6c3-bf4f-44b3-bc65-cdff82d02e3f/userFiles-5af21f5a-4a9d-4437-ad70-ef8280b216c4', '', '/opt/hail/python', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/pyspark.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/py4j-0.10.7-src.zip', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/local/lib64/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/local/lib/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib/python3.6', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib/python3.6/site-packages', '/mnt/yarn/usercache/user_hebrardms/appcache/application_1574904514169_0001/container_1574904514169_0001_01_000001/tmp/1574906861363-0/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6', '/usr/lib/python3.6', '/usr/lib64/python3.6/dist-packages', '/usr/lib/python3.6/dist-packages', '/usr/lib64/python3.6/lib-dynload']
import os
os.popen("python --version").read()
#
os.popen("which python").read()
# /usr/bin/python
os.popen("which pip").read()
# /usr/local/bin/pip
os.popen("python -m pip show hail").read()
#
os.popen("python3 --version").read()
# Python 3.6.8
os.popen("which python3").read()
# /usr/bin/python3
os.popen("which pip3").read()
# /usr/local/bin/pip3
os.popen("python3 -m pip show hail").read()
#
os.popen("which iphyton").read()
#
os.popen("which jupyter").read()
#
os.popen("which pyspark").read()
#

Here I am more confused, but I am able to use Spark, so my guess is that this kernel uses the python3 that is on my cluster, through Spark (the YARN application…).

The easiest thing will be to get the pyspark kernel working. Do you start Jupyter yourself or is it a service? The problem is that the user as which Jupyter is executing does not have write permissions to the directory from which Jupyter was started. You’ll need to change the permissions, change the user, or change the directory. You can determine the user of the Jupyter process by inspecting the output of ps aux.
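
For example (a sketch only; the directory below is a placeholder to adapt to your setup):

ps aux | grep -i jupyter    # the first column shows which user runs the notebook server
ls -ld /home/notebook       # check ownership of the directory Jupyter starts from (placeholder path)
# then either chown/chmod that directory for the Jupyter user,
# or start Jupyter from a directory that user can write to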

In my current setup the notebook runs as a service. I was in contact with the support team and I solved the permission issue.

Now, when I load bokeh I do not get an error message, but I still cannot see the bokeh logo or the output (plot). Is it a problem that the bokeh code is executed on the cluster node, so the notebook does not receive the response?

I think this is a javascript issue, which is not super easy to debug. Fortunately I know that the Terra folks have solved a similar problem in the last several months. I’ve reached out to them and will let you know when I have more info!


@mhebrard if you open the “Developer Console” in Safari / Chrome / whatever-browser-you’re-using, do you see any error messages?

@danking if I create a new notebook and load bokeh or hl.plot.output_notebook(), I do not see any error (but the bokeh logo does not appear). If I open a notebook with code already inside, I get the error "Bokeh: ERROR: autoload.js configured with elementid '1001' but no matching script tag was found."

Which web browser are you using and what version is it? And what OS do you use?

I am using Chrome Version 79.0.3945.79 on Windows 10 to navigate JupyterLab through the AWS web console. The notebook itself runs under RHEL 6 and communicates with the Spark cluster.

Hi here.
Some evolution on this topic. I was a bit fed up with my situation of a Jupyter notebook and Hail running in multiple layers of containers on which I do not have full configuration access…

So I tried Zeppelin instead of Jupyter…

  • I managed to get Zeppelin running pyspark with python3
%pyspark
import sys
print(sys.executable)
# /usr/bin/python3
print(sys.version)
# 3.6.8 
  • I managed to get a BokehJS plot running on pyspark and displayed within the notebook
%pyspark
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
import bkzep
output_notebook(notebook_type='zeppelin')

f=figure()
f.line(x=[1,2],y=[3,4])
show(f)
# Here I see the Loading BokehJS... Then the plot
  • I managed to get Hail running on pyspark
%pyspark
import hail as hl
hl.init(sc)
# Here I see Hail logo (v0.2.25) 

The problem is that in order to see a bokeh plot in the notebook I need the bkzep library, which (if I understand well) extends output_notebook() and does the magic for Zeppelin. But Hail uses another output_notebook() that cannot be extended by bkzep:

%pyspark
from hail.plot import show, output_notebook
from pprint import pprint
import bkzep
output_notebook(notebook_type='zeppelin')
Fail to execute line 4: output_notebook(notebook_type='zeppelin')
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-7073256350841673165.py", line 380, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 4, in <module>
TypeError: output_notebook() got an unexpected keyword argument 'notebook_type'

Do you think there is a better chance, with this config, of being able to leverage the Hail plotting functions with bokeh in a full pyspark notebook?

Hail’s output_notebook is just an alias for bokeh.io.output_notebook. Just replace that call with your bkzep one that works!

I succeeded in plotting with bokeh and Hail in a Zeppelin notebook!!

%pyspark
# Import bokeh
from bokeh.io import show, output_notebook
import bkzep
output_notebook(notebook_type='zeppelin')
# Import hail
import hail as hl
hl.init(sc)
# Load a mt
mt = hl.read_matrix_table("s3://npmchorus-gnomad/hailoutput/SG10K_Health_test.n96.jc.VQSR-pass-only.split-multiallelic.vep95.vcf.mt")
# Plot DP
p = hl.plot.histogram(mt.DP, range=(0,100), bins=50, title='DP Histogram', legend='DP')
show(p)
# Draw a nice interactive histogram

I consider my problem of plotting solved for now.
Thanks for your help !!

Hi mhebrard,

I encountered the same scenario of not being able to plot with pyspark and just followed your comment about importing bkzep and notebook_type='zeppelin', but I am getting the error 'jinja2.exceptions.UndefinedError: 'bundle' is undefined'. I also searched the other link where you had raised this issue, https://github.com/bokeh/bokeh/issues/9854, but I did not find any answer there. Could you please tell me how you solved this issue?

Hi,

In my installation of Hail dependencies, I force the version of bokeh to 2.0.0:

echo '# Install hail dependencies #'
WHEELS="bkzep
bokeh==2.0.0
boto3
decorator
ipython
paramiko
parsimonious
pandas==0.25
pyyaml
requests
scipy"
for WHEEL_NAME in $WHEELS
do
  sudo python -m pip install $WHEEL_NAME
done

But the GitHub issue mentions that this is solved in 2.0.1… I haven't tested since (I am still using 2.0.0 currently).

Hi guys,

I am taking this topic out of the catacombs to ask for help related to the same issue pointed out by @mhebrard. To get around this problem I am applying the sorcery below to send the plot directly to S3. Bokeh/Hail is still not working with plots.

from bokeh.embed import file_html
from bokeh.resources import CDN

# render the figure to a standalone HTML document and write it to S3
html = file_html(p, CDN, "Chart")
with hl.utils.hadoop_open("s3://PATH.html", "w") as f:
    f.write(html)

Any progress on this issue?


From what I can tell, it seems that some libraries are missing in the Jupyter notebook pyspark kernel. We do not get the nice HTML display that the python3 kernel gets,
i.e. no bokeh plots and no prettified tables.
I have not found a magic solution to this issue yet. Hope someone can find a workaround…
One lead might be the Databricks internal function that seems to successfully display HTML in a pyspark kernel… but I was not able to replicate that.
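
In the same spirit as the S3 workaround above, one thing that might be worth trying (I have not tested it in a pyspark kernel, and it assumes the kernel can relay rich IPython output to the browser at all) is to render the figure to standalone HTML and hand it to IPython's display machinery instead of relying on output_notebook():

from bokeh.embed import file_html
from bokeh.resources import CDN
from IPython.display import HTML, display

# Render the Bokeh figure `p` to a self-contained HTML document and ask
# IPython to display it; if the kernel only relays plain text, this will
# not help, and writing the HTML to S3 (as above) remains the fallback.
display(HTML(file_html(p, CDN, "Chart")))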