Fatal Python error: Cannot recover from stack overflow when importing MatrixTable

I built Hail 0.2.70 from source on RHEL 7, then downloaded spark-3.1.2 and set PATH and PYTHONPATH for it. While building Hail I had to substitute -std=c++14 for -std=c++1y in the Makefile for the build to succeed.
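For reference, this is roughly how I sanity-check that environment from Python (the paths here are from my setup):

import os
import sys

# Sanity check of the setup described above; adjust the paths to your machine.
print(os.environ.get('SPARK_HOME'))   # expect /spark/spark-3.1.2-bin-hadoop3.2
print(os.environ.get('PYTHONPATH'))   # should include Spark's python/ directory

import pyspark                        # fails here if PYTHONPATH is wrong
print(pyspark.__version__)            # expect 3.1.2

With that in place, I launch ipython, import hail, and try to read a MatrixTable, which gives me the following error: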

In [2]: mt = hl.read_matrix_table('file:///batch109.mt')                                                                                 
Initializing Hail with default parameters...
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-06-24 15:54:13 WARN  Hail:43 - This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.
  Compatibility is not guaranteed.
Running on Apache Spark version 3.1.2
SparkUI available at http://ai-grisnodedev1:4040
Initializing Hail with default parameters...
....
....
Fatal Python error: Cannot recover from stack overflow.

Thread 0x00007f96affff700 (most recent call first):
  File "///.conda/envs/py37/lib/python3.7/socket.py", line 589 in readinto
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 57 in handle
  File "///.conda/envs/py37/lib/python3.7/socketserver.py", line 720 in __init__
  File "///.conda/envs/py37/lib/python3.7/socketserver.py", line 360 in finish_request
  File "///.conda/envs/py37/lib/python3.7/socketserver.py", line 650 in process_request_thread
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 870 in run
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f966f196700 (most recent call first):
  File "///.conda/envs/py37/lib/python3.7/selectors.py", line 415 in select
  File "///.conda/envs/py37/lib/python3.7/socketserver.py", line 232 in serve_forever
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 870 in run
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f96acff9700 (most recent call first):
  File "///.conda/envs/py37/lib/python3.7/selectors.py", line 415 in select
  File "///.conda/envs/py37/lib/python3.7/socketserver.py", line 232 in serve_forever
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 870 in run
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f96b582b700 (most recent call first):
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 296 in wait
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 552 in wait
  File "///.conda/envs/py37/lib/python3.7/site-packages/IPython/core/history.py", line 829 in run
  File "///.conda/envs/py37/lib/python3.7/site-packages/IPython/core/history.py", line 58 in needs_sqlite
  File "<decorator-gen-24>", line 2 in run
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "///.conda/envs/py37/lib/python3.7/threading.py", line 890 in _bootstrap

Current thread 0x00007f96becef740 (most recent call first):
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 473 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 497 in __init__
  File "///.conda/envs/py37/lib/python3.7/traceback.py", line 104 in print_exception
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 566 in formatException
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 616 in format
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 869 in format
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 1025 in emit
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 894 in handle
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 1586 in callHandlers
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 1524 in handle
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 1514 in _log
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 1407 in error
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 1956 in error
  File "///.conda/envs/py37/lib/python3.7/logging/__init__.py", line 1964 in exception
  File "///spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1051 in send_command
  File "///spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1695 in __getattr__
  File "///spark/spark-3.1.2-bin-hadoop3.2/python/pyspark/conf.py", line 120 in __init__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 128 in __init__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/context.py", line 252 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/typecheck/check.py", line 577 in wrapper
  File "<decorator-gen-1774>", line 2 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 55 in hc
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 67 in backend
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 72 in py4j_backend
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 41 in jutils
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 24 in deco
  File "///spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305 in __call__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 174 in __init__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/context.py", line 252 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/typecheck/check.py", line 577 in wrapper
  File "<decorator-gen-1774>", line 2 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 55 in hc
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 67 in backend
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 72 in py4j_backend
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 41 in jutils
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 24 in deco
  File "///spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305 in __call__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 174 in __init__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/context.py", line 252 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/typecheck/check.py", line 577 in wrapper
  File "<decorator-gen-1774>", line 2 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 55 in hc
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 67 in backend
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 72 in py4j_backend
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 41 in jutils
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 24 in deco
  File "///spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305 in __call__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 174 in __init__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/context.py", line 252 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/typecheck/check.py", line 577 in wrapper
  File "<decorator-gen-1774>", line 2 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 55 in hc
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 67 in backend
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 72 in py4j_backend
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/utils/java.py", line 41 in jutils
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 24 in deco
  File "///spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305 in __call__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 174 in __init__
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/context.py", line 252 in init
  File "///.conda/envs/py37/lib/python3.7/site-packages/hail/typecheck/check.py", line 577 in wrapper
  File "<decorator-gen-1774>", line 2 in init
  ...
Aborted (core dumped)

I get the same error even if I install 0.2.70 with a plain pip install hail instead of building from source.

What happens when you run this?

python3 -c 'import pyspark; pyspark.SparkContext()'

Just the usual startup output; nothing else happens:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Hmm.

So, something goes wrong when Hail tries to create a SparkConf. Whatever goes wrong also triggers a stack overflow in Python and we lose the real error message.

Can you start ipython as if you were going to use Hail and then try this:

import pyspark
conf = pyspark.SparkConf()

I expect that to fail. If that fails, then I think there’s a problem with PySpark or Spark. Can you share the output of:

echo $SPARK_HOME
find_spark_home.py
echo $JAVA_HOME
java -version

If the SparkConf line doesn’t fail, can you then try this:

import hail as hl
hl.init()

If that fails, can you copy all the output (you can use gist.github.com if it’s large) and share it with us? Can you also find the hail log file (should be in the current working directory) and upload that somewhere to share with us?
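If the log file is hard to locate, you can also pass an explicit log path to init so it lands somewhere predictable (log is a parameter of hl.init):

import hail as hl

# Write the Hail log to a fixed location instead of the current
# working directory, so it's easy to find and upload.
hl.init(log='/tmp/hail-debug.log')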

I tried it, and it does not fail.

echo $SPARK_HOME
/spark/spark-3.1.2-bin-hadoop3.2
echo $JAVA_HOME
/usr/lib/jvm/java
java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)

hl.init() does fail. Here it is:

Hmm.

I do not understand why the stack overflow error occurs. Nor do I understand why so many unnecessary stack frames are shown.

I copied and pasted the true error below. I’ve not seen this error before, but Google suggests it is due to a misconfigured network. In particular, a node named ai-grisnodedev1 is unable to talk to itself on port 9000. Were there any changes to your firewalls or networking configuration recently?

The Hadoop wiki has some suggestions about debugging this: ConnectionRefused (http://wiki.apache.org/hadoop/ConnectionRefused).
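As a quick first check, something like this from Python will tell you whether anything is listening on that port at all (the hostname and port are taken from the error message below):

import socket

# ConnectionRefusedError here confirms nothing is listening on port 9000,
# matching the Hadoop error below.
with socket.create_connection(('ai-grisnodedev1', 9000), timeout=5) as s:
    print('connected to', s.getpeername())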

Py4JJavaError: An error occurred while calling o31.exists.
: java.net.ConnectException: Call From ai-grisnodedev1/137.187.60.61 to ai-grisnodedev1:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
	at org.apache.hadoop.ipc.Client.call(Client.java:1457)
	at org.apache.hadoop.ipc.Client.call(Client.java:1367)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:903)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1665)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1582)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1594)
	at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:164)
	at is.hail.io.fs.FS.exists(FS.scala:183)
	at is.hail.io.fs.FS.exists$(FS.scala:181)
	at is.hail.io.fs.HadoopFS.exists(HadoopFS.scala:70)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:690)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:794)
	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:411)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1572)
	at org.apache.hadoop.ipc.Client.call(Client.java:1403)
	... 35 more


We can run Hail 0.2.57 fine, so it's certainly related to the way Hail itself is implemented, and perhaps to network access that was always disallowed but simply wasn't required by previous versions of Hail?

This appears to be in the communication layer between Hail’s frontend and backend. That hasn’t changed meaningfully since 0.2.57, with the exception of switching to Spark 3.
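If that namenode on port 9000 is intentionally not running and you only need local files, one workaround that may be worth trying is to make the local filesystem the default, so Hail's exists()/fileStatus() checks never try to reach hdfs://ai-grisnodedev1:9000. This is only a sketch, assuming hl.init's spark_conf parameter (present in recent 0.2 releases) and that nothing else in your setup relies on HDFS being the default filesystem:

import hail as hl

# spark.hadoop.* settings are forwarded to the Hadoop Configuration;
# fs.defaultFS controls which filesystem bare paths resolve to.
hl.init(spark_conf={'spark.hadoop.fs.defaultFS': 'file:///'})

mt = hl.read_matrix_table('file:///batch109.mt')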