SSLException: connection reset during matrixtable.write

I’ve hit this for the first time: when I write a matrix table with about 2.74 million variants x 11,700 samples, I get an exception. I actually see activity the whole time, with four consecutive progress bars, like so:
[Stage 29:==========> (2417 + 256) / 11806]
but at that point I get a stack trace; see below.

A quick check with a tiny mt (7k variants x 11.8k samples) doesn’t cause the crash.

We’ve been writing similar and larger matrix tables for a while now, so this is puzzling; any suggestions welcome.

Vivek

mt_filtered.write("s3a://intervalwgs-qc/chr20_filtered_vvi.mt", overwrite=True)
2019-06-03 20:46:16 Hail: INFO: Coerced sorted dataset
[Stage 29:==========> (2417 + 256) / 11806]Traceback (most recent call last):
File "", line 1, in
File "</opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/decorator.py:decorator-gen-946>", line 2, in write
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/typecheck/check.py", line 561, in wrapper
return original_func(*args, **kwargs)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/matrixtable.py", line 2494, in write
Env.backend().execute(MatrixWrite(self._mir, writer))
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 106, in execute
result = json.loads(Env.hail().backend.spark.SparkBackend.executeJSON(self._to_java_ir(ir)))
File "/opt/sanger.ac.uk/hgi/spark-2.4.3-bin-hgi-hadoop2.7.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/utils/java.py", line 240, in deco
'Error summary: %s' % (deepest, full, hail.version, deepest)) from None
hail.utils.java.FatalError: SocketException: Connection reset

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1637 in stage 29.0 failed 4 times, most recent failure: Lost task 1637.3 in stage 29.0 (TID 121871, 192.168.226.78, executor 29): javax.net.ssl.SSLException: Connection reset

We have seen this issue before when using S3. I vastly prefer using the forum for support (thank you!) since it’s more searchable, but there’s a Zulip conversation about this.

It seems to be related to holding file streams open for too long. It can possibly be solved by rewriting the pipeline to be more efficient, by inserting intermediate write/read steps (a sketch follows below), or by decreasing partition size (though it looks like you have small partitions already).
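For example, a minimal sketch of the intermediate write/read idea; the intermediate path and the partition count here are just illustrative, substitute your own:

import hail as hl

# hypothetical intermediate path -- pick a scratch location on your bucket
tmp_path = "s3a://intervalwgs-qc/chr20_filtered_intermediate.mt"

# write the intermediate result and read it back, so the final write starts
# from freshly opened file streams rather than one long-running pipeline
mt_filtered.write(tmp_path, overwrite=True)
mt_filtered = hl.read_matrix_table(tmp_path)

# optionally repartition before the final write (illustrative count)
mt_filtered = mt_filtered.repartition(2000)
mt_filtered.write("s3a://intervalwgs-qc/chr20_filtered_vvi.mt", overwrite=True)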

You might also try using stage_locally=True in the write. That causes Hail to write the output to a node-local file (e.g. in /tmp) and then copy the file to the final destination in one shot after the write is complete.
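A minimal sketch, reusing the write call from your post:

# stage_locally=True writes the output to node-local temporary storage first,
# then copies it to the final S3 destination once the write is complete,
# keeping the S3 connections short-lived
mt_filtered.write("s3a://intervalwgs-qc/chr20_filtered_vvi.mt", overwrite=True, stage_locally=True)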