SSLException: connection reset during matrixtable.write

I've hit this for the first time: when I write a MatrixTable of about 2.74 million variants x 11,700 samples, I get an exception. I see activity the whole time, with four consecutive progress bars, like so:
[Stage 29:==========> (2417 + 256) / 11806]
but at that point I get a stack trace - see below.

A quick check with a tiny mt (7k variants x 11.8k samples) doesn’t cause the crash.

We've been writing similar or larger matrix tables for a while now, so this is puzzling; any suggestions welcome.


mt_filtered.write("s3a://intervalwgs-qc/", overwrite=True)
2019-06-03 20:46:16 Hail: INFO: Coerced sorted dataset
[Stage 29:==========> (2417 + 256) / 11806]Traceback (most recent call last):
File "", line 1, in
File "</opt/>", line 2, in write
File "/opt/", line 561, in wrapper
return original_func(*args, **kwargs)
File "/opt/", line 2494, in write
Env.backend().execute(MatrixWrite(self._mir, writer))
File "/opt/", line 106, in execute
result = json.loads(Env.hail().backend.spark.SparkBackend.executeJSON(self._to_java_ir(ir)))
File "/opt/", line 1257, in call
File "/opt/", line 240, in deco
'Error summary: %s' % (deepest, full, hail.version, deepest)) from None SocketException: Connection reset

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1637 in stage 29.0 failed 4 times, most recent failure: Lost task 1637.3 in stage 29.0 (TID 121871, executor 29): Connection reset

We have seen this issue before when using S3. There's a Zulip conversation about it, but I vastly prefer using the forum for support (thank you!) since it's more searchable.

It seems to be related to holding file streams open for too long. It can sometimes be solved by rewriting the pipeline to be more efficient (or by inserting intermediate write/read steps), or by decreasing partition size (though it looks like you already have small partitions).
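As a rough sketch of the intermediate write/read idea (the paths and the filtering step below are hypothetical, and this assumes Hail 0.2, where MatrixTable.checkpoint does exactly this write-then-read-back):

```python
import hail as hl

hl.init()

# Hypothetical input path for illustration.
mt = hl.read_matrix_table('s3a://my-bucket/input.mt')

# ... some chain of filtering / annotation steps; this filter is
# just a stand-in example, not the poster's actual pipeline ...
mt_filtered = mt.filter_rows(mt.locus.contig != 'chrY')

# checkpoint() writes the intermediate result to storage and reads
# it back, so the final write starts from materialized data instead
# of re-running a long chain of lazy transformations that can hold
# input streams open for the whole job.
mt_filtered = mt_filtered.checkpoint(
    's3a://my-bucket/tmp/intermediate.mt', overwrite=True)

mt_filtered.write('s3a://my-bucket/output.mt', overwrite=True)
```

Breaking the pipeline at a checkpoint also gives you a natural restart point if a later stage fails.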

You might also try passing stage_locally=True to the write. That causes Hail to write the output to a node-local file (e.g. in /tmp) and then copy the file to its final destination in one shot after the write is complete.