Hi @danking
Here is some additional information:
I run Hail v0.2.32 on an AWS EMR cluster created with CloudFormation. The recipe can be found there.
For this union task, I created a cluster of 1 MASTER + 2 CORE nodes using instance type r5.24xlarge, i.e. 96 CPUs and 768 GB of RAM per node.
I run the code through a Zeppelin notebook as follows:
# On AWS EMR cluster - 1 MASTER + 2 CORES - r5.24xlarge (96 CPUs - 768 GB RAM) - 500 GB disk space
# Import hail
import hail as hl
hl.init(sc)
# Running on Apache Spark version 2.4.4
# SparkUI available at http://ip-172-31-2-192.ap-southeast-1.compute.internal:4040
# Welcome to
#      __  __     <>__
#     / /_/ /__  __/ /
#    / __  / _ `/ / /
#   /_/ /_/\_,_/_/_/   version 0.2.32-a5876a0a2853
# LOGGING: writing to /mnt/var/lib/zeppelin/hail-20200226-0207-0.2.32-a5876a0a2853.log
# Load left mt
mt_left = hl.read_matrix_table('s3://<path>/n7337.vcf.mt')
# 7,337 cols x 157,563,374 rows in 29,504 partitions
# Load right mt
mt_right = hl.read_matrix_table('s3://<path>/n2986.vcf.mt')
# 2,986 cols x 94,911,949 rows in 15,502 partitions
# Union
mt_union = mt_left.union_cols(mt_right, 'outer')
# Write Union mt
mt_union.write('s3://<path>/n10323.vcf.mt', overwrite=True)
# Took 3 min 19 sec.
# Fail to execute line 8: mt_union.write("s3://npmchorus-gnomad/hailoutput/SG10K_Health_maxi.n10323.jc.VQSR-pass-only.split-multiallelic.vcf.mt", overwrite=True)
# Traceback (most recent call last):
# File "/tmp/zeppelin_pyspark-7521024825194216637.py", line 380, in <module>
# exec(code, _zcUserQueryNameSpace)
# File "<stdin>", line 8, in <module>
# File "</usr/local/lib/python3.6/site-packages/decorator.py:decorator-gen-1060>", line 2, in write
# File "/opt/hail/python/hail/typecheck/check.py", line 585, in wrapper
# return __original_func(*args_, **kwargs_)
# File "/opt/hail/python/hail/matrixtable.py", line 2522, in write
# Env.backend().execute(MatrixWrite(self._mir, writer))
# File "/opt/hail/python/hail/backend/backend.py", line 109, in execute
# result = json.loads(Env.hc()._jhc.backend().executeJSON(self._to_java_ir(ir)))
# File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
# answer, self.gateway_client, self.target_id, self.name)
# File "/opt/hail/python/hail/utils/java.py", line 225, in deco
# 'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
# hail.utils.java.FatalError: ConnectionPoolTimeoutException: Timeout waiting for connection from pool
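For reference, here is a rough sanity check of the dimensions I would expect from the outer union, using only the counts quoted in the notebook comments above. It is plain Python arithmetic, not a Hail call. The row bounds assume that row keys are unique within each dataset; the actual row count depends on how many keys the two datasets share, which I have not measured.

```python
# Counts copied from the notebook comments above.
left_cols, left_rows = 7_337, 157_563_374
right_cols, right_rows = 2_986, 94_911_949

# union_cols concatenates the samples, so the columns simply add up.
expected_cols = left_cols + right_cols

# With an 'outer' row join, a row key present in either input is kept.
# Assumption: row keys are unique within each dataset, so the result has
# at least as many rows as the larger input and at most the sum of both.
min_rows = max(left_rows, right_rows)
max_rows = left_rows + right_rows

print(f"expected columns: {expected_cols:,}")
print(f"expected rows: between {min_rows:,} and {max_rows:,}")
```

The column total matches the n10323 in the output file name.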
Here is the full Hail log:
hail-20200226-0207-0.2.32-a5876a0a2853.log (1.4 MB)
Thanks