Task failed while writing rows

Hi, I am exporting a MatrixTable to vcf.bgz on GCP Dataproc (2 normal + 15 secondary worker nodes). This code worked well several months ago, but now it gives me an error (same code, only the input changed).
The code I used:

import hail as hl
hl.init(default_reference='GRCh38')
mt = hl.read_matrix_table(mt_path)  # mt_path is the GCS path to the input MatrixTable
hl.summarize_variants(mt)
hl.export_vcf(mt, 'gs://path/out.vcf.bgz')  # write a block-gzipped VCF to GCS

The Hail and Spark versions I used and the error I got:

Running on Apache Spark version 3.1.2
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.89-38264124ad91
Hail: WARN: export_vcf: ignored the following fields:
    'variant_qc' (row)

Traceback (most recent call last):
  File "/tmp/23d31cfb27854652b0c9c60754140170/step3.1.4_14062022.py", line 8, in <module>
    hl.export_vcf(mt, 'gs://path/out.vcf.bgz')
  File "<decorator-gen-1330>", line 2, in export_vcf
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/methods/impex.py", line 551, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 110, in execute
    raise e
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 86, in execute
    result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 29, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: RemoteException: File /tmp/write-table-concatenated-UiZi2iY4xWo4Q2mFeDyxeJ/_temporary/0/_temporary/attempt_20220614081406830712610524875364_0011_m_010536_19/part-10536.bgz could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2278)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2808)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:905)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:577)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
...

There are more error messages; they basically repeat this kind of error:

org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:162)
        at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

org.apache.hadoop.ipc.RemoteException: File /tmp/write-table-concatenated-UiZi2iY4xWo4Q2mFeDyxeJ/_temporary/0/_temporary/attempt_20220614081406830712610524875364_0011_m_010536_19/part-10536.bgz could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.

Any idea what caused this and how I can solve it? Do I need to upgrade to a newer version of Hail to fix it? Thanks a lot for your time; any help is welcome.

There’s not enough temporary space on HDFS to write this VCF. The disks that make up HDFS are provided by the normal worker nodes (not the secondary workers). You can add more disk by adding more normal nodes, or by attaching local SSDs to them with --num-worker-local-ssds=1. The best strategy is probably to do both: use ~5 normal nodes and add one SSD to each, for example as in the sketch below.
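
For reference, a minimal sketch of a cluster shaped like that suggestion, assuming clusters are created directly with gcloud (the cluster name and region below are placeholders, and installing Hail itself, e.g. via hailctl dataproc start or an initialization action, is omitted):

# Illustrative only: 5 primary workers (which host HDFS), one local SSD each,
# and 15 secondary workers for extra compute. Name and region are placeholders.
gcloud dataproc clusters create my-hail-cluster \
    --region=us-central1 \
    --num-workers=5 \
    --num-worker-local-ssds=1 \
    --num-secondary-workers=15

Since only the primary workers contribute HDFS capacity, both raising their count and attaching local SSDs increase the scratch space available for the temporary concatenation files written during export.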

Hi @tpoterba, thanks a lot. It works!