Task failed while writing rows

Hi, I am exporting a MatrixTable to vcf.bgz on GCP Dataproc (2 normal + 15 secondary worker nodes). This code worked well several months ago, but now it gives me an error (same code, only the input changed).
The code I used:

import hail as hl
hl.init(default_reference='GRCh38')
mt = hl.read_matrix_table(mt_path)  # mt_path is the GCS path to the input MatrixTable
hl.summarize_variants(mt)
hl.export_vcf(mt, 'gs://path/out.vcf.bgz')  # write a block-gzipped VCF to GCS

The Hail and Spark versions I used and the error I got:

Running on Apache Spark version 3.1.2
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.89-38264124ad91
Hail: WARN: export_vcf: ignored the following fields:
    'variant_qc' (row)

Traceback (most recent call last):
  File "/tmp/23d31cfb27854652b0c9c60754140170/step3.1.4_14062022.py", line 8, in <module>
    hl.export_vcf(mt, 'gs://path/out.vcf.bgz')
  File "<decorator-gen-1330>", line 2, in export_vcf
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/methods/impex.py", line 551, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 110, in execute
    raise e
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 86, in execute
    result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 29, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: RemoteException: File /tmp/write-table-concatenated-UiZi2iY4xWo4Q2mFeDyxeJ/_temporary/0/_temporary/attempt_20220614081406830712610524875364_0011_m_010536_19/part-10536.bgz could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2278)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2808)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:905)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:577)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
...

There are more error messages; they basically repeat this kind of error:

org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:162)
        at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

org.apache.hadoop.ipc.RemoteException: File /tmp/write-table-concatenated-UiZi2iY4xWo4Q2mFeDyxeJ/_temporary/0/_temporary/attempt_20220614081406830712610524875364_0011_m_010536_19/part-10536.bgz could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.

Any idea what caused this and how I can solve it? Do I need to upgrade to a newer version of Hail to fix it? Thanks a lot for your time; any help is welcome.

There’s not enough temporary space on HDFS to write this VCF. The disks that make up HDFS are provided by the normal worker nodes (not the secondary workers). You can add more disk by adding more normal nodes, or by attaching local SSDs to them with --num-worker-local-ssds=1. The best strategy is probably to do both: use ~5 normal nodes and add one SSD to each, for example as in the sketch below.
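
For reference, a minimal sketch of a cluster shaped like that suggestion, assuming clusters are created directly with gcloud (the cluster name and region below are placeholders, and installing Hail itself, e.g. via hailctl dataproc start or an initialization action, is omitted):

# Illustrative only: 5 primary workers (which host HDFS), one local SSD each,
# and 15 secondary workers for extra compute. Name and region are placeholders.
gcloud dataproc clusters create my-hail-cluster \
    --region=us-central1 \
    --num-workers=5 \
    --num-worker-local-ssds=1 \
    --num-secondary-workers=15

Since only the primary workers contribute HDFS capacity, both raising their count and attaching local SSDs increase the scratch space available for the temporary concatenation files written during export.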

Hi @tpoterba, thanks a lot. It works!