hail.utils.java.FatalError: RemoteException while running create_last_END_positions

Hi Hail team,

I have run into another error. The log is too large to attach, but I can Slack it to you if that is OK.

I am trying to run https://github.com/broadinstitute/gnomad_qc/blob/update_v3_resources/gnomad_qc/v3/load_data/create_last_END_positions.py

     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.57-582b2e31b8bd
LOGGING: writing to /create_last_END_positions.log
[Stage 0:================================================(115376 + -2) / 115376]2020-09-17 02:05:22 Hail: INFO: copying log to 'gs://gnomad-julia/gnomad_v3_1/logs'...
Traceback (most recent call last):
  File "/tmp/441e237cb9f94ac38cbdf9adb8d8f1d8/create_last_END_positions.py", line 40, in <module>
    t.write(last_END_position().path, overwrite=True)
  File "<decorator-gen-1095>", line 2, in write
  File "/opt/conda/default/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.6/site-packages/hail/table.py", line 1260, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 297, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 42, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: RemoteException: File /tmp/table-map-rows-scan-aggs-part-C6c6GuxitOQsr6MaLlDAAu could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2569)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)


Java stack trace:
org.apache.hadoop.ipc.RemoteException: File /tmp/table-map-rows-scan-aggs-part-C6c6GuxitOQsr6MaLlDAAu could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1510)
	at org.apache.hadoop.ipc.Client.call(Client.java:1456)
	at org.apache.hadoop.ipc.Client.call(Client.java:1366)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
	at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1845)
	at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1645)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:710)



Hail version: 0.2.57-582b2e31b8bd
Error summary: RemoteException: File /tmp/table-map-rows-scan-aggs-part-C6c6GuxitOQsr6MaLlDAAu could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [441e237cb9f94ac38cbdf9adb8d8f1d8] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/441e237cb9f94ac38cbdf9adb8d8f1d8?project=broad-mpg-gnomad&region=us-central1
gcloud dataproc jobs wait '441e237cb9f94ac38cbdf9adb8d8f1d8' --region 'us-central1' --project 'broad-mpg-gnomad'
https://console.cloud.google.com/storage/browser/dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/fb3a788e-d0f0-43af-a18e-8d5a03bb28ff/jobs/441e237cb9f94ac38cbdf9adb8d8f1d8/
gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/fb3a788e-d0f0-43af-a18e-8d5a03bb28ff/jobs/441e237cb9f94ac38cbdf9adb8d8f1d8/driveroutput
Traceback (most recent call last):
  File "/usr/local/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
    jmp[args.module].main(args, pass_through_args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    gcloud.run(cmd)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'PycharmProjects/gnomad_qc/gnomad_qc/v3/load_data/create_last_END_positions.py', '--cluster=jg3', '--files=', '--py-files=/var/folders/sj/6gr3x1553r5f7tkzsjy99mgs64sz0g/T/pyscripts_a51bgm_u.zip', '--properties=', '--', '--overwrite']' returned non-zero exit status 1.

Any suggestions would be greatly appreciated. Thank you!

This error indicates that HDFS is out of space. What is your cluster configuration here?
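
If you want to confirm, you can SSH to the master node and run the standard HDFS capacity report (a generic HDFS admin command, nothing Hail-specific); it prints configured capacity, space used, and space remaining per datanode:

hdfs dfsadmin -report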

Densify (and scans in general) leans heavily on HDFS for scratch space. The way to increase it is to provision more HDFS storage, which on Dataproc is determined by the spare disk space on primary (non-preemptible) workers.
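
For a sense of where the table-map-rows-scan-aggs-part-* files in your error come from, here is a minimal sketch (hypothetical table path and END row field, not the actual gnomad_qc code):

import hail as hl

# hl.scan.* aggregators carry running state across rows; Hail stages that
# per-partition state in cluster scratch space (HDFS on Dataproc), which
# is what fills up here.
t = hl.read_table('gs://my-bucket/example.ht')  # placeholder path
t = t.annotate(last_END=hl.scan.max(t.END))     # running max over prior rows
t.write('gs://my-bucket/example_with_last_END.ht', overwrite=True)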

The relevant hailctl option is --worker-boot-disk-size. For densify workloads I would recommend a few TB of HDFS storage across the whole cluster.
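
As a rough sizing example (raw boot disk only; usable HDFS capacity is somewhat less after OS and Spark overhead), your error shows 2 datanodes, i.e. 2 primary workers:

2 workers x 100 GB boot disk = ~0.2 TB raw, far short of a few TB
2 workers x 1000 GB boot disk = ~2 TB raw, in the recommended range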

This is the command I used to start the cluster:

hailctl dataproc --beta start --autoscaling-policy=gnomad_v3_1_autoscale jg3 --master-boot-disk-size 800 --worker-boot-disk-size 100 --master-machine-type n1-highmem-16 --worker-machine-type n1-highmem-8 --init gs://gnomad-public/tools/inits/master-init.sh

That’s not enough worker boot disk. With the relevant autoscaling policy, I would try --worker-boot-disk-size 1000 for this workload.
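
That is, the same start command as before with only the worker boot disk changed:

hailctl dataproc --beta start --autoscaling-policy=gnomad_v3_1_autoscale jg3 --master-boot-disk-size 800 --worker-boot-disk-size 1000 --master-machine-type n1-highmem-16 --worker-machine-type n1-highmem-8 --init gs://gnomad-public/tools/inits/master-init.sh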

Thank you so much, Chris! Is the rest of the configuration OK?

Nothing else there causes alarm.

Thank you, I will give it a try.