hail.utils.java.FatalError: RemoteException while running create_last_END_positions

Hi Hail team,

I have run into another error. The log is too large to attach, but I can Slack it to you if that is OK.

I am trying to run https://github.com/broadinstitute/gnomad_qc/blob/update_v3_resources/gnomad_qc/v3/load_data/create_last_END_positions.py

     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.57-582b2e31b8bd
LOGGING: writing to /create_last_END_positions.log
[Stage 0:================================================(115376 + -2) / 115376]2020-09-17 02:05:22 Hail: INFO: copying log to 'gs://gnomad-julia/gnomad_v3_1/logs'...
Traceback (most recent call last):
  File "/tmp/441e237cb9f94ac38cbdf9adb8d8f1d8/create_last_END_positions.py", line 40, in <module>
    t.write(last_END_position().path, overwrite=True)
  File "<decorator-gen-1095>", line 2, in write
  File "/opt/conda/default/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.6/site-packages/hail/table.py", line 1260, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 297, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 42, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: RemoteException: File /tmp/table-map-rows-scan-aggs-part-C6c6GuxitOQsr6MaLlDAAu could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2569)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)


Java stack trace:
org.apache.hadoop.ipc.RemoteException: File /tmp/table-map-rows-scan-aggs-part-C6c6GuxitOQsr6MaLlDAAu could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1510)
	at org.apache.hadoop.ipc.Client.call(Client.java:1456)
	at org.apache.hadoop.ipc.Client.call(Client.java:1366)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
	at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1845)
	at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1645)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:710)



Hail version: 0.2.57-582b2e31b8bd
Error summary: RemoteException: File /tmp/table-map-rows-scan-aggs-part-C6c6GuxitOQsr6MaLlDAAu could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [441e237cb9f94ac38cbdf9adb8d8f1d8] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/441e237cb9f94ac38cbdf9adb8d8f1d8?project=broad-mpg-gnomad&region=us-central1
gcloud dataproc jobs wait '441e237cb9f94ac38cbdf9adb8d8f1d8' --region 'us-central1' --project 'broad-mpg-gnomad'
https://console.cloud.google.com/storage/browser/dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/fb3a788e-d0f0-43af-a18e-8d5a03bb28ff/jobs/441e237cb9f94ac38cbdf9adb8d8f1d8/
gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/fb3a788e-d0f0-43af-a18e-8d5a03bb28ff/jobs/441e237cb9f94ac38cbdf9adb8d8f1d8/driveroutput
Traceback (most recent call last):
  File "/usr/local/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
    jmp[args.module].main(args, pass_through_args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    gcloud.run(cmd)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'PycharmProjects/gnomad_qc/gnomad_qc/v3/load_data/create_last_END_positions.py', '--cluster=jg3', '--files=', '--py-files=/var/folders/sj/6gr3x1553r5f7tkzsjy99mgs64sz0g/T/pyscripts_a51bgm_u.zip', '--properties=', '--', '--overwrite']' returned non-zero exit status 1.

Any suggestions would be greatly appreciated. Thank you!

This error indicates that HDFS is out of space. What is your cluster configuration here?
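
If you want to confirm, you can SSH to the master node and run the standard HDFS capacity report (a generic HDFS admin command, nothing Hail-specific); it prints configured capacity, space used, and space remaining per datanode:

hdfs dfsadmin -report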

Densify (and scans in general) leans heavily on HDFS for scratch space. The way to increase it is to provision more HDFS storage, which on Dataproc is determined by the spare disk space on primary (non-preemptible) workers.
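
For a sense of where the table-map-rows-scan-aggs-part-* files in your error come from, here is a minimal sketch (hypothetical table path and END row field, not the actual gnomad_qc code):

import hail as hl

# hl.scan.* aggregators carry running state across rows; Hail stages that
# per-partition state in cluster scratch space (HDFS on Dataproc), which
# is what fills up here.
t = hl.read_table('gs://my-bucket/example.ht')  # placeholder path
t = t.annotate(last_END=hl.scan.max(t.END))     # running max over prior rows
t.write('gs://my-bucket/example_with_last_END.ht', overwrite=True)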

The relevant hailctl option is --worker-boot-disk-size. For densify workloads I would recommend a few TB of HDFS storage across the whole cluster.
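
As a rough sizing example (raw boot disk only; usable HDFS capacity is somewhat less after OS and Spark overhead), your error shows 2 datanodes, i.e. 2 primary workers:

2 workers x 100 GB boot disk = ~0.2 TB raw, far short of a few TB
2 workers x 1000 GB boot disk = ~2 TB raw, in the recommended range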

This is the command I used to start the cluster:

hailctl dataproc --beta start --autoscaling-policy=gnomad_v3_1_autoscale jg3 --master-boot-disk-size 800 --worker-boot-disk-size 100 --master-machine-type n1-highmem-16 --worker-machine-type n1-highmem-8 --init gs://gnomad-public/tools/inits/master-init.sh

That’s not enough worker boot disk. With the relevant autoscaling policy, I would try --worker-boot-disk-size 1000 for this workload.
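
That is, the same start command as before with only the worker boot disk changed:

hailctl dataproc --beta start --autoscaling-policy=gnomad_v3_1_autoscale jg3 --master-boot-disk-size 800 --worker-boot-disk-size 1000 --master-machine-type n1-highmem-16 --worker-machine-type n1-highmem-8 --init gs://gnomad-public/tools/inits/master-init.sh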

Thank you so much, Chris! Is the rest of the configuration OK?

Nothing else there causes alarm.

Thank you, I will give it a try.