Vds.write error


I got the following error trying to write out the vds. The error summary is at the end. Could you suggest how to fix it? Many thanks!


Errors <<<<<
Hail version: 0.1-38882df
Error summary: RemoteException: File all.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170921152120_0039_m_000087_3/part-00087-ddf85ff9-63af-45f8-8597-52f73dbd7dfc.snappy.parquet could only be replicated to 0 nodes instead of minReplication (=1). There are 9 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1622)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3325)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:679)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:214)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:489)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

Looks like you’re writing to an HDFS system that’s full or down. What’s your cluster setup look like?

Thanks very much for your quick reply. We are using a Cloudera cluster here. I will pass this on to our IT team so they can look into it.

To fix the issue for now, you can write to NFS instead of HDFS. The default file scheme is HDFS, so you’ll need to prefix the path with file:// to write to NFS. This’ll look something like file:///path/to/vds...

I tried that but got an IOException about a Mkdirs failure. I will ask our IT about it. Thanks very much again!

That’s probably a permission issue or a mistyped path; you should be able to write to NFS. Make sure you have three forward slashes after file: (that is, file:/// followed by the rest of the absolute path).
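As a quick illustration of the three-slash form: the URI is the `file://` scheme prefix followed by an absolute path that itself begins with `/`. Below is a minimal Python sketch. The helper name `to_file_uri` and the path `/path/to/all.vds` are placeholders for illustration, not anything taken from this cluster.

```python
from urllib.parse import quote


def to_file_uri(path):
    """Build a file:// URI from an absolute POSIX path (hypothetical helper)."""
    if not path.startswith("/"):
        # A relative path would yield file://something, which is parsed
        # as a host name followed by a path, not as a local path.
        raise ValueError("expected an absolute path, got %r" % path)
    # "file://" plus the leading "/" of the path gives three slashes in total.
    return "file://" + quote(path)


uri = to_file_uri("/path/to/all.vds")  # placeholder path
print(uri)
```

The resulting string is what would be passed as the output path to `vds.write`, instead of a bare path that defaults to HDFS.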

I tried with two forward slashes plus the full path, which itself starts with a "/". I’ll have our IT look into it. Thx!

Got it, best of luck!

Apparently the local NFS on our data nodes was allocated too little space. Our HDFS is working, and hdfs dfs -df reports that only 5% of a total of 390TB is used. I had no problem writing the vds out for chr20 (18GB vcf.bgz input), but the current input is 931GB (~3900 WGS samples). When you mentioned HDFS being full, were you referring to the whole HDFS or to the HDFS path I’m trying to write to? Our current Spark cluster setup is 1 gateway node, 3 management nodes, and 9 data nodes, with 32 CPUs, 177GB of memory, and 44TB of HDFS storage per node. Do you think this is sufficient for the Hail task I’m doing?


Just to add: when I started pyspark, I followed the tutorial for Cloudera clusters:
pyspark2 --jars build/libs/hail-all-spark.jar \
  --py-files build/distributions/hail-python.zip \
  --conf spark.sql.files.openCostInBytes=1099511627776 \
  --conf spark.sql.files.maxPartitionBytes=1099511627776 \
  --conf spark.hadoop.parquet.block.size=1099511627776

Do I need to adjust parquet.block.size, or should I just leave it out?

I’m not entirely sure. It’s certainly safe to leave it in. I think Hail will error out at the construction of a HailContext if the Spark Context isn’t properly configured.

If there was HDFS space left, then the “could not be replicated to min number of data nodes” issue could be something else. Hmm…

Just to report back in case it’s useful for others.
I tested leaving the spark.hadoop.parquet.block.size option out when starting Hail, and it worked. According to our IT, the default parquet.block.size is set to 128MB. The vds was written successfully and tested to be valid.

A minor correction: our default parquet.block.size turns out to be 1G (1073741824 bytes, precisely), not 128MB. It works fine with the default.

Interesting. In this case I think the parquet.block.size parameter may be ignored or overridden. We need to read each Parquet file as one Spark partition because of the on-disk ordering system we’ve built, so we use other config options to ensure that Parquet files are never split.
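To see why very large values for the two `spark.sql.files.*` options would keep each Parquet file in its own partition, here is a simplified Python model of how Spark 2.x is generally understood to size and pack file splits. It is a sketch under stated assumptions, not Spark's or Hail's actual code: the function names, the three file sizes, and the parallelism of 288 (9 data nodes × 32 CPUs) are all illustrative.

```python
TIB = 1 << 40  # 1099511627776, the value used on the pyspark2 command line
MIB = 1 << 20


def max_split_bytes(file_sizes, max_partition_bytes, open_cost, parallelism):
    """Upper bound on the size of one split, in bytes (simplified model)."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def plan_partitions(file_sizes, max_partition_bytes, open_cost, parallelism):
    """Return partitions as lists of (file_index, offset, length) chunks."""
    limit = max_split_bytes(file_sizes, max_partition_bytes, open_cost, parallelism)
    chunks = []
    for index, size in enumerate(file_sizes):
        offset = 0
        while offset < size:  # cut each file into chunks of at most `limit`
            length = min(limit, size - offset)
            chunks.append((index, offset, length))
            offset += length
    chunks.sort(key=lambda chunk: chunk[2], reverse=True)  # largest first
    partitions, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk[2] > limit:
            partitions.append(current)  # close the partition being filled
            current, current_size = [], 0
        current_size += chunk[2] + open_cost  # every chunk also pays the open cost
        current.append(chunk)
    if current:
        partitions.append(current)
    return partitions


files = [300 * MIB, 900 * MIB, 2048 * MIB]  # hypothetical Parquet file sizes
hail_plan = plan_partitions(files, TIB, TIB, parallelism=288)
default_plan = plan_partitions(files, 128 * MIB, 4 * MIB, parallelism=288)
print(len(hail_plan), len(default_plan))
```

In this model, setting both options to 1 TiB makes the split limit 1 TiB, so no file below that size is cut, and the 1 TiB open cost stops a second file from being packed into the same partition. With Spark's stock values (128 MiB and 4 MiB) the same files are cut into many smaller chunks.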

This should get a bit simpler in the next stable version!

That’s what I thought, too. Just out of curiosity, what Spark partition size does Hail require? Is it 1G (1073741824)? The parquet.block.size parameter in the Hail tutorial is set to 1TB. I’m wondering whether, when I tried to write the vds out, Hail somehow pre-calculated the needed HDFS storage as #partitions * block size * replication factor, which produced the not-enough-space error. Could that be the case?
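The arithmetic behind that hypothesis can be written out as follows. This only evaluates the proposed formula; it does not establish that HDFS or Hail reserves space this way. The replication factor of 3 is the HDFS default and may differ on this cluster, the 390TB total is read as decimal terabytes, and the function names are made up for the sketch.

```python
TIB = 1 << 40  # 1099511627776 bytes, the tutorial's parquet.block.size
GIB = 1 << 30  # 1073741824 bytes, the cluster default reported above
CAPACITY = 390 * 10**12  # "390TB" total HDFS, assumed to be decimal terabytes


def hypothetical_reservation(partitions, block_size, replication=3):
    """Bytes needed if one full block per partition were reserved (hypothesis only)."""
    return partitions * block_size * replication


def max_partitions(capacity, block_size, replication=3):
    """Largest partition count whose hypothetical reservation fits in capacity."""
    return capacity // (block_size * replication)


for label, block_size in (("1 TiB blocks", TIB), ("1 GiB blocks", GIB)):
    limit = max_partitions(CAPACITY, block_size)
    print(label, "-> partitions that would fit:", limit)
```

If the hypothesis held, the number of partitions that fit would shrink by a factor of 1024 when the block size goes from 1 GiB to 1 TiB, which is the effect the question is asking about.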