Densify on 0.2.49

Hi Hail team,

I need to re-run a densify script (300K samples; entries: “GT”, “GQ”, “DP”, “adj”, “END”, “AD”) and ran into this error:

hail.utils.java.FatalError: RemoteException: Cannot create file/tmp/table-map-rows-scan-aggs-part-EvP3J35BxG2Ex3gb6iRitu. Name node is in safe mode.
Resources are low on NN. Please add or free up more resourcesthen turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:kc-m
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1413)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1400)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2284)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2230)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:745)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)


Java stack trace:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file/tmp/table-map-rows-scan-aggs-part-EvP3J35BxG2Ex3gb6iRitu. Name node is in safe mode.
Resources are low on NN. Please add or free up more resourcesthen turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:kc-m

Any ideas what this means? I’ve switched back to 0.2.40 and restarted the job for now.

I’ll send the script, log, and diagnostic tar file (generated after the job failed) via email. Thanks for all the help with my densify issues!

What was your cluster config? Maybe HDFS (which uses disks only on non-preemptible workers) is full.
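If the cluster is still up, you can check directly by SSHing to the master (the error names it kc-m) and asking HDFS for its usage and safe-mode status, e.g.:

gcloud compute ssh kc-m --project maclab-ukbb
hdfs dfsadmin -report | head -n 20   # configured capacity, DFS used, DFS remaining
hdfs dfsadmin -safemode get          # whether the NameNode is still in safe mode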

I used this:

hailctl dataproc start kc --autoscaling-policy=autoscale_densify --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8  --init gs://gnomad-public/tools/inits/master-init.sh --master-boot-disk-size 600  --preemptible-worker-boot-disk-size 600 --worker-boot-disk-size 600  --project maclab-ukbb --max-idle=45m --properties=spark:spark.executor-memory=35g,spark:spark.speculation=true,spark:spark.speculation.quantile=0.9,spark:spark.speculation.multiplier=3 --packages gnomad

autoscale_densify scales up to 1000 preemptibles.

Should I use a bigger cluster? This is the configuration I used back in June on v0.2.40.

I’d just kick up the number of non-preemptibles to 10 or so. If that doesn’t work, we can also increase their disk size. I think what’s going on here is that the densify scan intermediate is stored on HDFS, and that’s getting full and killing the job. Increasing non-preemptibles from 2 to 10 will increase HDFS space by 5x.
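The quickest path is probably to pass --num-workers 10 when you start the cluster. You may also be able to resize an existing cluster in place with gcloud, though I'm not certain Dataproc will allow a manual resize while an autoscaling policy is attached (and you may need --region if your gcloud config doesn't set a default):

gcloud dataproc clusters update kc --num-workers 10 --project maclab-ukbb

Back-of-envelope: with your 600 GB worker boot disks, 2 non-preemptibles gives roughly 1.2 TB of raw disk backing HDFS, and 10 gives roughly 6 TB, before OS overhead and HDFS replication.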

Thanks! Do you think it’s better to restart the job with 10 workers on 0.2.49 or keep the 0.2.40 job running? My 0.2.40 job has only been running for about an hour.

That’s a tough one. I’m not totally sure; you probably have a better sense of the practical operations than I do.

I’ll probably keep the 0.2.40 job running, but I’ll try adding some workers to a 0.2.49 cluster next time, since I should be running another densify relatively soon. Thanks for the super fast responses!

I actually got the same error on both my 0.2.40 job and a new job with 10 workers on 0.2.49. This is the cluster I used on 0.2.49:

hailctl dataproc start kc --autoscaling-policy=autoscale_densify --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8  --init gs://gnomad-public/tools/inits/master-init.sh --master-boot-disk-size 600  --preemptible-worker-boot-disk-size 600 --worker-boot-disk-size 600  --project maclab-ukbb --max-idle=45m --properties=spark:spark.executor-memory=35g,spark:spark.speculation=true,spark:spark.speculation.quantile=0.9,spark:spark.speculation.multiplier=3 --packages gnomad --num-workers 10

Should I try adding more workers? Or more disk space?

Either should be fine. For reference, how big is the matrix table?
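(If it's handy, gsutil can report the on-disk size directly; the path below is just a placeholder for wherever your MT lives:)

gsutil du -sh gs://your-bucket/path/to/your.mt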

Also, could you send me a log file? I’m interested in how large the task results are.

Will do! The MT is 7.56 TiB.