Densify on 0.2.49

Hi Hail team,

I need to re-run a densify script (300K samples; entries: “GT”, “GQ”, “DP”, “adj”, “END”, “AD”) and ran into this error:

hail.utils.java.FatalError: RemoteException: Cannot create file/tmp/table-map-rows-scan-aggs-part-EvP3J35BxG2Ex3gb6iRitu. Name node is in safe mode.
Resources are low on NN. Please add or free up more resourcesthen turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:kc-m
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1413)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1400)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2284)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2230)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:745)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)


Java stack trace:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file/tmp/table-map-rows-scan-aggs-part-EvP3J35BxG2Ex3gb6iRitu. Name node is in safe mode.
Resources are low on NN. Please add or free up more resourcesthen turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:kc-m

Any ideas what this means? I’ve switched back to 0.2.40 and restarted the job for now.

I’ll send the script, log, and diagnostic tar file (generated after the job failed) via email. Thanks for all the help with my densify issues!

What was your cluster config? Maybe HDFS (which uses disks only on non-preemptible workers) is full.
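If the cluster is still up, you can check directly by SSHing to the master (the error names it kc-m) and asking HDFS for its usage and safe-mode status, e.g.:

gcloud compute ssh kc-m --project maclab-ukbb
hdfs dfsadmin -report | head -n 20   # configured capacity, DFS used, DFS remaining
hdfs dfsadmin -safemode get          # whether the NameNode is still in safe mode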

I used this:

hailctl dataproc start kc --autoscaling-policy=autoscale_densify --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8  --init gs://gnomad-public/tools/inits/master-init.sh --master-boot-disk-size 600  --preemptible-worker-boot-disk-size 600 --worker-boot-disk-size 600  --project maclab-ukbb --max-idle=45m --properties=spark:spark.executor-memory=35g,spark:spark.speculation=true,spark:spark.speculation.quantile=0.9,spark:spark.speculation.multiplier=3 --packages gnomad

autoscale_densify scales up to 1000 preemptibles.

Should I use a bigger cluster? This is the configuration I used back in June on v0.2.40.

I’d just kick up the number of non-preemptibles to 10 or so. If that doesn’t work, we can also increase their disk size. I think what’s going on here is that the densify scan intermediate is stored on HDFS, and that’s getting full and killing the job. Increasing non-preemptibles from 2 to 10 will increase HDFS space by 5x.
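The quickest path is probably to pass --num-workers 10 when you start the cluster. You may also be able to resize an existing cluster in place with gcloud, though I'm not certain Dataproc will allow a manual resize while an autoscaling policy is attached (and you may need --region if your gcloud config doesn't set a default):

gcloud dataproc clusters update kc --num-workers 10 --project maclab-ukbb

Back-of-envelope: with your 600 GB worker boot disks, 2 non-preemptibles gives roughly 1.2 TB of raw disk backing HDFS, and 10 gives roughly 6 TB, before OS overhead and HDFS replication.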

Thanks! Do you think it’s better to restart the job with 10 workers on 0.2.49 or keep the 0.2.40 job running? My 0.2.40 job has only been running for about an hour.

That’s a tough one. I’m not totally sure; you probably have a better sense of the practical operations than I do.

I’ll probably keep the 0.2.40 job running, but I’ll try adding some workers to a 0.2.49 cluster next time, since I should be running another densify relatively soon. Thanks for the super fast responses!

I actually got the same error on both my 0.2.40 job and a new job with 10 workers on 0.2.49. This is the cluster I used on 0.2.49:

hailctl dataproc start kc --autoscaling-policy=autoscale_densify --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8  --init gs://gnomad-public/tools/inits/master-init.sh --master-boot-disk-size 600  --preemptible-worker-boot-disk-size 600 --worker-boot-disk-size 600  --project maclab-ukbb --max-idle=45m --properties=spark:spark.executor-memory=35g,spark:spark.speculation=true,spark:spark.speculation.quantile=0.9,spark:spark.speculation.multiplier=3 --packages gnomad --num-workers 10

Should I try adding more workers? Or more disk space?

Either should be fine. For reference, how big is the matrix table?
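(If it's handy, gsutil can report the on-disk size directly; the path below is just a placeholder for wherever your MT lives:)

gsutil du -sh gs://your-bucket/path/to/your.mt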

Also, could you send me a log file? I’m interested in how large the task results are.

Will do! The MT is 7.56 TiB.