Resources not allocated

Have you seen the error “Container exited with a non-zero exit code 127”?

pyspark2 works fine on its own, but once Hail is loaded, YARN can’t schedule resources for the executor containers after the HailContext (hc) is created.

Hail doesn’t do anything special with the SparkContext that is passed into it. I think something must be wrong with the Spark configuration upstream.
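
One quick way to check (a sketch, nothing Hail-specific): in the same pyspark2 session, launched with the same --conf flags, run a trivial Spark action before constructing the HailContext. If executor containers also fail to launch for this, Hail isn’t involved at all:

# A trivial action that needs executors but no Hail classes.
# If YARN containers also exit with code 127 here, the Spark/YARN
# configuration is at fault rather than Hail.
sc.parallelize(range(1000)).map(lambda x: x * 2).sum()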

I would agree with you if Spark were not running fine by itself. The cluster is Cloudera CDH 5.12, with YARN managing Spark, running on Azure, if any of that tells you anything.

Can you give us more information, like a full stack trace?

Input:
pyspark2 --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
  --py-files $HAIL_HOME/python/hail-python.zip \
  --conf spark.executorEnv.JAVA_HOME=/opt/jdk1.8.0.131 \
  --conf spark.executor.extraClassPath=/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2 \
  --conf spark.driver.extraClassPath=/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2 \
  --conf spark.hadoop.io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,org.apache.hadoop.io.compress.GzipCodec \
  --conf spark.sql.files.openCostInBytes=1099511627776 \
  --conf spark.sql.files.maxPartitionBytes=1099511627776 \
  --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=1099511627776 \
  --conf spark.hadoop.parquet.block.size=1099511627776

from hail import *
hc = HailContext(sc)
hc.import_vcf('/user/kmlong/vcf/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf').write('/user/kmlong/vds/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vds')
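
A sketch for isolating the pieces, assuming the same environment: run the identical import with --master local[2] so that no YARN containers are launched at all. If the local run succeeds, the failure is in launching executors on the data nodes rather than in Hail or the VCF.

pyspark2 --master local[2] \
  --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
  --py-files $HAIL_HOME/python/hail-python.zip \
  --conf spark.hadoop.io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,org.apache.hadoop.io.compress.GzipCodec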

hail.log Output:

[kmlong@lsa12-dn0 ~]$ cat hail.log
2017-07-20 08:44:59 INFO Hail:15 - SparkUI: http://10.0.0.9:4040
2017-07-20 08:44:59 INFO Hail:180 - Spark properties: spark.executorEnv.JAVA_HOME=/opt/jdk1.8.0.131, spark.app.id=application_1500479444467_0007, spark.eventLog.enabled=true, spark.driver.port=55100, spark.yarn.secondary.jars=hail-all-spark.jar, spark.executorEnv.PYTHONPATH=/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip:/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/:/usr/local/bin/python2.7:/opt/hail/python/hail-python.zip/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip{{PWD}}/hail-python.zip, spark.yarn.dist.jars=file:/opt/hail/build/libs/hail-all-spark.jar, spark.ui.killEnabled=true, spark.dynamicAllocation.executorIdleTimeout=60, spark.sql.files.openCostInBytes=1099511627776, spark.serializer=org.apache.spark.serializer.KryoSerializer, spark.authenticate=false, spark.hadoop.mapreduce.input.fileinputformat.split.minsize=1099511627776, spark.driver.appUIAddress=http://10.0.0.9:4040, spark.yarn.am.extraLibraryPath=/log/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop/lib/native, spark.sql.hive.metastore.jars=${env:HADOOP_COMMON_HOME}/…/hive/lib/:${env:HADOOP_COMMON_HOME}/client/, spark.serializer.objectStreamReset=100, spark.submit.deployMode=client, spark.ui.filters=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=http://lsa12-mn0.eastus.cloudapp.azure.com:8088/proxy/application_1500479444467_0007, spark.driver.host=10.0.0.9, spark.sql.files.maxPartitionBytes=1099511627776, spark.yarn.historyServer.address=http://lsa12-mn0.eastus.cloudapp.azure.com:18089, spark.shuffle.service.enabled=true, spark.executor.extraClassPath=/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2, spark.driver.extraLibraryPath=/log/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop/lib/native, spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=lsa12-mn0.eastus.cloudapp.azure.com, spark.executor.id=driver, spark.dynamicAllocation.schedulerBacklogTimeout=1, spark.eventLog.dir=hdfs://lsa12-mn0.eastus.cloudapp.azure.com:8020/user/spark/spark2ApplicationHistory, spark.shuffle.service.port=7337, spark.sql.hive.metastore.version=1.1.0, spark.app.name=PySparkShell, spark.hadoop.yarn.application.classpath=, spark.driver.extraClassPath=/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2, spark.master=yarn, spark.executor.extraLibraryPath=/log/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop/lib/native, spark.hadoop.parquet.block.size=1099511627776, spark.sql.warehouse.dir=/user/hive/warehouse, spark.sql.catalogImplementation=hive, spark.hadoop.io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,org.apache.hadoop.io.compress.GzipCodec, spark.yarn.config.gatewayPath=/log/cloudera/parcels, spark.rdd.compress=True, spark.hadoop.mapreduce.application.classpath=, spark.submit.pyFiles=/opt/hail/python/hail-python.zip, spark.dynamicAllocation.minExecutors=0, spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/…/…/…, spark.yarn.jars=local:/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/jars/*, spark.dynamicAllocation.enabled=true, spark.yarn.isPython=true
2017-07-20 08:44:59 INFO Hail:195 - Welcome to
  __  __     <>__
 / /_/ /__  __/ /
/ __  / _ `/ / /
/_/ /_/\_,_/_/_/   version 0.1-5661c35
2017-07-20 08:45:10 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1176.0 B, free 366.3 MB)
2017-07-20 08:45:10 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 386.0 B, free 366.3 MB)
2017-07-20 08:45:10 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 10.0.0.9:42084 (size: 386.0 B, free: 366.3 MB)
2017-07-20 08:45:10 INFO SparkContext:54 - Created broadcast 0 from broadcast at LoadVCF.scala:263
2017-07-20 08:45:10 INFO MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 366.3 MB)
2017-07-20 08:45:10 INFO MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 68.0 B, free 366.3 MB)
2017-07-20 08:45:10 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on 10.0.0.9:42084 (size: 68.0 B, free: 366.3 MB)
2017-07-20 08:45:10 INFO SparkContext:54 - Created broadcast 1 from broadcast at LoadVCF.scala:264
2017-07-20 08:45:10 INFO MemoryStore:54 - Block broadcast_2 stored as values in memory (estimated size 41.2 KB, free 366.3 MB)
2017-07-20 08:45:10 INFO MemoryStore:54 - Block broadcast_2_piece0 stored as bytes in memory (estimated size 11.2 KB, free 366.2 MB)
2017-07-20 08:45:10 INFO BlockManagerInfo:54 - Added broadcast_2_piece0 in memory on 10.0.0.9:42084 (size: 11.2 KB, free: 366.3 MB)
2017-07-20 08:45:10 INFO SparkContext:54 - Created broadcast 2 from broadcast at LoadVCF.scala:266
2017-07-20 08:45:10 INFO MemoryStore:54 - Block broadcast_3 stored as values in memory (estimated size 303.7 KB, free 366.0 MB)
2017-07-20 08:45:10 INFO MemoryStore:54 - Block broadcast_3_piece0 stored as bytes in memory (estimated size 28.2 KB, free 365.9 MB)
2017-07-20 08:45:10 INFO BlockManagerInfo:54 - Added broadcast_3_piece0 in memory on 10.0.0.9:42084 (size: 28.2 KB, free: 366.3 MB)
2017-07-20 08:45:10 INFO SparkContext:54 - Created broadcast 3 from textFile at RichSparkContext.scala:16
2017-07-20 08:45:10 INFO FileInputFormat:249 - Total input paths to process : 1
2017-07-20 08:45:10 INFO SparkContext:54 - Starting job: fold at RichRDD.scala:26
2017-07-20 08:45:10 INFO DAGScheduler:54 - Got job 0 (fold at RichRDD.scala:26) with 1 output partitions
2017-07-20 08:45:10 INFO DAGScheduler:54 - Final stage: ResultStage 0 (fold at RichRDD.scala:26)
2017-07-20 08:45:10 INFO DAGScheduler:54 - Parents of final stage: List()
2017-07-20 08:45:10 INFO DAGScheduler:54 - Missing parents: List()
2017-07-20 08:45:10 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[5] at mapPartitions at RichRDD.scala:24), which has no missing parents
2017-07-20 08:45:11 INFO MemoryStore:54 - Block broadcast_4 stored as values in memory (estimated size 4.9 KB, free 365.9 MB)
2017-07-20 08:45:11 INFO MemoryStore:54 - Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.7 KB, free 365.9 MB)
2017-07-20 08:45:11 INFO BlockManagerInfo:54 - Added broadcast_4_piece0 in memory on 10.0.0.9:42084 (size: 2.7 KB, free: 366.3 MB)
2017-07-20 08:45:11 INFO SparkContext:54 - Created broadcast 4 from broadcast at DAGScheduler.scala:996
2017-07-20 08:45:11 INFO DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[5] at mapPartitions at RichRDD.scala:24)
2017-07-20 08:45:11 INFO YarnScheduler:54 - Adding task set 0.0 with 1 tasks
2017-07-20 08:45:12 INFO ExecutorAllocationManager:54 - Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
2017-07-20 08:45:16 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Container marked as failed: container_1500479444467_0007_01_000002 on host: lsa12-dn2.eastus.cloudapp.azure.com. Exit status: 127. Diagnostics: Exception from container-launch.
Container id: container_1500479444467_0007_01_000002
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 127

2017-07-20 08:45:16 INFO BlockManagerMaster:54 - Removal of executor 1 requested
2017-07-20 08:45:16 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - Asked to remove non-existent executor 1
2017-07-20 08:45:16 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2017-07-20 08:45:22 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Container marked as failed: container_1500479444467_0007_01_000003 on host: lsa12-dn1.eastus.cloudapp.azure.com. Exit status: 127. Diagnostics: Exception from container-launch.
Container id: container_1500479444467_0007_01_000003
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 127

2017-07-20 08:45:22 INFO BlockManagerMaster:54 - Removal of executor 2 requested
2017-07-20 08:45:22 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 2 from BlockManagerMaster.
2017-07-20 08:45:22 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - Asked to remove non-existent executor 2
2017-07-20 08:45:26 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-07-20 08:45:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Container marked as failed: container_1500479444467_0007_01_000004 on host: lsa12-dn1.eastus.cloudapp.azure.com. Exit status: 127. Diagnostics: Exception from container-launch.
Container id: container_1500479444467_0007_01_000004
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 127

2017-07-20 08:45:29 INFO BlockManagerMaster:54 - Removal of executor 3 requested
2017-07-20 08:45:29 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 3 from BlockManagerMaster.
2017-07-20 08:45:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - Asked to remove non-existent executor 3
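
For what it’s worth, exit status 127 from container-launch is the shell’s “command not found”: something in the container’s launch command refers to a path that doesn’t exist on the worker node. In this configuration the obvious suspect is spark.executorEnv.JAVA_HOME=/opt/jdk1.8.0.131, which must exist at exactly that path on every data node, not just the driver. The container’s own stderr (fetched with yarn logs -applicationId application_1500479444467_0007) should name the missing command. A minimal sketch for checking the JDK path on each worker, assuming passwordless ssh and using hypothetical short names for the data nodes seen in the log above:

# Sketch: verify that the JAVA_HOME passed via spark.executorEnv.JAVA_HOME
# exists on every worker node. Hostnames are placeholders for the data
# nodes in the log above; adjust for your cluster.
import subprocess

JAVA_HOME = "/opt/jdk1.8.0.131"
workers = ["lsa12-dn0", "lsa12-dn1", "lsa12-dn2"]

for host in workers:
    # returncode 0 means an executable java binary is present at that path
    rc = subprocess.call(["ssh", host, "test", "-x", JAVA_HOME + "/bin/java"])
    print("%s: %s/bin/java %s" % (host, JAVA_HOME, "ok" if rc == 0 else "MISSING"))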