Run-time error when using spark-submit

whu_wade · November 13, 2018, 1:34am

My submit command is below, and it produces an error. I have tested some Hail functions: import_vcf() works, but action functions such as count(), show(), take(), or export_vcf() fail with errors.

spark2-submit --master yarn --jars "$HAIL_HOME/build/libs/hail-all-spark.jar" \
--py-files "$HAIL_HOME/build/distributions/hail-python.zip" \
--conf spark.driver.extraClassPath="$HAIL_HOME/build/libs/hail-all-spark.jar" \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
--num-executors 10 --executor-memory 20g --executor-cores 4 stat_call_rate.py

#stat_call_rate.py
import hail as hl
import hail.expr.aggregators as agg
hl.init(app_name='hail_statCallRate', log='cr.log', tmp_dir='file:///path_prefix/population', default_reference='GRCh38')
vcf = hl.import_vcf("file:///path_prefix/test.vcf")
vcf = hl.sample_qc(vcf)
hl.export_vcf(vcf, "file:///path_prefix/test.vcf.bgz")

ERROR:
[Stage 1:> (0 + 7) / 7]Traceback (most recent call last):
File "stat_call_rate.py", line 8, in <module>
hl.export_vcf(vcf,"file:///test_path/test.vcf.bgz")
File "", line 2, in export_vcf
File "/path_prefix/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
File "/path_prefix/hail/build/distributions/hail-python.zip/hail/methods/impex.py", line 422, in export_vcf
File "/path_prefix/SPARK2/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/path_prefix/hail/build/distributions/hail-python.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 20, sz-hadoop-55-60.local, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_1537175360952_0263_01_000012 on host: sz-hadoop-55-60.local. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1537175360952_0263_01_000012
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:373)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)

tpoterba · November 13, 2018, 11:00am

Are you able to run pyspark on your cluster?

whu_wade · November 14, 2018, 1:26am

I tried just now:
pyspark2 --jars $HAIL_HOME/build/libs/hail-all-spark.jar --conf spark.driver.extraClassPath=$HAIL_HOME/build/libs/hail-all-spark.jar --conf spark.executor.extraClassPath=./hail-all-spark.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator

#interactive command
import hail as hl
hl.init(sc)
vcf = hl.import_vcf("file:///prefix_path/test.vcf", reference_genome='GRCh38')  # everything is OK up to here
vcf.count()  # fails with the error below

[Stage 1:> (0 + 2) / 7]Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/prefix_path/hail/build/distributions/hail-python.zip/hail/matrixtable.py", line 2117, in count
File "/prefix_path/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/prefix_path/hail/build/distributions/hail-python.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 15, seps-hadoop-a02-4.seps.sz.hpc, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Container marked as failed: container_e26_1537320290255_262214_01_000024 on host: seps-hadoop-a02-4.seps.sz.hpc. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e26_1537320290255_262214_01_000024
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:373)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)

Shell output: main : command provided 1
main : run as user is wade
main : requested yarn user is wade
Writing to tmp file /export/sdh/yarn/nm/nmPrivate/application_1537320290255_262214/container_e26_1537320290255_262214_01_000024/container_e26_1537320290255_262214_01_000024.pid.tmp

Container exited with a non-zero exit code 1

Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 15, seps-hadoop-a02-4.seps.sz.hpc, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Container marked as failed: container_e26_1537320290255_262214_01_000024 on host: seps-hadoop-a02-4.seps.sz.hpc. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e26_1537320290255_262214_01_000024
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:373)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)

Shell output: main : command provided 1
main : run as user is wade
main : requested yarn user is wade
Writing to tmp file /export/sdh/yarn/nm/nmPrivate/application_1537320290255_262214/container_e26_1537320290255_262214_01_000024/container_e26_1537320290255_262214_01_000024.pid.tmp

Container exited with a non-zero exit code 1

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:139)
at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1063)
at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1127)
at is.hail.io.vcf.MatrixVCFReader.coercer$lzycompute(LoadVCF.scala:976)
at is.hail.io.vcf.MatrixVCFReader.coercer(LoadVCF.scala:976)
at is.hail.io.vcf.MatrixVCFReader.apply(LoadVCF.scala:1010)
at is.hail.expr.ir.MatrixRead.execute(MatrixIR.scala:427)
at is.hail.expr.ir.CastMatrixToTable.execute(TableIR.scala:1169)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply$mcJ$sp(Interpret.scala:695)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply(Interpret.scala:695)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply(Interpret.scala:695)
at scala.Option.getOrElse(Option.scala:121)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:695)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:107)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:77)
at is.hail.variant.MatrixTable.countRows(MatrixTable.scala:614)
at is.hail.variant.MatrixTable.count(MatrixTable.scala:612)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:744)
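
The driver-side trace above only says that the container exited with code 1; the executor's own log usually holds the underlying cause. As a minimal sketch (not part of Hail or Spark), the following derives the YARN application ID from the container IDs in a pasted trace and builds the matching `yarn logs -applicationId` command. The helper names `application_ids` and `yarn_logs_command` are hypothetical, and the ID layout is assumed from the container and application names that appear together in the trace.

```python
import re
from typing import List

# Matches YARN container IDs such as
#   container_e26_1537320290255_262214_01_000024
#   container_1537175360952_0263_01_000012
# The "e26_" epoch part is optional. The two captured groups are the
# cluster timestamp and the application sequence number.
CONTAINER_RE = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+")


def application_ids(trace: str) -> List[str]:
    """Return the distinct application IDs implied by container IDs in a trace."""
    seen: List[str] = []
    for cluster_ts, app_seq in CONTAINER_RE.findall(trace):
        app_id = f"application_{cluster_ts}_{app_seq}"
        if app_id not in seen:
            seen.append(app_id)
    return seen


def yarn_logs_command(app_id: str) -> List[str]:
    """Build the argument list for fetching aggregated logs of one application."""
    return ["yarn", "logs", "-applicationId", app_id]


if __name__ == "__main__":
    sample = (
        "Container marked as failed: "
        "container_e26_1537320290255_262214_01_000024 on host: "
        "seps-hadoop-a02-4.seps.sz.hpc. Exit status: 1."
    )
    for app in application_ids(sample):
        # Print the command instead of running it; run it on a cluster node.
        print(" ".join(yarn_logs_command(app)))
```

Running the printed command on a node with the YARN client configured should show the executor's stderr, assuming log aggregation is enabled on the cluster.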

tpoterba · November 14, 2018, 2:30am

To confirm that pyspark alone is working, what happens if you do the following?

pyspark2

then

>>> spark.range(10).show()

whu_wade · November 14, 2018, 5:53am

pyspark2 works well:

spark.range(10).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

tpoterba · November 14, 2018, 12:45pm

What system are you running Hail on, and how did you install Hail?

whu_wade · November 14, 2018, 2:48pm

Operating system information:

When I run spark-shell, it returns the following text:

Install steps (these seemed to succeed):
git clone https://github.com/hail-is/hail.git
cd hail/hail
./gradlew -Dspark.version=2.2.0 shadowJar archiveZip

tpoterba · November 14, 2018, 4:17pm

@danking @wang do you think the native libs are the problem?

danking · November 14, 2018, 7:16pm

Try

./gradlew nativeLibPrebuilt
./gradlew -Dspark.version=2.2.0 shadowJar archiveZip

Apologies, this is a known compilation bug when compiling manually.

whu_wade · November 15, 2018, 2:20am

I re-installed Hail with the two commands you supplied, but the same errors appear again when running action functions such as hl.export_vcf() or count().

danking · November 15, 2018, 3:15pm

Oh sorry, I totally missed that this is wrong.

file:/// refers to a local file, and file:///prefix_path/test.vcf almost certainly does not exist on every worker node. You need to put the VCF in HDFS.
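
A minimal sketch of that advice, assuming a cluster where HDFS is reachable from every node: it rejects `file://` inputs before handing them to Hail and builds the `hdfs dfs -put` command needed to copy the VCF into HDFS. The helper names and the destination `hdfs:///user/wade/` are hypothetical; only `hl.import_vcf` and its `reference_genome` argument come from the thread.

```python
from typing import List
from urllib.parse import urlparse


def is_worker_local(path: str) -> bool:
    """True when the path explicitly uses the file:// scheme.

    Such a path is resolved on whichever machine runs the task, so it
    only works if the file exists at that location on every worker.
    """
    return urlparse(path).scheme == "file"


def hdfs_put_command(local_uri: str, hdfs_dir: str) -> List[str]:
    """Build the argument list that copies a local file into an HDFS directory."""
    local_path = urlparse(local_uri).path
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def import_vcf_checked(path: str):
    """Import a VCF with Hail, refusing paths that are local to one machine."""
    if is_worker_local(path):
        raise ValueError(
            f"{path} is a local path; copy it to HDFS first, e.g. "
            + " ".join(hdfs_put_command(path, "hdfs:///user/wade/"))  # hypothetical destination
        )
    import hail as hl  # imported lazily so the checks above work without Hail installed
    return hl.import_vcf(path, reference_genome="GRCh38")


if __name__ == "__main__":
    src = "file:///prefix_path/test.vcf"
    print(is_worker_local(src))
    print(" ".join(hdfs_put_command(src, "hdfs:///user/wade/")))
```

After the copy, something like `import_vcf_checked("hdfs:///user/wade/test.vcf").count()` would read the file from storage that all executors can see. The same reasoning applies to the output path of `hl.export_vcf` and to `tmp_dir` in `hl.init`.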
