Run-time error when using spark-submit

My submit command is below, and it produces an error. I have tested some Hail functions: import_vcf() works, but actions such as count(), show(), take(), or export_vcf() fail.

spark2-submit --master yarn --jars "$HAIL_HOME/build/libs/hail-all-spark.jar" \
  --py-files "$HAIL_HOME/build/distributions/hail-python.zip" \
  --conf spark.driver.extraClassPath="$HAIL_HOME/build/libs/hail-all-spark.jar" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  --num-executors 10 --executor-memory 20g --executor-cores 4 stat_call_rate.py

# stat_call_rate.py
import hail as hl
import hail.expr.aggregators as agg
hl.init(app_name="hail_statCallRate", log="cr.log", tmp_dir="file:///path_prefix/population", default_reference="GRCh38")
vcf = hl.import_vcf("file:///path_prefix/test.vcf")
vcf = hl.sample_qc(vcf)
hl.export_vcf(vcf, "file:///path_prefix/test.vcf.bgz")

ERROR:
[Stage 1:> (0 + 7) / 7]Traceback (most recent call last):
  File "stat_call_rate.py", line 8, in <module>
    hl.export_vcf(vcf,"file:///test_path/test.vcf.bgz")
  File "", line 2, in export_vcf
  File "/path_prefix/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
  File "/path_prefix/hail/build/distributions/hail-python.zip/hail/methods/impex.py", line 422, in export_vcf
  File "/path_prefix/SPARK2/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/path_prefix/hail/build/distributions/hail-python.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 20, sz-hadoop-55-60.local, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_1537175360952_0263_01_000012 on host: sz-hadoop-55-60.local. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1537175360952_0263_01_000012
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:373)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)

Are you able to run pyspark on your cluster?

I tried just now:
pyspark2 --jars $HAIL_HOME/build/libs/hail-all-spark.jar --conf spark.driver.extraClassPath=$HAIL_HOME/build/libs/hail-all-spark.jar --conf spark.executor.extraClassPath=./hail-all-spark.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator

# interactive commands
import hail as hl
hl.init(sc)
vcf = hl.import_vcf("file:///prefix_path/test.vcf", reference_genome="GRCh38")  # everything works up to here
vcf.count()  # fails with the error below

[Stage 1:> (0 + 2) / 7]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/prefix_path/hail/build/distributions/hail-python.zip/hail/matrixtable.py", line 2117, in count
  File "/prefix_path/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/prefix_path/hail/build/distributions/hail-python.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 15, seps-hadoop-a02-4.seps.sz.hpc, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Container marked as failed: container_e26_1537320290255_262214_01_000024 on host: seps-hadoop-a02-4.seps.sz.hpc. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e26_1537320290255_262214_01_000024
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:373)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)

Shell output: main : command provided 1
main : run as user is wade
main : requested yarn user is wade
Writing to tmp file /export/sdh/yarn/nm/nmPrivate/application_1537320290255_262214/container_e26_1537320290255_262214_01_000024/container_e26_1537320290255_262214_01_000024.pid.tmp

Container exited with a non-zero exit code 1

Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 15, seps-hadoop-a02-4.seps.sz.hpc, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Container marked as failed: container_e26_1537320290255_262214_01_000024 on host: seps-hadoop-a02-4.seps.sz.hpc. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e26_1537320290255_262214_01_000024
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:373)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)

Shell output: main : command provided 1
main : run as user is wade
main : requested yarn user is wade
Writing to tmp file /export/sdh/yarn/nm/nmPrivate/application_1537320290255_262214/container_e26_1537320290255_262214_01_000024/container_e26_1537320290255_262214_01_000024.pid.tmp

Container exited with a non-zero exit code 1

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:139)
at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1063)
at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1127)
at is.hail.io.vcf.MatrixVCFReader.coercer$lzycompute(LoadVCF.scala:976)
at is.hail.io.vcf.MatrixVCFReader.coercer(LoadVCF.scala:976)
at is.hail.io.vcf.MatrixVCFReader.apply(LoadVCF.scala:1010)
at is.hail.expr.ir.MatrixRead.execute(MatrixIR.scala:427)
at is.hail.expr.ir.CastMatrixToTable.execute(TableIR.scala:1169)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply$mcJ$sp(Interpret.scala:695)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply(Interpret.scala:695)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply(Interpret.scala:695)
at scala.Option.getOrElse(Option.scala:121)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:695)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:107)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:77)
at is.hail.variant.MatrixTable.countRows(MatrixTable.scala:614)
at is.hail.variant.MatrixTable.count(MatrixTable.scala:612)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:744)
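
A trace like this only says that the container exited with code 1; the real cause is usually in the failed container's own logs. Here is a minimal sketch for finding them, assuming the usual YARN naming in which a container ID embeds its application ID (the nmPrivate path in the shell output above follows that pattern). The helper name `application_id` is hypothetical and not part of Hail or Spark.

```python
import re

# Assumed layout: container_[e<epoch>_]<clusterTimestamp>_<appNumber>_<attempt>_<containerNumber>
CONTAINER_RE = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+")


def application_id(text):
    """Return the YARN application ID embedded in the first container ID
    found in `text`, or None if no container ID is present."""
    match = CONTAINER_RE.search(text)
    if match is None:
        return None
    return f"application_{match.group(1)}_{match.group(2)}"


# Line copied from the error message in this thread.
message = "Container id: container_e26_1537320290255_262214_01_000024"
app_id = application_id(message)

# The aggregated executor logs can then be requested from YARN, e.g.:
#   yarn logs -applicationId <app_id>
print(app_id)
```

The stderr of the failed executor in those logs should show why the container could not start or why the task died.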

To confirm that pyspark alone is working, what happens if you do the following?

pyspark2

then

>>> spark.range(10).show()

pyspark2 works well:

>>> spark.range(10).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

What system are you running Hail on, and how did you install Hail?

Operating system information: [screenshot]

Running spark-shell returns the following text:

Install steps (these appeared to succeed):
git clone https://github.com/hail-is/hail.git
cd hail/hail
./gradlew -Dspark.version=2.2.0 shadowJar archiveZip

@danking @wang do you think the native libs are the problem?

Try

./gradlew nativeLibPrebuilt
./gradlew -Dspark.version=2.2.0 shadowJar archiveZip

Apologies, this is a known bug that occurs when compiling Hail manually.

I re-installed Hail with the two commands you supplied, but the same errors appear when I run actions such as hl.export_vcf() or count().

Oh sorry, I totally missed that this is wrong.

file:/// refers to a local file. file:///prefix_path/test.vcf almost certainly does not exist on every worker node, so executors on other hosts cannot read it. You need to put the VCF in HDFS.
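
To illustrate the point above, here is a rough sketch that flags file:// URIs before they are passed to Hail. The helper name `is_node_local` and the HDFS destination `/user/wade/` are assumptions made for the example; neither comes from Hail or from this cluster's actual layout.

```python
from urllib.parse import urlparse


def is_node_local(uri):
    """Return True if `uri` uses the file:// scheme, meaning every Spark
    executor would look for it on its own local disk."""
    return urlparse(uri).scheme == "file"


# Path from this thread: it exists only on the machine where it was created.
local_vcf = "file:///prefix_path/test.vcf"

# Hypothetical HDFS location, e.g. after copying the file with:
#   hdfs dfs -put /prefix_path/test.vcf /user/wade/
hdfs_vcf = "hdfs:///user/wade/test.vcf"

for path in (local_vcf, hdfs_vcf):
    if is_node_local(path):
        print(f"{path}: node-local, executors on other hosts will not find it")
    else:
        print(f"{path}: not node-local")

# With the file in HDFS, the import from this thread would become (requires Hail):
#   vcf = hl.import_vcf(hdfs_vcf, reference_genome="GRCh38")
```

A path with no scheme at all is resolved against Hadoop's default filesystem, which on a YARN cluster is typically HDFS, so the sketch treats only an explicit file:// as node-local. The same reasoning applies to the output path given to export_vcf() and to tmp_dir in hl.init().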