I am trying to run Hail on a Google Cloud Dataproc cluster and am having trouble figuring out the cause of the error messages I am receiving. I have 5 GVCFs that have been subsetted to just chr1, and I am running the following:
$ hailctl dataproc start my_cluster_A --project=hail --debug-mode --master-machine-type=n1-highmem-32 --region=us-central1 --max-idle 1h --autoscaling-policy=20k-preemptibles --packages gnomad --master-boot-disk-size 1000
$ gcloud dataproc jobs submit pyspark test_combiner.py \
--cluster=my_cluster_A \
--project hail \
--files=gs://iw-hail/000-gvcfs/chr1/chr1_sample_map.tsv \
--region=us-central1 \
--driver-log-levels root=WARN \
-- \
-g gs://iw-hail/000-metadata-files/header.g.vcf.gz \
-s gs://iw-hail/chr1/chr1_sample_map.tsv \
-c gs://iw-hail/000-outputs/20211021.chr1.mt \
-t gs://iw-hail/000-temp/tmp/ \
-o
The test_combiner.py file looks like this:
import hail as hl
import argparse

hl.init(log='/home/hail/combiner.log')


def get_args():
    argparser = argparse.ArgumentParser(description=__doc__)
    argparser.add_argument("--sample_map", "-s", required=True)
    argparser.add_argument("--output_cloud_path", "-c", required=True)
    argparser.add_argument("--tmp_bucket", "-t", required=True)
    argparser.add_argument("--gvcf_header_file", "-g", required=True)
    argparser.add_argument("--overwrite_existing", "-o", action='store_true')
    return argparser.parse_args()


def get_gvcf_and_sample_from_map(sample_map):
    # Parse the two-column (sample, GVCF path) TSV into parallel lists.
    gvcfs = []
    samples = []
    with hl.hadoop_open(sample_map, 'r') as f:
        for line in f:
            (sample, gvcf) = line.rstrip().split('\t')
            gvcfs.append(gvcf)
            samples.append(sample)
    return gvcfs, samples


if __name__ == "__main__":
    args = get_args()
    gvcf_list, samples_list = get_gvcf_and_sample_from_map(args.sample_map)
    hl.experimental.run_combiner(
        gvcf_list,
        sample_names=samples_list,
        header=args.gvcf_header_file,
        out_file=args.output_cloud_path,
        tmp_path=args.tmp_bucket,
        key_by_locus_and_alleles=True,
        overwrite=args.overwrite_existing,
        reference_genome='GRCh38',
        use_genome_default_intervals=True,
        target_records=10000,
    )
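For reference, the sample map that get_gvcf_and_sample_from_map parses is just a two-column, tab-separated file of sample name and GVCF path, one line per sample. A made-up example (the real sample names and gs:// paths differ) would look like:
$ cat chr1_sample_map.tsv
sample_01	gs://iw-hail/000-gvcfs/chr1/sample_01.chr1.g.vcf.gz
sample_02	gs://iw-hail/000-gvcfs/chr1/sample_02.chr1.g.vcf.gz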
The output looks like this:
Job [eed02ccac5ff4641899791e464d3e738] submitted.
Waiting for job output...
Running on Apache Spark version 3.1.1
SparkUI available at http://my_cluster_A-m.c.hail.internal:44593
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.77-684f32d73643
LOGGING: writing to /home/hail/combiner.log
2021-10-21 16:10:21 Hail: INFO: Using 2586 intervals with default whole-genome size 1200000 as partitioning for GVCF import
2021-10-21 16:10:21 Hail: INFO: GVCF combiner plan:
Branch factor: 100
Phase 1 batch size: 100
Combining 5 input files in 1 phases with 1 total jobs.
Phase 1: 1 job corresponding to 1 final output file.
2021-10-21 16:10:21 Hail: INFO: Starting phase 1/1, merging 5 input GVCFs in 1 job.
2021-10-21 16:10:21 Hail: INFO: Starting phase 1/1, job 1/1 to create 1 merged file, corresponding to ~100.0% of total I/O.
[Stage 0:> (0 + 8) / 2586]
Traceback (most recent call last):
File "/tmp/eed02ccac5ff4641899791e464d3e738/test_combiner.py", line 28, in <module>
hl.experimental.run_combiner(
File "/opt/conda/default/lib/python3.8/site-packages/hail/experimental/vcf_combiner/vcf_combiner.py", line 708, in run_combiner
final_mt.write(out_file, overwrite=overwrite)
File "<decorator-gen-1257>", line 2, in write
File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return __original_func(*args_, **kwargs_)
File "/opt/conda/default/lib/python3.8/site-packages/hail/matrixtable.py", line 2529, in write
Env.backend().execute(ir.MatrixWrite(self._mir, writer))
File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
raise e
File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 74, in execute
result = json.loads(self._jhc.backend().executeJSON(jir))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 63 in stage 0.0 failed 20 times, most recent failure: Lost task 63.19 in stage 0.0 (TID 4312) (my_cluster_A-sw-m1kj.c.-hail.internal executor 737): ExecutorLostFailure (executor 737 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634832053408_0002_01_000744 on host: my_cluster_A-sw-m1kj.c.-hail.internal. Exit status: 134. Diagnostics: [2021-10-21 16:16:52.668]Exception from container-launch.
Container id: container_1634832053408_0002_01_000744
Exit code: 134
[2021-10-21 16:16:52.670]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 5475 Aborted /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :
[2021-10-21 16:16:52.671]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 5475 Aborted /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :
.
Driver stacktrace:
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 63 in stage 0.0 failed 20 times, most recent failure: Lost task 63.19 in stage 0.0 (TID 4312) (my_cluster_A-sw-m1kj.c.-hail.internal executor 737): ExecutorLostFailure (executor 737 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634832053408_0002_01_000744 on host: my_cluster_A-sw-m1kj.c.-hail.internal. Exit status: 134. Diagnostics: [2021-10-21 16:16:52.668]Exception from container-launch.
Container id: container_1634832053408_0002_01_000744
Exit code: 134
[2021-10-21 16:16:52.670]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 5475 Aborted /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :
[2021-10-21 16:16:52.671]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 5475 Aborted /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :
.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2254)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2203)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2202)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2202)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2441)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2383)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2372)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:286)
at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:28)
at __C57Compiled.__m93split_Let(Emit.scala)
at __C57Compiled.apply(Emit.scala)
at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$3(CompileAndEvaluate.scala:56)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:56)
at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:29)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:29)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:66)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:71)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:68)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:15)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:15)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:13)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:12)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:63)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:14)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:12)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:46)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
at is.hail.backend.spark.SparkBackend.$anonfun$execute$1(SparkBackend.scala:365)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:362)
at is.hail.backend.spark.SparkBackend.$anonfun$executeJSON$1(SparkBackend.scala:406)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:404)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.77-684f32d73643
Error summary: SparkException: Job aborted due to stage failure: Task 63 in stage 0.0 failed 20 times, most recent failure: Lost task 63.19 in stage 0.0 (TID 4312) (my_cluster_A-sw-m1kj.c.-hail.internal executor 737): ExecutorLostFailure (executor 737 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634832053408_0002_01_000744 on host: my_cluster_A-sw-m1kj.c.-hail.internal. Exit status: 134. Diagnostics: [2021-10-21 16:16:52.668]Exception from container-launch.
Container id: container_1634832053408_0002_01_000744
Exit code: 134
[2021-10-21 16:16:52.670]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 5475 Aborted /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :
[2021-10-21 16:16:52.671]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 5475 Aborted /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :
.
Driver stacktrace:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [eed02ccac5ff4641899791e464d3e738] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/eed02ccac5ff4641899791e464d3e738?project=-hail&region=us-central1
gcloud dataproc jobs wait 'eed02ccac5ff4641899791e464d3e738' --region 'us-central1' --project '-hail'
https://console.cloud.google.com/storage/browser/dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/eed02ccac5ff4641899791e464d3e738/
gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/eed02ccac5ff4641899791e464d3e738/driveroutput
The exit codes seem to suggest that I am running low on memory, but increasing the master machine type, the boot disk size, and the autoscaling policy produces seemingly the same error logs. I also would not expect five ~500 MB GVCFs to be memory hungry at all.
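If it really is executor memory, I assume the next thing to try would be raising the Spark executor memory at job submission rather than resizing the master, something like the sketch below (the --properties flag is from the gcloud docs, spark.executor.memory / spark.executor.memoryOverhead are standard Spark properties, the values are guesses, and I have not confirmed these are the right knobs for the Hail combiner; the --files, --driver-log-levels, and script arguments would stay the same as in the original submission):
$ gcloud dataproc jobs submit pyspark test_combiner.py \
    --cluster=my_cluster_A \
    --project hail \
    --region=us-central1 \
    --properties=spark.executor.memory=20g,spark.executor.memoryOverhead=4g
But before throwing more memory at it, I would like to understand what is actually failing.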
I have verified that I am able to run Hail in this environment:
$ cat test_py.py
import hail as hl
mt = hl.balding_nichols_model(n_populations=3, n_samples=1000, n_variants=1000, n_partitions=64)
mt = hl.variant_qc(mt)
mt = hl.sample_qc(mt)
mt._force_count_rows()
$ gcloud dataproc jobs submit pyspark test_py.py --cluster=my_cluster --project hail --region=us-central1 --driver-log-levels root=WARN
Job [4c06a50c09924b558b989164f1814455] submitted.
Waiting for job output...
Initializing Hail with default parameters...
Running on Apache Spark version 3.1.1
SparkUI available at http://andersoncallsete-m.c.strokeanderson-hail.internal:36263
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.77-684f32d73643
LOGGING: writing to /home/hail/hail-20211021-1707-0.2.77-684f32d73643.log
2021-10-21 17:07:21 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 1000 samples, and 1000 variants...
2021-10-21 17:07:32 Hail: INFO: Coerced sorted dataset
Job [4c06a50c09924b558b989164f1814455] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/4c06a50c09924b558b989164f1814455/
driverOutputResourceUri: gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/4c06a50c09924b558b989164f1814455/driveroutput
jobUuid: 0f65b286-95ce-3693-a6cf-c5a53fa96b5d
placement:
clusterName: andersoncallsete
clusterUuid: 76ae5b0a-a898-4d74-b08b-39c8573307e9
pysparkJob:
loggingConfig:
driverLogLevels:
root: WARN
mainPythonFileUri: gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/4c06a50c09924b558b989164f1814455/staging/test_py.py
reference:
jobId: 4c06a50c09924b558b989164f1814455
projectId: strokeanderson-hail
status:
state: DONE
stateStartTime: '2021-10-21T17:07:36.496212Z'
statusHistory:
- state: PENDING
stateStartTime: '2021-10-21T17:07:09.132834Z'
- state: SETUP_DONE
stateStartTime: '2021-10-21T17:07:09.181490Z'
- details: Agent reported job success
state: RUNNING
stateStartTime: '2021-10-21T17:07:09.407719Z'
yarnApplications:
- name: Hail
progress: 1.0
state: FINISHED
trackingUrl: http://andersoncallsete-m:8088/proxy/application_1634832053408_0004/
Is there any other stack trace I should be looking at?