Container killed on request. Exit code is 137

Hi Hail team!

I’m trying to run this script (on the freeze_7 branch) to re-export the 455K UKBB VCFs. I’m running into the same error Julia reported here when trying to generate the VCF MT:

Traceback (most recent call last):
  File "/tmp/aab33e95af5b4042a1ce5ae82eab0362/prepare_vcf_data_release.py", line 827, in <module>
    main(args)
  File "/tmp/aab33e95af5b4042a1ce5ae82eab0362/prepare_vcf_data_release.py", line 620, in main
    mt.write(
  File "<decorator-gen-1257>", line 2, in write
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/matrixtable.py", line 2544, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 110, in execute
    raise e
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 86, in execute
    result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 29, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 2333 in stage 37.0 failed 20 times, most recent failure: Lost task 2333.23 in stage 37.0 (TID 68209) (kc-sw-mv41.c.maclab-ukbb.internal executor 9488): ExecutorLostFailure (executor 9488 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634933184806_0003_01_009505 on host: kc-sw-mv41.c.maclab-ukbb.internal. Exit status: 137. Diagnostics: [2021-10-23 00:07:44.351]Container killed on request. Exit code is 137
[2021-10-23 00:07:44.351]Container exited with a non-zero exit code 137.
[2021-10-23 00:07:44.352]Killed by external signal
.
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2333 in stage 37.0 failed 20 times, most recent failure: Lost task 2333.23 in stage 37.0 (TID 68209) (kc-sw-mv41.c.maclab-ukbb.internal executor 9488): ExecutorLostFailure (executor 9488 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634933184806_0003_01_009505 on host: kc-sw-mv41.c.maclab-ukbb.internal. Exit status: 137. Diagnostics: [2021-10-23 00:07:44.351]Container killed on request. Exit code is 137
[2021-10-23 00:07:44.351]Container exited with a non-zero exit code 137.
[2021-10-23 00:07:44.352]Killed by external signal
.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2254)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2203)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2202)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2202)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2441)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2383)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2372)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
	at is.hail.rvd.RVD.writeRowsSplit(RVD.scala:945)
	at is.hail.expr.ir.MatrixValue.write(MatrixValue.scala:258)
	at is.hail.expr.ir.MatrixNativeWriter.apply(MatrixWriter.scala:72)
	at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:46)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:852)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:57)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:417)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:414)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:413)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.78-b17627756568
Error summary: SparkException: Job aborted due to stage failure: Task 2333 in stage 37.0 failed 20 times, most recent failure: Lost task 2333.23 in stage 37.0 (TID 68209) (kc-sw-mv41.c.maclab-ukbb.internal executor 9488): ExecutorLostFailure (executor 9488 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634933184806_0003_01_009505 on host: kc-sw-mv41.c.maclab-ukbb.internal. Exit status: 137. Diagnostics: [2021-10-23 00:07:44.351]Container killed on request. Exit code is 137
[2021-10-23 00:07:44.351]Container exited with a non-zero exit code 137.
[2021-10-23 00:07:44.352]Killed by external signal
.
Driver stacktrace:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [aab33e95af5b4042a1ce5ae82eab0362] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/aab33e95af5b4042a1ce5ae82eab0362?project=maclab-ukbb&region=us-central1
gcloud dataproc jobs wait 'aab33e95af5b4042a1ce5ae82eab0362' --region 'us-central1' --project 'maclab-ukbb'
https://console.cloud.google.com/storage/browser/dataproc-1aca38e4-67fe-4b64-b451-258ef1aea4d1-us-central1/google-cloud-dataproc-metainfo/7d3d450b-655d-4c19-a713-0d7fb2e8d196/jobs/aab33e95af5b4042a1ce5ae82eab0362/
gs://dataproc-1aca38e4-67fe-4b64-b451-258ef1aea4d1-us-central1/google-cloud-dataproc-metainfo/7d3d450b-655d-4c19-a713-0d7fb2e8d196/jobs/aab33e95af5b4042a1ce5ae82eab0362/driveroutput
Traceback (most recent call last):
  File "/Users/kchao/anaconda3/envs/hail/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
    jmp[args.module].main(args, pass_through_args)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    gcloud.run(cmd)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/subprocess.py", line 328, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'prepare_vcf_data_release.py', '--cluster=kc', '--files=', '--py-files=/var/folders/xq/8jnhrt2s2h58ts2v0br5g8gm0000gp/T/pyscripts_5umzcx7s.zip', '--properties=', '--', '--prepare_vcf_mt', '--slack_channel', '@kc (she/her)']' returned non-zero exit status 1.

I used this cluster config:

hailctl dataproc start kc --autoscaling-policy=autoscale_densify --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --master-boot-disk-size 600 --preemptible-worker-boot-disk-size 500 --worker-boot-disk-size 500 --project maclab-ukbb --max-idle=30m --packages gnomad --num-workers 10 --requester-pays-allow-all

autoscale_densify: [autoscaling policy settings omitted]
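
(The policy settings themselves aren't shown above. For anyone reproducing this, the policy can be inspected with gcloud, assuming it lives in the same project and region as the cluster:)

# Inspect the autoscaling policy used above (assumes same project/region as the cluster)
gcloud dataproc autoscaling-policies describe autoscale_densify \
    --project maclab-ukbb \
    --region us-central1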

This is the path to the log (I can also email it; it’s 1 GB): gs://broad-ukbb/broad.freeze_7/temp/logs.

Do you think updating my cluster configuration would help with this exit code 137 error? I’d appreciate any help!

P.S. Looking back on my notes, this step had some memory errors when I last ran it in March (specifically with this line: mt = hl.filter_intervals(mt, [hl.parse_locus_interval("chrM")], keep=False)). Once the issues associated with filter_intervals were fixed, however, I was able to create the VCF MT in 1 hour and 56 minutes.

Hey @ch-kr! Sorry you’re running into trouble. I’ll find someone to dig into this asap. Your cluster configuration seems fine to me. There are more drastic steps we could take, but let’s have someone dig into the logs a bit first.


Could you try this again with ~300 max secondary workers?

Your log has the same inexplicable broadcast/communication errors we saw running the combiner last week, and we solved that by shrinking the cluster.
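
For reference, a Dataproc autoscaling policy capped at ~300 secondary workers looks roughly like the sketch below. The instance counts and scaling factors are illustrative placeholders, not your actual autoscale_densify settings:

# Illustrative sketch only (not the real autoscale_densify policy).
# Caps secondary (preemptible) workers at 300; other values are placeholders.
cat > autoscale_300.yaml <<'EOF'
workerConfig:
  minInstances: 10
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 300
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 0s
EOF

gcloud dataproc autoscaling-policies import autoscale_300 \
    --source autoscale_300.yaml \
    --project maclab-ukbb \
    --region us-central1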


Thank you both for the quick replies!

@tpoterba I tried using a cluster with an autoscaling policy that scales up to 300 secondary workers and got this error:

Traceback (most recent call last):
  File "/tmp/2f9713fc7b344911be340e11e059c000/prepare_vcf_data_release.py", line 827, in <module>
    main(args)
  File "/tmp/2f9713fc7b344911be340e11e059c000/prepare_vcf_data_release.py", line 620, in main
    mt.write(
  File "<decorator-gen-1257>", line 2, in write
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/matrixtable.py", line 2544, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 110, in execute
    raise e
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 86, in execute
    result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 29, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: SparkException: Job 31 cancelled because SparkContext was shut down

Java stack trace:
org.apache.spark.SparkException: Job 31 cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1084)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1082)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1082)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2459)
	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2365)
	at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2075)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1419)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:2075)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:124)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
	at is.hail.rvd.RVD.writeRowsSplit(RVD.scala:945)
	at is.hail.expr.ir.MatrixValue.write(MatrixValue.scala:258)
	at is.hail.expr.ir.MatrixNativeWriter.apply(MatrixWriter.scala:72)
	at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:46)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:852)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:57)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:417)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:414)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:413)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.78-b17627756568
Error summary: SparkException: Job 31 cancelled because SparkContext was shut down
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [2f9713fc7b344911be340e11e059c000] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/2f9713fc7b344911be340e11e059c000?project=maclab-ukbb&region=us-central1
gcloud dataproc jobs wait '2f9713fc7b344911be340e11e059c000' --region 'us-central1' --project 'maclab-ukbb'
https://console.cloud.google.com/storage/browser/dataproc-1aca38e4-67fe-4b64-b451-258ef1aea4d1-us-central1/google-cloud-dataproc-metainfo/db6d0dc2-6efe-43b9-ad8b-b52c3bbc014b/jobs/2f9713fc7b344911be340e11e059c000/
gs://dataproc-1aca38e4-67fe-4b64-b451-258ef1aea4d1-us-central1/google-cloud-dataproc-metainfo/db6d0dc2-6efe-43b9-ad8b-b52c3bbc014b/jobs/2f9713fc7b344911be340e11e059c000/driveroutput
Traceback (most recent call last):
  File "/Users/kchao/anaconda3/envs/hail/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
    jmp[args.module].main(args, pass_through_args)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    gcloud.run(cmd)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/subprocess.py", line 328, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'prepare_vcf_data_release.py', '--cluster=kc', '--files=', '--py-files=/var/folders/xq/8jnhrt2s2h58ts2v0br5g8gm0000gp/T/pyscripts__uj2daed.zip', '--properties=', '--', '--prepare_vcf_mt', '--slack_channel', '@kc (she/her)', '--overwrite']' returned non-zero exit status 1.

I’ll email the log now

A non-update update: I’m digging into this further. I don’t yet have a planned resolution. It seems possible that Spark is having trouble with the ratio of primary workers to secondary workers. It’s not clear to me why or when this started happening.


We got this one working by tweaking some Spark memory settings – should have updated the post.

Thanks for the update (and for the reminder to post again)! Using a smaller cluster (I set up a cluster to autoscale to 50 secondary workers and then adjusted it to scale to 200 secondary workers) with this flag --properties 'spark:spark.executor.memoryOverhead=12g' worked for this job. As a quick note, though, the job ran for almost 22 hours.
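
For anyone hitting the same error, the full start command with that property added looks roughly like the sketch below. It is based on my original config above; autoscale_200 is a stand-in name for whatever smaller autoscaling policy you use:

# Sketch of the start command with the extra executor memory overhead.
# autoscale_200 is a hypothetical policy capped at 200 secondary workers.
hailctl dataproc start kc \
    --autoscaling-policy=autoscale_200 \
    --master-machine-type n1-highmem-8 \
    --worker-machine-type n1-highmem-8 \
    --master-boot-disk-size 600 \
    --worker-boot-disk-size 500 \
    --preemptible-worker-boot-disk-size 500 \
    --num-workers 10 \
    --packages gnomad \
    --max-idle=30m \
    --project maclab-ukbb \
    --requester-pays-allow-all \
    --properties 'spark:spark.executor.memoryOverhead=12g'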

Do you know what version you used to run this script in the past when it worked? I think we should run an experiment using that version to see if it works.

Yup, looks like 0.2.63.

Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.63-fee24a8ad25a
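
(For the experiment: hailctl spins up a cluster running whatever Hail version is installed locally, so one rough way to test on 0.2.63 is to pin the local install in a throwaway environment before starting the cluster. A minimal sketch, assuming a fresh environment is acceptable; the environment and cluster names here are placeholders:)

# Rough sketch: pin local Hail to 0.2.63 so hailctl deploys that version.
conda create -n hail-0263 python=3.7 -y
conda activate hail-0263
pip install hail==0.2.63
hailctl dataproc start kc-0263 \
    --project maclab-ukbb \
    --num-workers 10 \
    --requester-pays-allow-all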