Container exited with a non-zero exit code 134 with hl.summarize_variants

Hi Hail team!

I ran into an error running hl.summarize_variants(prepared_vcf_ht, show=False) on the gnomAD v3.1.2 release HT as part of our validity checks.

Traceback (most recent call last):                            (18 + 4) / 115376]
  File "/tmp/44ce72fd6f334daa949881b38b137d58/prepare_vcf_data_release.py", line 994, in <module>
    main(args)
  File "/tmp/44ce72fd6f334daa949881b38b137d58/prepare_vcf_data_release.py", line 888, in main
    var_summary = hl.summarize_variants(prepared_vcf_ht, show=False)
  File "<decorator-gen-1761>", line 2, in summarize_variants
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/methods/qc.py", line 1143, in summarize_variants
    (allele_types, nti, ntv), contigs, allele_counts, n_variants = ht.aggregate(
  File "<decorator-gen-1117>", line 2, in aggregate
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/table.py", line 1178, in aggregate
    return Env.backend().execute(agg_ir)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 20 times, most recent failure: Lost task 5.19 in stage 0.0 (TID 166) (chr22-w-1.c.broad-mpg-gnomad.internal executor 31): ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Container from a bad node: container_1633643382464_0006_01_000032 on host: chr22-w-1.c.broad-mpg-gnomad.internal. Exit status: 134. Diagnostics: [2021-10-07 23:24:09.819]Exception from container-launch.
Container id: container_1633643382464_0006_01_000032
Exit code: 134

[2021-10-07 23:24:09.821]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 22773 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/tmp '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' '-Dspark.driver.port=44539' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@chr22-m.c.broad-mpg-gnomad.internal:44539 --executor-id 31 --hostname chr22-w-1.c.broad-mpg-gnomad.internal --cores 4 --app-id application_1633643382464_0006 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stdout 2> /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stderr
Last 4096 bytes of stderr :


[2021-10-07 23:24:09.821]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 22773 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/tmp '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' '-Dspark.driver.port=44539' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@chr22-m.c.broad-mpg-gnomad.internal:44539 --executor-id 31 --hostname chr22-w-1.c.broad-mpg-gnomad.internal --cores 4 --app-id application_1633643382464_0006 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stdout 2> /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stderr
Last 4096 bytes of stderr :


.
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 20 times, most recent failure: Lost task 5.19 in stage 0.0 (TID 166) (chr22-w-1.c.broad-mpg-gnomad.internal executor 31): ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Container from a bad node: container_1633643382464_0006_01_000032 on host: chr22-w-1.c.broad-mpg-gnomad.internal. Exit status: 134. Diagnostics: [2021-10-07 23:24:09.819]Exception from container-launch.
Container id: container_1633643382464_0006_01_000032
Exit code: 134

[2021-10-07 23:24:09.821]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 22773 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/tmp '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' '-Dspark.driver.port=44539' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@chr22-m.c.broad-mpg-gnomad.internal:44539 --executor-id 31 --hostname chr22-w-1.c.broad-mpg-gnomad.internal --cores 4 --app-id application_1633643382464_0006 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stdout 2> /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stderr
Last 4096 bytes of stderr :


[2021-10-07 23:24:09.821]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 22773 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/tmp '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' '-Dspark.driver.port=44539' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@chr22-m.c.broad-mpg-gnomad.internal:44539 --executor-id 31 --hostname chr22-w-1.c.broad-mpg-gnomad.internal --cores 4 --app-id application_1633643382464_0006 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stdout 2> /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stderr
Last 4096 bytes of stderr :


.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2254)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2203)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2202)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2202)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2441)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2383)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2372)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:286)
	at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:28)
	at __C165Compiled.__m287split_StreamFor(Emit.scala)
	at __C165Compiled.__m277begin_group_0(Emit.scala)
	at __C165Compiled.__m249split_RunAgg(Emit.scala)
	at __C165Compiled.apply(Emit.scala)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$6(CompileAndEvaluate.scala:67)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:67)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:29)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:29)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:66)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:52)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:71)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:68)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:12)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:63)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:46)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$1(SparkBackend.scala:365)
	at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:362)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeJSON$1(SparkBackend.scala:406)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:404)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.77-684f32d73643
Error summary: SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 20 times, most recent failure: Lost task 5.19 in stage 0.0 (TID 166) (chr22-w-1.c.broad-mpg-gnomad.internal executor 31): ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Container from a bad node: container_1633643382464_0006_01_000032 on host: chr22-w-1.c.broad-mpg-gnomad.internal. Exit status: 134. Diagnostics: [2021-10-07 23:24:09.819]Exception from container-launch.
Container id: container_1633643382464_0006_01_000032
Exit code: 134

[2021-10-07 23:24:09.821]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 22773 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/tmp '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' '-Dspark.driver.port=44539' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@chr22-m.c.broad-mpg-gnomad.internal:44539 --executor-id 31 --hostname chr22-w-1.c.broad-mpg-gnomad.internal --cores 4 --app-id application_1633643382464_0006 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stdout 2> /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stderr
Last 4096 bytes of stderr :


[2021-10-07 23:24:09.821]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 22773 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/tmp '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' '-Dspark.driver.port=44539' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@chr22-m.c.broad-mpg-gnomad.internal:44539 --executor-id 31 --hostname chr22-w-1.c.broad-mpg-gnomad.internal --cores 4 --app-id application_1633643382464_0006 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1633643382464_0006/container_1633643382464_0006_01_000032/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stdout 2> /var/log/hadoop-yarn/userlogs/application_1633643382464_0006/container_1633643382464_0006_01_000032/stderr
Last 4096 bytes of stderr :


.
Driver stacktrace:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [44ce72fd6f334daa949881b38b137d58] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/44ce72fd6f334daa949881b38b137d58?project=broad-mpg-gnomad&region=us-central1
gcloud dataproc jobs wait '44ce72fd6f334daa949881b38b137d58' --region 'us-central1' --project 'broad-mpg-gnomad'
https://console.cloud.google.com/storage/browser/dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/65f118cd-a330-4c0c-bb08-de0d1a27f33f/jobs/44ce72fd6f334daa949881b38b137d58/
gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/65f118cd-a330-4c0c-bb08-de0d1a27f33f/jobs/44ce72fd6f334daa949881b38b137d58/driveroutput
Traceback (most recent call last):
  File "/usr/local/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
    jmp[args.module].main(args, pass_through_args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    gcloud.run(cmd)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'PycharmProjects/gnomad_qc/gnomad_qc/v3/create_release/prepare_vcf_data_release.py', '--cluster=chr22', '--files=', '--py-files=/var/folders/sj/6gr3x1553r5f7tkzsjy99mgs64sz0g/T/pyscripts_eqffej89.zip', '--properties=', '--', '--validity_check']' returned non-zero exit status 1.

Earlier this year I ran this on a very similar HT using Hail version 0.2.62-84fa81b9ea3d with the same cluster configuration and got no error. Any thoughts?

The log file is too big to attach, but I can email it.

Thank you!

-Julia

Hey @jkgoodrich!

Exit code 134 almost always means out of memory. Is this from this file on master of gnomad_qc? I can’t find that exact line. Can you point me at the code you’re currently executing?

What goes into the computation of prepared_vcf_ht? Is there a densify in there?

A simple but expensive thing to try is using highmem workers or, if that fails, a “Konrad special” (what Chris described in the other thread).

Hey Dan! Thanks for the very quick response!

So the place where it fails is here: gnomad_qc/prepare_vcf_data_release.py at 5df5876b51e1ee73d4f530220995ccb1fb63fc10 · broadinstitute/gnomad_qc · GitHub

However, I isolated it to this line in gnomad_methods: gnomad_methods/validity_checks.py at a389dfb1f0c0e7ff9a4dc7c05942474348f0d15a · broadinstitute/gnomad_methods · GitHub

So I simplified the code to just read in the table created here (gnomad_qc/prepare_vcf_data_release.py at 5df5876b51e1ee73d4f530220995ccb1fb63fc10 · broadinstitute/gnomad_qc · GitHub) and run hl.summarize_variants on it.
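
In other words, the simplified script boils down to roughly this sketch (the read path is illustrative):

import hail as hl

# Read the checkpointed prepared VCF HT back in and rerun only the failing call.
prepared_vcf_ht = hl.read_table("gs://gnomad-tmp/gnomad_v3.1.2_qc_data/vcf_prep2.ht")  # illustrative path
var_summary = hl.summarize_variants(prepared_vcf_ht, show=False)
print(var_summary)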

I did try highmem workers and it didn’t work, so I decided to go back to the Hail version where this worked last time (0.2.62) and try running it again (also modified to run only that line rather than the full validity checks). When I ran it on the prepared_vcf_ht created by Hail 0.2.77, it failed with the same error, but when I remade the prepared_vcf_ht with 0.2.62 it ran fine. There was no difference in the cluster; the only difference was the Hail version.

So on both versions I ran:

hailctl dataproc submit clustername PycharmProjects/gnomad_qc/gnomad_qc/v3/create_release/prepare_vcf_data_release.py --prepare_vcf_ht --validity_check --pyfiles PycharmProjects/gnomad_qc/gnomad_qc

0.2.62 results:

Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.62-84fa81b9ea3d
LOGGING: writing to /vcf_release.log
INFO (vcf_release 561): Starting preparation of VCF HT...
INFO (vcf_release 562): Adding non-PAR annotation...
INFO (vcf_release 569): Unfurling nested gnomAD frequency annotations and add to INFO field...
INFO (vcf_release 442): Unfurling freq data...
INFO (vcf_release 464): Adding popmax data...
INFO (vcf_release 475): Unfurling faf data...
INFO (vcf_release 489): Unfurling age hists...
INFO (vcf_release 603): Reformatting rsid...
INFO (vcf_release 609): Reformatting VEP annotation...
INFO (vcf_release 612): Constructing INFO field
INFO (vcf_release 648): Rearranging fields to desired order...
INFO (vcf_release 846): Checkpointing prepared VCF HT for validity checks and export...
[Stage 0:================================================>(115375 + 2) / 115376]2021-10-08 03:56:34 Hail: INFO: wrote table with 759302267 rows in 115376 partitions to gs://gnomad-tmp/gnomad_v3.1.2_qc_data/vcf_prep.ht
    Total size: 2.08 TiB
    * Rows: 2.08 TiB
    * Globals: 21.12 KiB
    * Smallest partition: 0 rows (21.00 B)
    * Largest partition:  99193 rows (218.99 MiB)
----------------------------------------
INFO (vcf_release 887): Beginning validity checks on the prepared VCF HT...
[Stage 1:================================================>(115375 + 2) / 115376]
Struct(allele_types={'Complex': 954, 'Deletion': 56664279, 'Insertion': 53629763, 'SNP': 649007271}, contigs={'chr11': 35078200, 'chr5': 46413277, 'chr22': 11606640, 'chr8': 39937920, 'chr19': 17290209, 'chrY': 1167367, 'chr1': 59159991, 'chr15': 22224163, 'chr12': 34331057, 'chr18': 19466619, 'chr20': 16354200, 'chr2': 62929395, 'chr13': 24993787, 'chr7': 42467077, 'chr14': 23566763, 'chr3': 51232827, 'chr17': 21943471, 'chr4': 50144242, 'chr6': 43702087, 'chr9': 33830205, 'chrX': 30290736, 'chr10': 35496856, 'chr21': 10958207, 'chr16': 24716971}, allele_counts={2: 759302267}, n_variants=759302267, r_ti_tv=1.5227983922109982)
INFO (vcf_release 967): Copying hail log to logging bucket...
2021-10-08 03:59:01 Hail: INFO: copying log to 'gs://gnomad-tmp/gnomad_v3.1.2_qc_data/logs/vcf_export.log'...
Job [01b8b041ab094c64aa2b16f1b263a293] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/217711e3-f815-4023-ac0b-153f09ed3255/jobs/01b8b041ab094c64aa2b16f1b263a293/
driverOutputResourceUri: gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/217711e3-f815-4023-ac0b-153f09ed3255/jobs/01b8b041ab094c64aa2b16f1b263a293/driveroutput
jobUuid: fcb41a02-8d8c-34f1-82c5-251d75048c2a
placement:
  clusterName: jg2
  clusterUuid: 217711e3-f815-4023-ac0b-153f09ed3255
pysparkJob:
  args:
  - --prepare_vcf_ht
  - --validity_check
  mainPythonFileUri: gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/217711e3-f815-4023-ac0b-153f09ed3255/jobs/01b8b041ab094c64aa2b16f1b263a293/staging/prepare_vcf_data_release.py
  pythonFileUris:
  - gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/217711e3-f815-4023-ac0b-153f09ed3255/jobs/01b8b041ab094c64aa2b16f1b263a293/staging/pyscripts_c0h93ypa.zip
reference:
  jobId: 01b8b041ab094c64aa2b16f1b263a293
  projectId: broad-mpg-gnomad
status:
  state: DONE
  stateStartTime: '2021-10-08T03:59:07.086188Z'
statusHistory:
- state: PENDING
  stateStartTime: '2021-10-08T03:33:34.083878Z'
- state: SETUP_DONE
  stateStartTime: '2021-10-08T03:33:34.124246Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2021-10-08T03:33:34.359443Z'
yarnApplications:
- name: Hail
  progress: 1.0
  state: FINISHED
  trackingUrl: http://jg2-m:8088/proxy/application_1633653700183_0009/

0.2.77 results:

Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.77-684f32d73643
LOGGING: writing to /vcf_release.log
INFO (vcf_release 561): Starting preparation of VCF HT...
INFO (vcf_release 562): Adding non-PAR annotation...
INFO (vcf_release 568): Unfurling nested gnomAD frequency annotations and add to INFO field...
INFO (vcf_release 442): Unfurling freq data...
INFO (vcf_release 464): Adding popmax data...
INFO (vcf_release 475): Unfurling faf data...
INFO (vcf_release 489): Unfurling age hists...
INFO (vcf_release 603): Reformatting rsid...
INFO (vcf_release 609): Reformatting VEP annotation...
INFO (vcf_release 612): Constructing INFO field
INFO (vcf_release 648): Rearranging fields to desired order...
INFO (vcf_release 845): Checkpointing prepared VCF HT for validity checks and export...
2021-10-08 03:29:54 Hail: INFO: wrote table with 759302267 rows in 115376 partitions to gs://gnomad-tmp/gnomad_v3.1.2_qc_data/vcf_prep2.ht
INFO (vcf_release 887): Beginning validity checks on the prepared VCF HT...
INFO (vcf_release 967): Copying hail log to logging bucket...9 + 1458) / 115376]
2021-10-08 03:30:11 Hail: INFO: copying log to 'gs://gnomad-tmp/gnomad_v3.1.2_qc_data/logs/vcf_export.log'...
Traceback (most recent call last):                        (1228 + 196) / 115376]
  File "/tmp/bd831e45e5cc4d2288a5e24e81038e50/prepare_vcf_data_release.py", line 1006, in <module>
    main(args)
  File "/tmp/bd831e45e5cc4d2288a5e24e81038e50/prepare_vcf_data_release.py", line 900, in main
    var_summary = hl.summarize_variants(prepared_vcf_ht.select().select_globals(), show=False)
  File "<decorator-gen-1761>", line 2, in summarize_variants
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/methods/qc.py", line 1143, in summarize_variants
    (allele_types, nti, ntv), contigs, allele_counts, n_variants = ht.aggregate(
  File "<decorator-gen-1117>", line 2, in aggregate
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/table.py", line 1178, in aggregate
    return Env.backend().execute(agg_ir)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: HailException: Premature end of file: expected 4 bytes, found 0

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 20 times, most recent failure: Lost task 9.19 in stage 1.0 (TID 119044) (jg1-sw-55xf.c.broad-mpg-gnomad.internal executor 64): is.hail.utils.HailException: Premature end of file: expected 4 bytes, found 0
	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:11)
	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:11)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.utils.richUtils.RichInputStream$.readFully$extension1(RichInputStream.scala:13)
	at is.hail.io.StreamBlockInputBuffer.readBlock(InputBuffers.scala:546)
	at is.hail.io.LZ4InputBlockBuffer.skipBytesReadRemainder(InputBuffers.scala:567)
	at is.hail.io.BlockingInputBuffer.skipBytes(InputBuffers.scala:508)
	at is.hail.io.LEB128InputBuffer.skipBytes(InputBuffers.scala:272)
	at __C5722collect_distributed_array.__m5792SKIP_r_binary(Unknown Source)
	at __C5722collect_distributed_array.__m5791SKIP_o_array_of_r_binary(Unknown Source)
	at __C5722collect_distributed_array.__m5782SKIP_r_struct_of_o_int32ANDo_int32ANDo_float64ANDo_binaryANDo_float64ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64AN(Unknown Source)
	at __C5722collect_distributed_array.__m5777DECODE_r_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryANDr_struct_of_o_int32ANDo_int32ANDo_float64ANDo_binaryANDo_float64ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo(Unknown Source)
	at __C5722collect_distributed_array.__m5772split_StreamFor(Unknown Source)
	at __C5722collect_distributed_array.__m5762begin_group_0(Unknown Source)
	at __C5722collect_distributed_array.__m5733split_RunAgg(Unknown Source)
	at __C5722collect_distributed_array.apply(Unknown Source)
	at __C5722collect_distributed_array.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$2(BackendUtils.scala:31)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:144)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$1(BackendUtils.scala:30)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:723)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2254)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2203)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2202)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2202)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2441)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2383)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2372)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:286)
	at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:28)
	at __C5589Compiled.__m5711split_StreamFor(Emit.scala)
	at __C5589Compiled.__m5701begin_group_0(Emit.scala)
	at __C5589Compiled.__m5673split_RunAgg(Emit.scala)
	at __C5589Compiled.apply(Emit.scala)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$6(CompileAndEvaluate.scala:67)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:67)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:29)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:29)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:66)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:52)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:71)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:68)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:12)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:63)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:46)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$1(SparkBackend.scala:365)
	at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:362)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeJSON$1(SparkBackend.scala:406)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:404)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

is.hail.utils.HailException: Premature end of file: expected 4 bytes, found 0
	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:11)
	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:11)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.utils.richUtils.RichInputStream$.readFully$extension1(RichInputStream.scala:13)
	at is.hail.io.StreamBlockInputBuffer.readBlock(InputBuffers.scala:546)
	at is.hail.io.LZ4InputBlockBuffer.skipBytesReadRemainder(InputBuffers.scala:567)
	at is.hail.io.BlockingInputBuffer.skipBytes(InputBuffers.scala:508)
	at is.hail.io.LEB128InputBuffer.skipBytes(InputBuffers.scala:272)
	at __C5722collect_distributed_array.__m5792SKIP_r_binary(Unknown Source)
	at __C5722collect_distributed_array.__m5791SKIP_o_array_of_r_binary(Unknown Source)
	at __C5722collect_distributed_array.__m5782SKIP_r_struct_of_o_int32ANDo_int32ANDo_float64ANDo_binaryANDo_float64ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64AN(Unknown Source)
	at __C5722collect_distributed_array.__m5777DECODE_r_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryANDr_struct_of_o_int32ANDo_int32ANDo_float64ANDo_binaryANDo_float64ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo_int32ANDo_int32ANDo_float64ANDo_int32ANDo(Unknown Source)
	at __C5722collect_distributed_array.__m5772split_StreamFor(Unknown Source)
	at __C5722collect_distributed_array.__m5762begin_group_0(Unknown Source)
	at __C5722collect_distributed_array.__m5733split_RunAgg(Unknown Source)
	at __C5722collect_distributed_array.apply(Unknown Source)
	at __C5722collect_distributed_array.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$2(BackendUtils.scala:31)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:144)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$1(BackendUtils.scala:30)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:723)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)




Hail version: 0.2.77-684f32d73643
Error summary: HailException: Premature end of file: expected 4 bytes, found 0
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [bd831e45e5cc4d2288a5e24e81038e50] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/bd831e45e5cc4d2288a5e24e81038e50?project=broad-mpg-gnomad&region=us-central1
gcloud dataproc jobs wait 'bd831e45e5cc4d2288a5e24e81038e50' --region 'us-central1' --project 'broad-mpg-gnomad'
https://console.cloud.google.com/storage/browser/dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/1cb69512-dd9f-4878-863c-997c97f51dcb/jobs/bd831e45e5cc4d2288a5e24e81038e50/
gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/1cb69512-dd9f-4878-863c-997c97f51dcb/jobs/bd831e45e5cc4d2288a5e24e81038e50/driveroutput
Traceback (most recent call last):
  File "/usr/local/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
    jmp[args.module].main(args, pass_through_args)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    gcloud.run(cmd)
  File "/usr/local/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'PycharmProjects/gnomad_qc/gnomad_qc/v3/create_release/prepare_vcf_data_release.py', '--cluster=jg1', '--files=', '--py-files=/var/folders/sj/6gr3x1553r5f7tkzsjy99mgs64sz0g/T/pyscripts_limj98_f.zip', '--properties=', '--', '--prepare_vcf_ht', '--validity_check']' returned non-zero exit status 1.

The error does look different though. I will send both log files.

Thanks!

Wow, thank you for the detailed investigation and information. I’m really sorry the latest version of Hail introduced this bug. We will dig into the cause.

Do I correctly understand that you’ve worked around the Hail bug in 0.2.77 by using 0.2.62? Or do you need to do this work with Hail 0.2.77?

EDIT: Have you tried (can you try) reading the file made by 0.2.62 using 0.2.77? That will help us isolate the issue. Thank you kindly!

EDIT2: Julia, do you still have the table created by Hail 0.2.77? Can you grant the Hail team (me, John C, Chris V, and Tim P) access to that Hail table for further debugging? Also, if you still have the table made by 0.2.62, can we have access to that one as well?

I think it will work to just use 0.2.62, so I can move forward; I’m trying to run the whole thing now.

Yes, hl.summarize_variants works fine using 0.2.77 on the table created with 0.2.62.

I still have both tables, and I think I gave you all access; you, Chris, and Tim already had access, I believe.

Created with 0.2.62:
gs://gnomad-tmp/gnomad_v3.1.2_qc_data/vcf_prep.ht

Created with 0.2.77:
gs://gnomad-tmp/gnomad_v3.1.2_qc_data/vcf_prep2.ht

Note that these were created yesterday and will be automatically deleted after 14 days.
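
For reference, the comparison I ran boils down to this sketch: read each checkpoint back in and rerun the same aggregation (per the runs above, the 0.2.62-written table succeeds and the 0.2.77-written one fails):

import hail as hl

for path in [
    "gs://gnomad-tmp/gnomad_v3.1.2_qc_data/vcf_prep.ht",   # written with 0.2.62
    "gs://gnomad-tmp/gnomad_v3.1.2_qc_data/vcf_prep2.ht",  # written with 0.2.77
]:
    ht = hl.read_table(path)
    # summarize_variants completes on the first table and errors on the second.
    print(path, hl.summarize_variants(ht, show=False))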

Thank you Dan!

I found a difference between 0.2.62 and 0.2.77 that leads to an error on 0.2.62, but not 0.2.77.

Since the last time I ran the code, we added this line: gnomad_methods/validity_checks.py at ee3a4fbe6aa77a59274e53ba4ff02bcd5417a2f3 · broadinstitute/gnomad_methods · GitHub

This works on 0.2.77, but not on 0.2.62. Do you think it’s OK to use the HT created by 0.2.62, but run the rest of the code with 0.2.77?

INFO (gnomad.assessment.validity_checks 915): VARIANT FILTER SUMMARIES:
[Stage 2:=================================================>         (5 + 1) / 6]INFO (vcf_release 953): Copying hail log to logging bucket...
2021-10-08 15:29:37 Hail: INFO: copying log to 'gs://gnomad-tmp/gnomad_v3.1.2_qc_data/logs/vcf_export.log'...
Traceback (most recent call last):
  File "/tmp/69eab6361fcc4f30b60a6a7a3eb2f135/prepare_vcf_data_release.py", line 992, in <module>
    main(args)
  File "/tmp/69eab6361fcc4f30b60a6a7a3eb2f135/prepare_vcf_data_release.py", line 897, in main
    single_filter_count=True,
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/assessment/validity_checks.py", line 921, in validate_release_t
    monoallelic_expr,
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/assessment/validity_checks.py", line 257, in summarize_variant_filters
    filters = t.aggregate(hl.agg.counter(t.filters))
  File "<decorator-gen-1089>", line 2, in aggregate
  File "/opt/conda/default/lib/python3.6/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.6/site-packages/hail/table.py", line 1178, in aggregate
    return Env.backend().execute(agg_ir)
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/py4j_backend.py", line 75, in execute
    value = ir.typ._from_json(result['value'])
  File "/opt/conda/default/lib/python3.6/site-packages/hail/expr/types.py", line 253, in _from_json
    return self._convert_from_json_na(x)
  File "/opt/conda/default/lib/python3.6/site-packages/hail/expr/types.py", line 259, in _convert_from_json_na
    return self._convert_from_json(x)
  File "/opt/conda/default/lib/python3.6/site-packages/hail/expr/types.py", line 997, in _convert_from_json
    elt in x}
  File "/opt/conda/default/lib/python3.6/site-packages/hail/expr/types.py", line 997, in <dictcomp>
    elt in x}
TypeError: unhashable type: 'set'
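
For what it’s worth, the pattern that trips this up appears to be counting a set-typed field (filters is a set<str>). A hypothetical minimal reproduction of that pattern, not our actual code, would be:

import hail as hl

# Hypothetical stand-in for the release HT's set-typed `filters` field.
ht = hl.utils.range_table(4)
ht = ht.annotate(
    filters=hl.if_else(ht.idx % 2 == 0, hl.set(["PASS"]), hl.set(["AC0", "RF"]))
)
# Counting a set-typed field: presumably the same pattern that raises
# "TypeError: unhashable type: 'set'" above on 0.2.62 but works on 0.2.77.
print(ht.aggregate(hl.agg.counter(ht.filters)))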

Yes, newer versions of Hail 0.2 can read files made by older versions of Hail 0.2, so using 0.2.77 on the 0.2.62 file is fine.

@jkgoodrich, this is proving difficult to debug. How difficult would it be for us to rerun the process that generated the tables you shared? We wouldn’t need to do it for all the variants, just the first few partitions as a sample.
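
For example, something along these lines at the top of the prep script would give us a small sample (a sketch: the path is a placeholder, and _filter_partitions is an internal Hail helper, with ht.head(n) as an alternative):

import hail as hl

# Read the input to the prep step, keep only its first few partitions, then
# run the same prepare-VCF-HT code on this subset and checkpoint the result.
ht = hl.read_table("gs://my-bucket/release_input.ht")  # placeholder path
ht_sample = ht._filter_partitions(list(range(5)))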