Hi Hail team!
I’m trying to run this script (on the freeze_7 branch) to re-export the 455K UKBB VCFs. I’m running into the same error Julia reported here when trying to generate the VCF MatrixTable:
Traceback (most recent call last):
File "/tmp/aab33e95af5b4042a1ce5ae82eab0362/prepare_vcf_data_release.py", line 827, in <module>
main(args)
File "/tmp/aab33e95af5b4042a1ce5ae82eab0362/prepare_vcf_data_release.py", line 620, in main
mt.write(
File "<decorator-gen-1257>", line 2, in write
File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return __original_func(*args_, **kwargs_)
File "/opt/conda/default/lib/python3.8/site-packages/hail/matrixtable.py", line 2544, in write
Env.backend().execute(ir.MatrixWrite(self._mir, writer))
File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 110, in execute
raise e
File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 86, in execute
result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 29, in deco
raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 2333 in stage 37.0 failed 20 times, most recent failure: Lost task 2333.23 in stage 37.0 (TID 68209) (kc-sw-mv41.c.maclab-ukbb.internal executor 9488): ExecutorLostFailure (executor 9488 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634933184806_0003_01_009505 on host: kc-sw-mv41.c.maclab-ukbb.internal. Exit status: 137. Diagnostics: [2021-10-23 00:07:44.351]Container killed on request. Exit code is 137
[2021-10-23 00:07:44.351]Container exited with a non-zero exit code 137.
[2021-10-23 00:07:44.352]Killed by external signal
.
Driver stacktrace:
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2333 in stage 37.0 failed 20 times, most recent failure: Lost task 2333.23 in stage 37.0 (TID 68209) (kc-sw-mv41.c.maclab-ukbb.internal executor 9488): ExecutorLostFailure (executor 9488 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634933184806_0003_01_009505 on host: kc-sw-mv41.c.maclab-ukbb.internal. Exit status: 137. Diagnostics: [2021-10-23 00:07:44.351]Container killed on request. Exit code is 137
[2021-10-23 00:07:44.351]Container exited with a non-zero exit code 137.
[2021-10-23 00:07:44.352]Killed by external signal
.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2254)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2203)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2202)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2202)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2441)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2383)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2372)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
at is.hail.rvd.RVD.writeRowsSplit(RVD.scala:945)
at is.hail.expr.ir.MatrixValue.write(MatrixValue.scala:258)
at is.hail.expr.ir.MatrixNativeWriter.apply(MatrixWriter.scala:72)
at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:46)
at is.hail.expr.ir.Interpret$.run(Interpret.scala:852)
at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:57)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:417)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:414)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:413)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.78-b17627756568
Error summary: SparkException: Job aborted due to stage failure: Task 2333 in stage 37.0 failed 20 times, most recent failure: Lost task 2333.23 in stage 37.0 (TID 68209) (kc-sw-mv41.c.maclab-ukbb.internal executor 9488): ExecutorLostFailure (executor 9488 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634933184806_0003_01_009505 on host: kc-sw-mv41.c.maclab-ukbb.internal. Exit status: 137. Diagnostics: [2021-10-23 00:07:44.351]Container killed on request. Exit code is 137
[2021-10-23 00:07:44.351]Container exited with a non-zero exit code 137.
[2021-10-23 00:07:44.352]Killed by external signal
.
Driver stacktrace:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [aab33e95af5b4042a1ce5ae82eab0362] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/aab33e95af5b4042a1ce5ae82eab0362?project=maclab-ukbb&region=us-central1
gcloud dataproc jobs wait 'aab33e95af5b4042a1ce5ae82eab0362' --region 'us-central1' --project 'maclab-ukbb'
https://console.cloud.google.com/storage/browser/dataproc-1aca38e4-67fe-4b64-b451-258ef1aea4d1-us-central1/google-cloud-dataproc-metainfo/7d3d450b-655d-4c19-a713-0d7fb2e8d196/jobs/aab33e95af5b4042a1ce5ae82eab0362/
gs://dataproc-1aca38e4-67fe-4b64-b451-258ef1aea4d1-us-central1/google-cloud-dataproc-metainfo/7d3d450b-655d-4c19-a713-0d7fb2e8d196/jobs/aab33e95af5b4042a1ce5ae82eab0362/driveroutput
Traceback (most recent call last):
File "/Users/kchao/anaconda3/envs/hail/bin/hailctl", line 8, in <module>
sys.exit(main())
File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
cli.main(args)
File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
jmp[args.module].main(args, pass_through_args)
File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
gcloud.run(cmd)
File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
return subprocess.check_call(["gcloud"] + command)
File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/subprocess.py", line 328, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'prepare_vcf_data_release.py', '--cluster=kc', '--files=', '--py-files=/var/folders/xq/8jnhrt2s2h58ts2v0br5g8gm0000gp/T/pyscripts_5umzcx7s.zip', '--properties=', '--', '--prepare_vcf_mt', '--slack_channel', '@kc (she/her)']' returned non-zero exit status 1.
I used this cluster config:
hailctl dataproc start kc --autoscaling-policy=autoscale_densify --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --master-boot-disk-size 600 --preemptible-worker-boot-disk-size 500 --worker-boot-disk-size 500 --project maclab-ukbb --max-idle=30m --packages gnomad --num-workers 10 --requester-pays-allow-all
with this autoscaling policy: autoscale_densify
This is the path to the log: gs://broad-ukbb/broad.freeze_7/temp/logs (I can also email the log; it’s 1GB).
Do you think updating my cluster configuration would help with this exit code 137 error? I’d appreciate any help!
P.S. Looking back on my notes, this step had some memory errors when I last ran it in March, specifically with this line: mt = hl.filter_intervals(mt, [hl.parse_locus_interval("chrM")], keep=False). Once the issues associated with filter_intervals were fixed, however, I was able to create the VCF MatrixTable in 1 hour and 56 minutes.
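For context, here is roughly what that step looks like. This is only a minimal sketch, not the actual prepare_vcf_data_release.py code: the paths are placeholders and I'm assuming GRCh38 as the reference genome.

import hail as hl

# Minimal sketch only: placeholder paths, GRCh38 assumed as the default reference.
hl.init(default_reference="GRCh38")

mt = hl.read_matrix_table("gs://my-bucket/ukbb_455k_raw.mt")  # hypothetical input path

# Drop the mitochondrial contig (the line that hit memory errors back in March)
mt = hl.filter_intervals(mt, [hl.parse_locus_interval("chrM")], keep=False)

# The mt.write (prepare_vcf_data_release.py line 620 in the traceback above) that now fails with exit code 137
mt.write("gs://my-bucket/ukbb_455k_vcf_ready.mt", overwrite=True)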