Exporting BGEN from the VDS for 14 million variants and 414k samples of AoU

Hi team!

I'm asking for some insights/tips.

I have filtered down to around 14 million variants for 414k samples on chr6 of the AoU VDS, and I am trying to export them to .bgen format. The code snippet is as follows:

# Build an rsid of the form "contig:pos:ref:alt" for each row
mt_bgen = mt_plink.annotate_rows(rsid=hl.delimit(
    [mt_plink.locus.contig, hl.str(mt_plink.locus.position),
     mt_plink.alleles[0], mt_plink.alleles[1]], ':'))
# One-hot GP lookup table indexed by the number of alt alleles (hard calls)
gp_values = hl.literal([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
mt_bgen = mt_bgen.annotate_entries(gp=gp_values[mt_bgen.GT.n_alt_alleles()])
out_bgen = f'{bucket}/data1/test1_bgen'
hl.export_bgen(mt_bgen, out_bgen, gp=mt_bgen.gp)
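For completeness, one workaround I have been considering (untested): checkpointing the annotated MatrixTable before the export, so that the shuffle triggered by the key change (visible in the trace below as lowerDistributedSort / changeKey) is materialized before the BGEN writer starts. The checkpoint path is just a placeholder in my bucket:

# Untested sketch: write the MatrixTable out first so the sort/shuffle
# completes before export; the path below is a placeholder
ckpt = f'{bucket}/data1/test1_bgen_ckpt.mt'
mt_bgen = mt_bgen.checkpoint(ckpt, overwrite=True)
hl.export_bgen(mt_bgen, out_bgen, gp=mt_bgen.gp)

I have not tried this on the full chr6 data yet, so I'm not sure whether export_bgen would re-trigger the sort anyway.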

and the Spark configuration is as follows:

import hail as hl

hl.init(
    default_reference="GRCh38",
    spark_conf={
        "spark.executor.memory": "20g",
        "spark.executor.memoryOverhead": "1024",
        "spark.executor.cores": "8",
        "spark.driver.cores": "6",
        "spark.executor.instances": "11",
        "spark.sql.shuffle.partitions": str(11 * 8 * 3),  # 132
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        "spark.speculation": "true",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": "1024m"
    }
)
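From what I understand, the exit code 137 in the failure below means the container was SIGKILLed, typically for exceeding its memory allotment; and since spark.executor.memoryOverhead is interpreted as MiB when given as a plain number, the "1024" above leaves only about 1 GiB of off-heap headroom. An alternative I have been considering (the values below are guesses, not something I've verified):

# Alternative settings I'm considering; values are guesses, not verified
hl.init(
    default_reference="GRCh38",
    spark_conf={
        "spark.executor.memory": "16g",
        # larger off-heap overhead, since (as far as I know) Hail
        # allocates its region memory off-heap
        "spark.executor.memoryOverhead": "4g",
        # fewer cores per executor -> fewer concurrent tasks per JVM heap
        "spark.executor.cores": "4",
        "spark.executor.instances": "11",
    },
)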

The Spark job was able to run 33 stages, but it failed in the 34th stage. The stack trace is:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4131 in stage 34.0 failed 4 times, most recent failure: Lost task 4131.3 in stage 34.0 (TID 24188) (all-of-us-27572-sw-z5qr.us-central1-b.c.terra-vpc-sc-6e60477a.internal executor 31): ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Container from a bad node: container_e01_1747986637838_0004_01_000034 on host: all-of-us-27572-sw-z5qr.us-central1-b.c.terra-vpc-sc-6e60477a.internal. Exit status: 137. Diagnostics: [2025-05-24 01:55:56.103]Container killed on request. Exit code is 137
[2025-05-24 01:55:56.127]Container exited with a non-zero exit code 137. 
[2025-05-24 01:55:56.129]Killed by external signal
.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2452)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2473)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2505)
	at is.hail.sparkextras.ContextRDD.crunJobWithIndex(ContextRDD.scala:239)
	at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1117)
	at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1199)
	at is.hail.rvd.RVD$.coerce(RVD.scala:1157)
	at is.hail.rvd.RVD.changeKey(RVD.scala:146)
	at is.hail.rvd.RVD.changeKey(RVD.scala:139)
	at is.hail.backend.spark.SparkBackend.lowerDistributedSort(SparkBackend.scala:568)
	at is.hail.backend.Backend.lowerDistributedSort(Backend.scala:115)
	at is.hail.expr.ir.lowering.LowerAndExecuteShuffles$.$anonfun$apply$1(LowerAndExecuteShuffles.scala:19)
	at is.hail.expr.ir.RewriteBottomUp$.$anonfun$apply$2(RewriteBottomUp.scala:11)
	at is.hail.utils.StackSafe$More.advance(StackSafe.scala:60)
	at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
	at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
	at is.hail.expr.ir.RewriteBottomUp$.apply(RewriteBottomUp.scala:21)
	at is.hail.expr.ir.lowering.LowerAndExecuteShuffles$.apply(LowerAndExecuteShuffles.scala:16)
	at is.hail.expr.ir.lowering.LowerAndExecuteShufflesPass.transform(LoweringPass.scala:180)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:37)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:98)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:37)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:98)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:35)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:34)
	at is.hail.expr.ir.lowering.LowerAndExecuteShufflesPass.apply(LoweringPass.scala:174)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$2(LoweringPipeline.scala:22)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$2$adapted(LoweringPipeline.scala:20)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:20)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:98)
	at is.hail.backend.ExecuteContext.time(ExecuteContext.scala:183)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:11)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$1(CompileAndEvaluate.scala:48)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:98)
	at is.hail.backend.ExecuteContext.time(ExecuteContext.scala:183)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$1(SparkBackend.scala:550)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:98)
	at is.hail.backend.ExecuteContext.time(ExecuteContext.scala:183)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:539)
	at is.hail.backend.BackendHttpHandler.$anonfun$handle$4(BackendServer.scala:93)
	at is.hail.utils.package$.using(package.scala:673)
	at is.hail.backend.ExecuteContext.local(ExecuteContext.scala:220)
	at is.hail.backend.BackendHttpHandler.$anonfun$handle$3(BackendServer.scala:91)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:15)
	at is.hail.backend.BackendHttpHandler.$anonfun$handle$2(BackendServer.scala:90)
	at is.hail.backend.BackendHttpHandler.$anonfun$handle$2$adapted(BackendServer.scala:89)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:100)
	at is.hail.utils.package$.using(package.scala:673)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:100)
	at is.hail.utils.package$.using(package.scala:673)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:166)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$1(ExecuteContext.scala:83)
	at is.hail.utils.package$.using(package.scala:673)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:13)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:82)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:406)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:15)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:22)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:387)
	at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:89)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
	at jdk.httpserver/sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:82)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:80)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:848)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:817)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:201)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:560)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:525)
	at java.base/java.lang.Thread.run(Thread.java:829)



Hail version: 0.2.134-952ae203dbbe
Error summary: SparkException: Job aborted due to stage failure: Task 4131 in stage 34.0 failed 4 times, most recent failure: Lost task 4131.3 in stage 34.0 (TID 24188) (all-of-us-27572-sw-z5qr.us-central1-b.c.terra-vpc-sc-6e60477a.internal executor 31): ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Container from a bad node: container_e01_1747986637838_0004_01_000034 on host: all-of-us-27572-sw-z5qr.us-central1-b.c.terra-vpc-sc-6e60477a.internal. Exit status: 137. Diagnostics: [2025-05-24 01:55:56.103]Container killed on request. Exit code is 137
[2025-05-24 01:55:56.127]Container exited with a non-zero exit code 137. 
[2025-05-24 01:55:56.129]Killed by external signal
.

Please help me figure out how to resolve this error and what an optimized configuration for this export would look like. Thanks!