Trouble with vcf_combiner on gcloud dataproc cluster

I am trying to run Hail on a Google Cloud Dataproc cluster and am having trouble figuring out the cause of the error messages I am receiving. I have 5 GVCFs that have been subset to just chr1, and I am running the following:

$ hailctl dataproc start my_cluster_A  --project=hail  --debug-mode --master-machine-type=n1-highmem-32 --region=us-central1 --max-idle 1h --autoscaling-policy=20k-preemptibles --packages gnomad --master-boot-disk-size 1000

$ gcloud dataproc jobs submit pyspark test_combiner.py \
        --cluster=my_cluster_A \
        --project hail \
        --files=gs://iw-hail/000-gvcfs/chr1/chr1_sample_map.tsv \
        --region=us-central1 \
        --driver-log-levels root=WARN \
        -- \
        -g gs://iw-hail/000-metadata-files/header.g.vcf.gz \
        -s gs://iw-hail/chr1/chr1_sample_map.tsv  \
        -c gs://iw-hail/000-outputs/20211021.chr1.mt \
        -t gs://iw-hail/000-temp/tmp/ \
        -o

The test_combiner.py script looks like this:

import hail as hl
import argparse

hl.init(log='/home/hail/combiner.log')

def get_args():
    argparser = argparse.ArgumentParser(description=__doc__)
    argparser.add_argument("--sample_map", "-s", required=True)
    argparser.add_argument("--output_cloud_path", "-c", required=True)
    argparser.add_argument("--tmp_bucket", "-t", required=True)
    argparser.add_argument("--gvcf_header_file", "-g", required=True)
    argparser.add_argument("--overwrite_existing", "-o", action='store_true')
    return argparser.parse_args()

def get_gvcf_and_sample_from_map(sample_map):
    gvcfs = []
    samples = []
    with hl.hadoop_open(sample_map, 'r') as f:
        for line in f:
            # each line of the sample map is "<sample name>\t<path to gvcf>"
            (sample, gvcf) = line.rstrip().split('\t')
            gvcfs.append(gvcf)
            samples.append(sample)
    return gvcfs, samples

if __name__ == "__main__":
    args = get_args()
    gvcf_list, samples_list = get_gvcf_and_sample_from_map(args.sample_map)
    hl.experimental.run_combiner(
        gvcf_list,
        sample_names=samples_list,
        header=args.gvcf_header_file,
        out_file=args.output_cloud_path,
        tmp_path=args.tmp_bucket,
        key_by_locus_and_alleles=True,
        overwrite=args.overwrite_existing,
        reference_genome='GRCh38',
        use_genome_default_intervals=True,
        target_records=10000
    )
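
For reference, the sample map passed with -s is a plain two-column, tab-separated file mapping sample name to GVCF path, along these lines (the names and paths below are just placeholders, not my real data):

sample_01	gs://my-bucket/gvcfs/sample_01.chr1.g.vcf.bgz
sample_02	gs://my-bucket/gvcfs/sample_02.chr1.g.vcf.bgz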

The output looks like this:

Job [eed02ccac5ff4641899791e464d3e738] submitted.
Waiting for job output...
Running on Apache Spark version 3.1.1
SparkUI available at http://my_cluster_A-m.c.hail.internal:44593
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.77-684f32d73643
LOGGING: writing to /home/hail/combiner.log
2021-10-21 16:10:21 Hail: INFO: Using 2586 intervals with default whole-genome size 1200000 as partitioning for GVCF import
2021-10-21 16:10:21 Hail: INFO: GVCF combiner plan:
    Branch factor: 100
    Phase 1 batch size: 100
    Combining 5 input files in 1 phases with 1 total jobs.
        Phase 1: 1 job corresponding to 1 final output file.

2021-10-21 16:10:21 Hail: INFO: Starting phase 1/1, merging 5 input GVCFs in 1 job.
2021-10-21 16:10:21 Hail: INFO: Starting phase 1/1, job 1/1 to create 1 merged file, corresponding to ~100.0% of total I/O.
[Stage 0:>                                                       (0 + 8) / 2586]
Traceback (most recent call last):
  File "/tmp/eed02ccac5ff4641899791e464d3e738/test_combiner.py", line 28, in <module>
    hl.experimental.run_combiner(
  File "/opt/conda/default/lib/python3.8/site-packages/hail/experimental/vcf_combiner/vcf_combiner.py", line 708, in run_combiner
    final_mt.write(out_file, overwrite=overwrite)
  File "<decorator-gen-1257>", line 2, in write
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/matrixtable.py", line 2529, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 63 in stage 0.0 failed 20 times, most recent failure: Lost task 63.19 in stage 0.0 (TID 4312) (my_cluster_A-sw-m1kj.c.-hail.internal executor 737): ExecutorLostFailure (executor 737 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634832053408_0002_01_000744 on host: my_cluster_A-sw-m1kj.c.-hail.internal. Exit status: 134. Diagnostics: [2021-10-21 16:16:52.668]Exception from container-launch.
Container id: container_1634832053408_0002_01_000744
Exit code: 134

[2021-10-21 16:16:52.670]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1:  5475 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :


[2021-10-21 16:16:52.671]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1:  5475 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :


.
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 63 in stage 0.0 failed 20 times, most recent failure: Lost task 63.19 in stage 0.0 (TID 4312) (my_cluster_A-sw-m1kj.c.-hail.internal executor 737): ExecutorLostFailure (executor 737 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634832053408_0002_01_000744 on host: my_cluster_A-sw-m1kj.c.-hail.internal. Exit status: 134. Diagnostics: [2021-10-21 16:16:52.668]Exception from container-launch.
Container id: container_1634832053408_0002_01_000744
Exit code: 134

[2021-10-21 16:16:52.670]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1:  5475 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :


[2021-10-21 16:16:52.671]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1:  5475 Aborted                 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx12022m '-Xss4M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:-OmitStackTraceInFastThrow' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/tmp '-Dspark.driver.port=37383' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@my_cluster_A-m.c.-hail.internal:37383 --executor-id 737 --hostname my_cluster_A-sw-m1kj.c.-hail.internal --cores 4 --app-id application_1634832053408_0002 --resourceProfileId 0 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1634832053408_0002/container_1634832053408_0002_01_000744/hail-all-spark.jar > /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stdout 2> /var/log/hadoop-yarn/userlogs/application_1634832053408_0002/container_1634832053408_0002_01_000744/stderr
Last 4096 bytes of stderr :

.
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2254)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2203)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2202)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2202)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2441)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2383)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2372)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
        at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:286)
        at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:28)
        at __C57Compiled.__m93split_Let(Emit.scala)
        at __C57Compiled.apply(Emit.scala)
        at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$3(CompileAndEvaluate.scala:56)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:56)
        at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:29)
        at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:29)
        at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:66)
        at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:71)
        at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:68)
        at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:15)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:15)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:13)
        at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:12)
        at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:63)
        at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:14)
        at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:12)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
        at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:46)
        at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
        at is.hail.backend.spark.SparkBackend.$anonfun$execute$1(SparkBackend.scala:365)
        at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
        at is.hail.utils.package$.using(package.scala:638)
        at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
        at is.hail.utils.package$.using(package.scala:638)
        at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
        at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
        at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
        at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:362)
        at is.hail.backend.spark.SparkBackend.$anonfun$executeJSON$1(SparkBackend.scala:406)
        at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
        at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:404)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.77-684f32d73643
Error summary: SparkException: Job aborted due to stage failure: Task 63 in stage 0.0 failed 20 times, most recent failure: Lost task 63.19 in stage 0.0 (TID 4312) (my_cluster_A-sw-m1kj.c.-hail.internal executor 737): ExecutorLostFailure (executor 737 exited caused by one of the running tasks) Reason: Container from a bad node: container_1634832053408_0002_01_000744 on host: my_cluster_A-sw-m1kj.c.-hail.internal. Exit status: 134. Diagnostics: [2021-10-21 16:16:52.668]Exception from container-launch.
Container id: container_1634832053408_0002_01_000744
Exit code: 134

Driver stacktrace:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [eed02ccac5ff4641899791e464d3e738] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/eed02ccac5ff4641899791e464d3e738?project=-hail&region=us-central1
gcloud dataproc jobs wait 'eed02ccac5ff4641899791e464d3e738' --region 'us-central1' --project '-hail'
https://console.cloud.google.com/storage/browser/dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/eed02ccac5ff4641899791e464d3e738/
gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/eed02ccac5ff4641899791e464d3e738/driveroutput	

The error codes seem to suggest that I am running low on memory, but increasing the size of the cluster's master machine, the boot disk size, and the autoscaling policy gives seemingly the same error logs. I also wouldn't expect five ~500 MB GVCFs to be memory hungry at all.
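
If it really were memory, I assume I could also give the executors more headroom directly through hl.init's spark_conf argument instead of resizing the cluster; a rough sketch (the values are placeholders, not tuned):

import hail as hl

# Sketch only: these settings take effect only when hl.init is the one
# creating the SparkContext, and the values below are just guesses.
hl.init(
    log='/home/hail/combiner.log',
    spark_conf={
        'spark.executor.memory': '20g',
        'spark.executor.memoryOverhead': '4g',
    },
)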

I have verified that I am able to run Hail in this environment:

$ cat test_py.py
import hail as hl
mt = hl.balding_nichols_model(n_populations=3, n_samples=1000, n_variants=1000, n_partitions=64)
mt = hl.variant_qc(mt)
mt = hl.sample_qc(mt)
mt._force_count_rows()

$ gcloud dataproc jobs submit pyspark test_py.py --cluster=my_cluster --project hail --region=us-central1 --driver-log-levels root=WARN
Job [4c06a50c09924b558b989164f1814455] submitted.
Waiting for job output...
Initializing Hail with default parameters...
Running on Apache Spark version 3.1.1
SparkUI available at http://andersoncallsete-m.c.strokeanderson-hail.internal:36263
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.77-684f32d73643
LOGGING: writing to /home/hail/hail-20211021-1707-0.2.77-684f32d73643.log
2021-10-21 17:07:21 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 1000 samples, and 1000 variants...
2021-10-21 17:07:32 Hail: INFO: Coerced sorted dataset            (26 + 8) / 64]
Job [4c06a50c09924b558b989164f1814455] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/4c06a50c09924b558b989164f1814455/
driverOutputResourceUri: gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/4c06a50c09924b558b989164f1814455/driveroutput
jobUuid: 0f65b286-95ce-3693-a6cf-c5a53fa96b5d
placement:
  clusterName: andersoncallsete
  clusterUuid: 76ae5b0a-a898-4d74-b08b-39c8573307e9
pysparkJob:
  loggingConfig:
    driverLogLevels:
      root: WARN
  mainPythonFileUri: gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/76ae5b0a-a898-4d74-b08b-39c8573307e9/jobs/4c06a50c09924b558b989164f1814455/staging/test_py.py
reference:
  jobId: 4c06a50c09924b558b989164f1814455
  projectId: strokeanderson-hail
status:
  state: DONE
  stateStartTime: '2021-10-21T17:07:36.496212Z'
statusHistory:
- state: PENDING
  stateStartTime: '2021-10-21T17:07:09.132834Z'
- state: SETUP_DONE
  stateStartTime: '2021-10-21T17:07:09.181490Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2021-10-21T17:07:09.407719Z'
yarnApplications:
- name: Hail
  progress: 1.0
  state: FINISHED
  trackingUrl: http://andersoncallsete-m:8088/proxy/application_1634832053408_0004/

Is there any other stack trace I should be looking at?

Could you include this line after the init to see if it fixes things:

hl._set_flags(no_whole_stage_codegen='1')
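
In the script above that would go right after the hl.init call, e.g.:

import hail as hl

hl.init(log='/home/hail/combiner.log')
# suggested workaround: disable whole-stage code generation
hl._set_flags(no_whole_stage_codegen='1')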

Thank you! Everything seems to work so far.