Error in calling vcf_combiner

Hi!

I am running into a java.lang.ArrayIndexOutOfBoundsException error when calling Hail's GVCF combiner (hail.experimental.run_combiner). I've attached the full error output below. Do you have any suggestions for diagnosing what might be causing it? As far as I can tell, my input files are properly formatted.

Here is a rough copy of my code:

import hail as hl

# ....................#
## Parse user inputs ##
# ....................#

hl.experimental.run_combiner(
    gvcf_list,
    sample_names=samples_list,
    header=args.gvcf_header_file,
    out_file=args.output_cloud_path,
    tmp_path=args.tmp_bucket,
    key_by_locus_and_alleles=True,
    overwrite=args.overwrite_existing,
    reference_genome='GRCh38',
    use_exome_default_intervals=True,
    target_records=10000
)
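For context, the elided "Parse user inputs" block builds gvcf_list and samples_list. A minimal sketch of how that might look, assuming the sample map passed with -s is a two-column TSV of sample name and GVCF path (the exact format is a hypothetical, not confirmed by this thread):

```python
import csv

def parse_sample_map(path):
    """Parse a two-column TSV (sample_name<TAB>gvcf_path) into parallel lists.

    The two-column layout here is an illustrative assumption; adjust it to
    match the actual sample map format.
    """
    samples_list, gvcf_list = [], []
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            samples_list.append(row[0])
            gvcf_list.append(row[1])
    return samples_list, gvcf_list
```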

Error Message:

Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/hail/combiner.log
2021-06-21 18:37:37 Hail: INFO: Using 65 intervals with default exome size 60000000 as partitioning for GVCF import
2021-06-21 18:37:37 Hail: INFO: GVCF combiner plan:
    Branch factor: 100
    Batch size: 100
    Combining 5 input files in 1 phases with 1 total jobs.
        Phase 1: 1 job corresponding to 1 final output file.

2021-06-21 18:37:37 Hail: INFO: Starting phase 1/1, merging 5 input GVCFs in 1 job.
2021-06-21 18:37:37 Hail: INFO: Starting phase 1/1, job 1/1 to create 1 merged file, corresponding to ~100.0% of total I/O.

[Stage 0:>                                                         (0 + 8) / 65]
[Stage 0:>                                                        (0 + 15) / 65]
[Stage 0:>                                                         (0 + 8) / 65]
[Stage 0:>                                                        (0 + 12) / 65]Traceback (most recent call last):
  File "/tmp/a335abd27f1041da8eaffc174c60366b/test_combiner.py", line 38, in <module>
    target_records=10000
  File "/opt/conda/default/lib/python3.6/site-packages/hail/experimental/vcf_combiner/vcf_combiner.py", line 681, in run_combiner
    final_mt.write(out_file, overwrite=overwrite)
  File "<decorator-gen-1231>", line 2, in write
  File "/opt/conda/default/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.6/site-packages/hail/matrixtable.py", line 2528, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/py4j_backend.py", line 32, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: ArrayIndexOutOfBoundsException: null

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 20 times, most recent failure: Lost task 2.19 in stage 0.0 (TID 145, test-w-1.c.strokeanderson-hail.internal, executor 2): java.lang.ArrayIndexOutOfBoundsException

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1892)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1880)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1879)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2113)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2062)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2051)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:166)
	at is.hail.rvd.RVD.writeRowsSplit(RVD.scala:952)
	at is.hail.expr.ir.MatrixValue.write(MatrixValue.scala:246)
	at is.hail.expr.ir.MatrixNativeWriter.apply(MatrixWriter.scala:61)
	at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:40)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java.lang.ArrayIndexOutOfBoundsException: null

This is a very strange stack trace. Where are you running this code? Local or cloud? It would also help to update to the latest Hail version so the line numbers match the current source while we debug.
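A quick sketch for checking which Hail version is installed in an environment before and after upgrading. It reads package metadata rather than importing Hail, so it works even when the import itself is broken (assumes a pip-installed hail package):

```python
from importlib import metadata

def installed_hail_version():
    """Return the installed hail package version string, or None if absent."""
    try:
        return metadata.version("hail")
    except metadata.PackageNotFoundError:
        return None

print(installed_hail_version())
```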

I am running this through the Broad's Terra platform. I updated to the latest Hail version; here is the new error log. The code is the same as before.

Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.71-f3a54b530979
LOGGING: writing to /home/hail/combiner.log

2021-07-15 20:27:55 Hail: INFO: Using 65 intervals with default exome size 60000000 as partitioning for GVCF import
2021-07-15 20:27:55 Hail: INFO: GVCF combiner plan:
    Branch factor: 100
    Phase 1 batch size: 100
    Combining 5 input files in 1 phases with 1 total jobs.
        Phase 1: 1 job corresponding to 1 final output file.

2021-07-15 20:27:55 Hail: INFO: Starting phase 1/1, merging 5 input GVCFs in 1 job.
2021-07-15 20:27:55 Hail: INFO: Starting phase 1/1, job 1/1 to create 1 merged file, corresponding to ~100.0% of total I/O.

[Stage 0:>                                                         (0 + 8) / 65]
Traceback (most recent call last):
  File "/tmp/6fb884c676ca495084aafbe35adbf283/test_combiner.py", line 36, in <module>
    hl.experimental.run_combiner(
  File "/opt/conda/default/lib/python3.8/site-packages/hail/experimental/vcf_combiner/vcf_combiner.py", line 705, in run_combiner
    final_mt.write(out_file, overwrite=overwrite)
  File "<decorator-gen-1237>", line 2, in write
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/matrixtable.py", line 2529, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: ArrayIndexOutOfBoundsException: null

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 20 times, most recent failure: Lost task 0.19 in stage 0.0 (TID 166) (test-w-0.c.strokeanderson-hail.internal executor 1): java.lang.ArrayIndexOutOfBoundsException

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2254)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2203)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2202)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2202)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2441)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2383)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2372)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
	at is.hail.rvd.RVD.writeRowsSplit(RVD.scala:978)
	at is.hail.expr.ir.MatrixValue.write(MatrixValue.scala:257)
	at is.hail.expr.ir.MatrixNativeWriter.apply(MatrixWriter.scala:67)
	at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:45)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:790)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:56)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:12)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:29)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$1(SparkBackend.scala:365)
	at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:627)
	at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:627)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:362)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeJSON$1(SparkBackend.scala:406)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:404)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java.lang.ArrayIndexOutOfBoundsException: null
	at 




Hail version: 0.2.71-f3a54b530979
Error summary: ArrayIndexOutOfBoundsException: null

[Stage 0:>                                                         (0 + 1) / 65]
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [6fb884c676ca495084aafbe35adbf283] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/6fb884c676ca495084aafbe35adbf283?project=strokeanderson-hail&region=us-central1
gcloud dataproc jobs wait '6fb884c676ca495084aafbe35adbf283' --region 'us-central1' --project 'strokeanderson-hail'
https://console.cloud.google.com/storage/browser/dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/8de1d632-8d33-4726-b7e4-9c5881b14378/jobs/6fb884c676ca495084aafbe35adbf283/
gs://dataproc-staging-us-central1-538833006791-3bmpl5lg/google-cloud-dataproc-metainfo/8de1d632-8d33-4726-b7e4-9c5881b14378/jobs/6fb884c676ca495084aafbe35adbf283/driveroutput
Submitting to cluster 'test'...
gcloud command:
gcloud dataproc jobs submit pyspark /test_combiner.py \
    --files=gs://iw-hail-anderson-strokes-test/000-hail/strokes_sample_map_test.tsv \
    --py-files=/cromwell_root/tmp.f943c555/pyscripts_fn3j9zcj.zip \
    --properties= \
    -- \
    -g \
    gs://iw-hail-anderson-strokes-test/000-hail/header.g.vcf.gz \
    -s \
    gs://iw-hail-anderson-strokes-test/000-hail/strokes_sample_map_test.tsv \
    -c \
    gs://iw-hail-anderson-strokes-test/000-hail/andersoncallset.mt \
    -t \
    gs://iw-hail-anderson-strokes-test/000-hail//tmp_20210715202734 \
    -o
Traceback (most recent call last):
  File "/usr/local/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/usr/local/lib/python3.6/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
    jmp[args.module].main(args, pass_through_args)
  File "/usr/local/lib/python3.6/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    gcloud.run(cmd)
  File "/usr/local/lib/python3.6/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', '/test_combiner.py', '--cluster=test', '--files=gs://iw-hail-anderson-strokes-test/000-hail/strokes_sample_map_test.tsv', '--py-files=/cromwell_root/tmp.f943c555/pyscripts_fn3j9zcj.zip', '--properties=', '--', '-g', 'gs://iw-hail-anderson-strokes-test/000-hail/header.g.vcf.gz', '-s', 'gs://iw-hail-anderson-strokes-test/000-hail/strokes_sample_map_test.tsv', '-c', 'gs://iw-hail-anderson-strokes-test/000-hail/andersoncallset.mt', '-t', 'gs://iw-hail-anderson-strokes-test/000-hail//tmp_20210715202734', '-o']' returned non-zero exit status 1.
2021/07/15 20:28:14 Starting delocalization.
2021/07/15 20:28:15 Delocalization script execution started...
2021/07/15 20:28:15 Delocalizing output /cromwell_root/memory_retry_rc -> gs://fc-ea8c20a8-36cb-48df-a51d-f96205adc39b/ee570eb1-9589-4f88-b2d9-ee51b7a78c3c/CallsetWithWdl/ccd2f002-8b91-4d9a-874b-d8cc3823c546/call-CreateMatrixTable/memory_retry_rc
2021/07/15 20:28:16 Delocalizing output /cromwell_root/rc -> gs://fc-ea8c20a8-36cb-48df-a51d-f96205adc39b/ee570eb1-9589-4f88-b2d9-ee51b7a78c3c/CallsetWithWdl/ccd2f002-8b91-4d9a-874b-d8cc3823c546/call-CreateMatrixTable/rc
2021/07/15 20:28:17 Delocalizing output /cromwell_root/stdout -> gs://fc-ea8c20a8-36cb-48df-a51d-f96205adc39b/ee570eb1-9589-4f88-b2d9-ee51b7a78c3c/CallsetWithWdl/ccd2f002-8b91-4d9a-874b-d8cc3823c546/call-CreateMatrixTable/stdout
2021/07/15 20:28:19 Delocalizing output /cromwell_root/stderr -> gs://fc-ea8c20a8-36cb-48df-a51d-f96205adc39b/ee570eb1-9589-4f88-b2d9-ee51b7a78c3c/CallsetWithWdl/ccd2f002-8b91-4d9a-874b-d8cc3823c546/call-CreateMatrixTable/stderr
2021/07/15 20:28:20 Delocalization script execution complete.
2021/07/15 20:28:22 Done delocalization.

I am running this through the Broad’s Terra platform.

This means running in a notebook using the Terra notebook runtime, right? Using a Dataproc cluster as the execution system?

@chrisvittal can you take over from here? I don’t have any good ideas at the moment.

I'm running through the WDL/Cromwell workflow route, not the dedicated notebook environment, but I think they're the same thing. The workflow relies on a Dataproc cluster, yes.

Can you run anything in that runtime? Like how about this script:

import hail as hl

mt = hl.balding_nichols_model(n_populations=3, n_samples=1000, n_variants=1000, n_partitions=64)
mt = hl.variant_qc(mt)
mt = hl.sample_qc(mt)
mt._force_count_rows()

Yes, that script appears to run fine as far as I can tell.

My issue looks similar to the one posted here: Cromwell Retry with More Memory feature false failures – Terra Support
The poster's fix of passing --driver-log-levels root=WARN in the gcloud dataproc submit call did not work for me.

Running the same code in a Terra Notebook also generated the same error message.

If possible, can you just submit your script to a hail cluster (created directly with hailctl dataproc) and see if that is any different?

I've tried submitting directly to a Hail cluster. To rule out a problem with my inputs, I also ran the combiner on every pair combination of the five VCFs, and every run fails with the same error as above.
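The pairwise check can be enumerated mechanically. A sketch with placeholder paths (not the exact script used): if a single input file were malformed, only the pairs containing it should fail, so all ten pairs failing points away from the input data.

```python
from itertools import combinations

# Placeholder GVCF paths standing in for the five real inputs.
gvcfs = [f"gs://bucket/sample{i}.g.vcf.gz" for i in range(1, 6)]

# Every unordered pair of inputs: a malformed file would fail only in
# the pairs that contain it; all pairs failing suggests another cause.
pairs = list(combinations(gvcfs, 2))
print(len(pairs))  # 5 choose 2 = 10
```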

To confirm: you started a cluster with hailctl dataproc start, then submitted the script to that cluster, and it crashed with the same message?

Yes.

Great, thanks. Er, one more question - what version of hailctl/Hail did you use for that?

0.2.74-0c3a74d12093

Latest release, awesome. @cdv, maybe the next step is connecting with Isaac to watch the Spark UI when the job fails, grabbing executor logs and such?