ArrayIndexOutOfBoundsException with run_combiner

I wanted to try the experimental combiner by merging a couple of gVCFs and got the error below. Konrad suggested it may be a bug.
Attaching the log file: combiner.log (461.8 KB)

import hail as hl

hl.init(log='/home/hail/combiner.log')

gvcf_list = ["gs://maze-data-external-private/answer-als/genomics/3_vcf/CASE-NEUCF538BRM/CASE-NEUCF538BRM-00610/CASE-NEUCF538BRM-00610-G_1.haplotypeCalls.er.raw.vcf.gz",
             "gs://maze-data-external-private/answer-als/genomics/3_vcf/CASE-NEUAE228FF6/CASE-NEUAE228FF6-02221/CASE-NEUAE228FF6-02221-G_1.haplotypeCalls.er.raw.vcf.gz"]

output_file = 'gs://human-genetics-berylc/ALS/answer_als_hail/tmp/output.mt'  # output destination
temp_bucket = 'gs://human-genetics-berylc/ALS/answer_als_hail/tmp/'  # bucket for storing intermediate files

hl.experimental.run_combiner(gvcf_list,
                             out_file=output_file,
                             tmp_path=temp_bucket,
                             reference_genome='GRCh38',
                             use_genome_default_intervals=True)

Running on Apache Spark version 2.4.5
SparkUI available at http://answer-m.us-central1-b.c.human-genetics-001.internal:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.64-1ef70187dc78
LOGGING: writing to /home/hail/combiner.log
2021-04-09 20:37:07 Hail: INFO: Using 2586 intervals with default whole-genome size 1200000 as partitioning for GVCF import
2021-04-09 20:37:07 Hail: INFO: GVCF combiner plan:
Branch factor: 100
Batch size: 100
Combining 2 input files in 1 phases with 1 total jobs.
Phase 1: 1 job corresponding to 1 final output file.

2021-04-09 20:37:07 Hail: INFO: Starting phase 1/1, merging 2 input GVCFs in 1 job.
2021-04-09 20:37:07 Hail: INFO: Starting phase 1/1, job 1/1 to create 1 merged file, corresponding to ~100.0% of total I/O.

FatalError                                Traceback (most recent call last)
<ipython-input-...> in <module>
      9 output_file = 'gs://human-genetics-berylc/ALS/answer_als_hail/tmp/output.mt' # output destination
     10 temp_bucket = 'gs://human-genetics-berylc/ALS/answer_als_hail/tmp/' # bucket for storing intermediate files
---> 11 hl.experimental.run_combiner(vcf_list, out_file=output_file, tmp_path=temp_bucket, reference_genome='GRCh38', use_genome_default_intervals = True)

/opt/conda/miniconda3/lib/python3.6/site-packages/hail/experimental/vcf_combiner/vcf_combiner.py in run_combiner(sample_paths, out_file, tmp_path, intervals, import_interval_size, use_genome_default_intervals, use_exome_default_intervals, header, sample_names, branch_factor, batch_size, target_records, overwrite, reference_genome, contig_recoding, key_by_locus_and_alleles)
    679     if key_by_locus_and_alleles:
    680         final_mt = MatrixTable(MatrixKeyRowsBy(final_mt._mir, ['locus', 'alleles'], is_sorted=True))
--> 681     final_mt.write(out_file, overwrite=overwrite)
    682     new_files_to_merge = [out_file]
    683     info(f"Finished phase {phase_i}/{n_phases}, job {job_i}/{len(phase.jobs)}, 100% of total I/O finished.")

<decorator-gen-...> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions)

/opt/conda/miniconda3/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578
    579     return wrapper

/opt/conda/miniconda3/lib/python3.6/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions)
2526
2527 writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type)
--> 2528 Env.backend().execute(ir.MatrixWrite(self._mir, writer))
2529
2530 class _Show:

/opt/conda/miniconda3/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
96 raise HailUserError(message_and_trace) from None
97
---> 98 raise e

/opt/conda/miniconda3/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
72 # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
73 try:
---> 74 result = json.loads(self._jhc.backend().executeJSON(jir))
     75 value = ir.typ._from_json(result['value'])
     76 timings = result['timings']

/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
--> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

/opt/conda/miniconda3/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     30         raise FatalError('%s\n\nJava stack trace:\n%s\n'
     31                          'Hail version: %s\n'
---> 32                          'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     33     except pyspark.sql.utils.CapturedException as e:
     34         raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: ArrayIndexOutOfBoundsException: null

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 20 times, most recent failure: Lost task 9.19 in stage 1.0 (TID 175, answer-w-1.us-central1-b.c.human-genetics-001.internal, executor 1): java.lang.ArrayIndexOutOfBoundsException

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1892)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1880)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1879)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2113)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2062)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2051)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
at is.hail.rvd.RVD.writeRowsSplit(RVD.scala:953)
at is.hail.expr.ir.MatrixValue.write(MatrixValue.scala:246)
at is.hail.expr.ir.MatrixNativeWriter.apply(MatrixWriter.scala:62)
at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:41)
at is.hail.expr.ir.Interpret$.run(Interpret.scala:819)
at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:362)
at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:346)
at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:343)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:48)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:48)
at is.hail.utils.package$.using(package.scala:618)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:48)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:618)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:13)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:47)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:256)
at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:343)
at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:387)
at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:385)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:385)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

java.lang.ArrayIndexOutOfBoundsException: null
at

Hail version: 0.2.64-1ef70187dc78
Error summary: ArrayIndexOutOfBoundsException: null

I also wanted to mention that, on the off chance the failure was due to too few gVCFs, I tried this with the full dataset and got the same error. ALS_answer_als_hail_tmp_combiner.log (3.8 MB)

@chrisvittal can you take this? Looks like the real error is in tabix:

2021-04-09 20:37:20 TaskSetManager: WARN: Lost task 4.0 in stage 1.0 (TID 6, answer-w-1.us-central1-b.c.human-genetics-001.internal, executor 1): java.lang.ArrayIndexOutOfBoundsException: 3366
	at is.hail.io.tabix.TabixReader$$anonfun$1.apply(TabixReader.scala:116)
	at is.hail.io.tabix.TabixReader$$anonfun$1.apply(TabixReader.scala:86)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.io.tabix.TabixReader.<init>(TabixReader.scala:86)
	at is.hail.io.vcf.PartitionedVCFRDD.compute(LoadVCF.scala:1486)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)

Just wanted to bump this. cc @chrisvittal

@tpoterba @chrisvittal any luck?

From the look of it, the issue may be in the tabix index files themselves. A well-formed tabix index should not be able to hit the error you did. I don't know of a dedicated tool for verifying tabix validity, but this failure is essentially an assertion that we couldn't parse your tabix indices.
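
If it helps narrow things down, here is a rough sanity check, a minimal sketch assuming pysam is installed and the gVCFs plus their .tbi files have been copied to local disk (e.g. with gsutil cp). It does not use Hail's index parser, so it may not catch the exact corruption Hail trips over, but it will flag indices that htslib/pysam also refuse to parse:

import pysam

# Local copies of the gVCFs (and their .tbi files) to check; these basenames
# are placeholders for whatever you copy down from the bucket.
local_gvcfs = [
    "CASE-NEUCF538BRM-00610-G_1.haplotypeCalls.er.raw.vcf.gz",
    "CASE-NEUAE228FF6-02221-G_1.haplotypeCalls.er.raw.vcf.gz",
]

for path in local_gvcfs:
    try:
        tf = pysam.TabixFile(path)   # parses the .tbi index on open
        _ = tf.contigs               # touch the contig list to force an index read
        tf.close()
        print(f"OK      {path}")
    except Exception as e:
        print(f"FAILED  {path}: {e}")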

The best solution I have is to re-tabix your gVCFs.
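
For example, a minimal sketch of re-indexing, assuming the tabix binary from htslib is on the PATH and the gVCFs are bgzip-compressed local copies (the filenames below are placeholders); after re-indexing, copy the fresh .tbi files back next to the gVCFs in GCS:

import subprocess

# Local copies of the gVCFs to re-index (placeholder names).
local_gvcfs = [
    "CASE-NEUCF538BRM-00610-G_1.haplotypeCalls.er.raw.vcf.gz",
    "CASE-NEUAE228FF6-02221-G_1.haplotypeCalls.er.raw.vcf.gz",
]

for path in local_gvcfs:
    # -p vcf selects the VCF preset; -f overwrites the existing (possibly corrupt) .tbi
    subprocess.run(["tabix", "-f", "-p", "vcf", path], check=True)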