Potential liftover issue: NoSuchElementException: key not found: GRCh37

I have a script that I run on Google Dataproc using Hail (v0.2.73-2ab84582d24d). It does some simple filtering and processing, and lifts over coordinates from GRCh38 to GRCh37. When I run it on small datasets (under ~1M variants), it completes as expected; runs on larger datasets, however, fail with the following error:
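
For context, the liftover step follows the standard Hail pattern. Sketched with placeholder names (`chain_file` stands in for the real GCS path, and `ht` for the input table), it looks roughly like this:

```python
import hail as hl

# Placeholder path; the real chain file arrives as a command-line argument:
chain_file = 'gs://<bucket>/references/grch38_to_grch37.over.chain.gz'

rg37 = hl.get_reference('GRCh37')
rg38 = hl.get_reference('GRCh38')

# Make the GRCh38 -> GRCh37 liftover available on the GRCh38 reference:
rg38.add_liftover(chain_file, rg37)

# ht is the input table, keyed by a GRCh38 locus; annotate each row
# with its lifted-over locus:
ht = ht.annotate(locus_GRCh37=hl.liftover(ht.locus, 'GRCh37'))
```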

Running on Apache Spark version 3.1.1
SparkUI available at http://hail-single-m.europe-west1-d.c.open-targets-genetics-dev.internal:34599
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.73-2ab84582d24d
LOGGING: writing to /home/hail/hail-20210729-1209-0.2.73-2ab84582d24d.log
2021-07-29 12:10:37 INFO generate_variant_annotation - main: Variants post-MAF filter: 1019107
Traceback (most recent call last):
  File "/tmp/5ba552cf84f64c8484ef9f907e2ee005/generate_variant_annotation.py", line 192, in <module>
    main(gnomad_file, chain_file, maf_threshold, out_folder, args.test)
  File "/tmp/5ba552cf84f64c8484ef9f907e2ee005/generate_variant_annotation.py", line 133, in main
    ht
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: NoSuchElementException: key not found: GRCh37

Java stack trace:
org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 27 in stage 2.0 failed 20 times, most recent failure: Lost task 27.19 in stage 2.0 (TID 4324) (hail-single-m.europe-west1-d.c.open-targets-genetics-dev.internal executor 6): java.util.NoSuchElementException: key not found: GRCh37
	at scala.collection.MapLike.default(MapLike.scala:235)
	at scala.collection.MapLike.default$(MapLike.scala:234)
	at scala.collection.AbstractMap.default(Map.scala:65)
	at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
	at is.hail.variant.ReferenceGenome.getLiftover(ReferenceGenome.scala:412)
	at is.hail.variant.ReferenceGenome.liftoverLocus(ReferenceGenome.scala:423)
	at __C700Compiled.applyregion0_8(Emit.scala)
	at __C700Compiled.apply(Emit.scala)
	at is.hail.expr.ir.TableMapRows.$anonfun$execute$43(TableIR.scala:1936)
	at scala.runtime.java8.JFunction1$mcJJ$sp.apply(JFunction1$mcJJ$sp.java:23)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at is.hail.utils.richUtils.RichContextRDD$$anon$1.next(RichContextRDD.scala:79)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:343)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2254)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2203)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2202)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2202)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2441)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2383)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2372)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

java.util.NoSuchElementException: key not found: GRCh37
	at scala.collection.MapLike.default(MapLike.scala:235)
	at scala.collection.MapLike.default$(MapLike.scala:234)
	at scala.collection.AbstractMap.default(Map.scala:65)
	at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
	at is.hail.variant.ReferenceGenome.getLiftover(ReferenceGenome.scala:412)
	at is.hail.variant.ReferenceGenome.liftoverLocus(ReferenceGenome.scala:423)
	at __C700Compiled.applyregion0_8(Emit.scala)
	at __C700Compiled.apply(Emit.scala)
	at is.hail.expr.ir.TableMapRows.$anonfun$execute$43(TableIR.scala:1936)
	at scala.runtime.java8.JFunction1$mcJJ$sp.apply(JFunction1$mcJJ$sp.java:23)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at is.hail.utils.richUtils.RichContextRDD$$anon$1.next(RichContextRDD.scala:79)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:343)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.73-2ab84582d24d
Error summary: NoSuchElementException: key not found: GRCh37
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [5ba552cf84f64c8484ef9f907e2ee005] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/5ba552cf84f64c8484ef9f907e2ee005?project=open-targets-genetics-dev&region=europe-west1
gcloud dataproc jobs wait '5ba552cf84f64c8484ef9f907e2ee005' --region 'europe-west1' --project 'open-targets-genetics-dev'
https://console.cloud.google.com/storage/browser/dataproc-staging-europe-west1-234703259993-hrobeqyg/google-cloud-dataproc-metainfo/a98a9d69-97d3-4fb9-b48d-d911b63089d2/jobs/5ba552cf84f64c8484ef9f907e2ee005/
gs://dataproc-staging-europe-west1-234703259993-hrobeqyg/google-cloud-dataproc-metainfo/a98a9d69-97d3-4fb9-b48d-d911b63089d2/jobs/5ba552cf84f64c8484ef9f907e2ee005/driveroutput

When the liftover step is removed, the script doesn’t fail, which suggests the problem is in the liftover itself, perhaps related to how the reference genome data is broadcast to the workers. I would highly appreciate any feedback or tips that could help me get past this issue. A stripped-down sketch of the pipeline follows.
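
For reference, here is a minimal sketch of the failing pipeline. The paths are placeholders standing in for the real command-line arguments (`gnomad_file`, `chain_file`, `out_folder` in the traceback), the MAF filtering is elided, and the final write is my reading of the traceback, which goes through pyspark’s `DataFrameWriter.save`:

```python
import hail as hl

# Placeholder paths standing in for the real command-line arguments:
gnomad_file = 'gs://<bucket>/gnomad_variants.ht'
chain_file = 'gs://<bucket>/grch38_to_grch37.over.chain.gz'
out_folder = 'gs://<bucket>/variant_annotation_parquet'

hl.init()

rg37 = hl.get_reference('GRCh37')
rg38 = hl.get_reference('GRCh38')
rg38.add_liftover(chain_file, rg37)

ht = hl.read_table(gnomad_file)
# ... MAF filtering and other simple processing elided ...
ht = ht.annotate(locus_GRCh37=hl.liftover(ht.locus, 'GRCh37'))

# The Python traceback goes through pyspark's DataFrameWriter.save,
# so the failing step is (roughly) a Spark-side parquet write:
ht.to_spark().write.mode('overwrite').parquet(out_folder)
```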

We recently fixed a bug in how reference genomes were serialized, but our fix clearly isn’t working as intended. Will investigate…

Tracking issue here:

Thank you very much for the quick reply!

Can you help me replicate this? What’s an example set of command-line args that triggers the issue (which, judging by the stack trace, I assume happens in the parquet write step)?