Error Indexing BGEN files


#1

We are getting an error when trying to index BGEN files. For example, indexing UKB chr22 fails with a Java NullPointerException (first set of output below).

To simplify, we downloaded a small BGEN file (example.8bits.bgen) from Gavin Band’s bgen repo and tested it. It produced an AssertionError (second set of output below).
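When a small file like this fails, it can help to look at its header independently of Hail. The following is a minimal sketch, assuming the header layout described in the BGEN specification (a 4-byte offset, then header length, variant count, sample count, magic bytes, a free data area, and a 4-byte flags word, all little-endian). The function name `read_bgen_header` and the helper `_read_exact` are hypothetical, and nothing here is claimed about which condition the failing assertion in `LoadBgen.readState` actually checks.

```python
import struct
from typing import BinaryIO, Dict


def _read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes or raise, so truncated files fail loudly."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError(f"truncated file: wanted {n} bytes, got {len(data)}")
    return data


def read_bgen_header(stream: BinaryIO) -> Dict[str, object]:
    """Parse the fixed header fields of a BGEN file (hypothetical helper)."""
    # First 4 bytes: offset of the first variant block, counted from byte 4.
    (offset,) = struct.unpack("<I", _read_exact(stream, 4))
    # Header block: its own length, number of variants, number of samples.
    header_length, n_variants, n_samples = struct.unpack(
        "<III", _read_exact(stream, 12)
    )
    magic = _read_exact(stream, 4)
    if header_length < 20:
        raise ValueError(f"header length {header_length} is too small")
    # Skip the free data area; the flags are the last 4 bytes of the header.
    _read_exact(stream, header_length - 20)
    (flags,) = struct.unpack("<I", _read_exact(stream, 4))
    compression_names = {0: "none", 1: "zlib", 2: "zstd"}
    compression = flags & 0x3
    return {
        "offset": offset,
        "header_length": header_length,
        "n_variants": n_variants,
        "n_samples": n_samples,
        # The specification allows either b"bgen" or four zero bytes here.
        "magic_ok": magic in (b"bgen", b"\x00\x00\x00\x00"),
        "compression": compression_names.get(compression, "unknown"),
        "layout": (flags >> 2) & 0xF,
        "has_sample_ids": bool((flags >> 31) & 1),
    }
```

For the test file above, this could be called as `read_bgen_header(open("/tmp/example/example.8bits.bgen", "rb"))`; the reported layout and compression values are the kind of detail worth including in a bug report.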

We are able to successfully import UKB chr22 in Plink format (from S3) and write the corresponding MT out to HDFS as a smoke test.

All three tests were run using spark-submit on an AWS EMR cluster. Hail is compiled on the cluster and deployed on Spark 2.3.0. The build command is:

./gradlew -Dspark.version=2.3.0 -Dbreeze.version=0.13.2 -Dpy4j.version=0.10.6 shadowJar archiveZip

The cluster is running Python 3.6.7.

In a completely separate test, a colleague followed the recently released hms-dbmi deployment process and hit the same issue: a NullPointerException when indexing chr21 in that case (importing chr21 in Plink format and writing it out as an MT was also successful). Those tests were performed in a Jupyter notebook.

Thanks!
Dave

--------- Output from indexing chr22 ----------------------------------------------------------------------
Initializing Spark and Hail with default parameters…
Running on Apache Spark version 2.3.0
SparkUI available at http://ip-10-100-113-69
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.7-18bcc2aacbde
LOGGING: writing to /home/dficenec/hail-20190115-0122-0.2.7-18bcc2aacbde.log
[Stage 1:> (0 + 1) / 1]Traceback (most recent call last):
  File "/home/dficenec/hail-bgen-index.py", line 9, in <module>
    hl.index_bgen(files[1])
  File "", line 2, in index_bgen
  File "/home/hadoop/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
  File "/home/hadoop/hail-python.zip/hail/methods/impex.py", line 2029, in index_bgen
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/home/hadoop/hail-python.zip/hail/utils/java.py", line 227, in deco
hail.utils.java.FatalError: NullPointerException: null

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-10-100-117-123…, executor 1): java.lang.NullPointerException
at is.hail.io.PackCodecSpec.buildEncoder(RowStore.scala:180)
at is.hail.io.index.IndexWriter.<init>(IndexWriter.scala:57)
at is.hail.io.bgen.IndexBgen$$anonfun$apply$3.apply(IndexBgen.scala:103)
at is.hail.io.bgen.IndexBgen$$anonfun$apply$3.apply(IndexBgen.scala:100)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1750)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1738)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1737)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1737)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1971)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1920)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1909)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:682)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
at is.hail.io.bgen.IndexBgen$.apply(IndexBgen.scala:100)
at is.hail.HailContext.indexBgen(HailContext.scala:384)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

java.lang.NullPointerException: null
at is.hail.io.PackCodecSpec.buildEncoder(RowStore.scala:180)
at is.hail.io.index.IndexWriter.<init>(IndexWriter.scala:57)
at is.hail.io.bgen.IndexBgen$$anonfun$apply$3.apply(IndexBgen.scala:103)
at is.hail.io.bgen.IndexBgen$$anonfun$apply$3.apply(IndexBgen.scala:100)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.7-18bcc2aacbde
Error summary: NullPointerException: null

----------------------Output from indexing example.8bits.bgen -----------------------------

Running on Apache Spark version 2.3.0
SparkUI available at http://ip-10-100-113-69
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.7-18bcc2aacbde
LOGGING: writing to /home/dficenec/hail-20190115-0108-0.2.7-18bcc2aacbde.log
Traceback (most recent call last):
  File "/home/dficenec/hail-bgen-test.py", line 7, in <module>
    hl.index_bgen("/tmp/example/example.8bits.bgen")
  File "", line 2, in index_bgen
  File "/home/hadoop/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
  File "/home/hadoop/hail-python.zip/hail/methods/impex.py", line 2029, in index_bgen
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/home/hadoop/hail-python.zip/hail/utils/java.py", line 227, in deco
hail.utils.java.FatalError: AssertionError: assertion failed

Java stack trace:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at is.hail.io.bgen.LoadBgen$.readState(LoadBgen.scala:107)
at is.hail.io.bgen.LoadBgen$$anonfun$readState$1.apply(LoadBgen.scala:97)
at is.hail.io.bgen.LoadBgen$$anonfun$readState$1.apply(LoadBgen.scala:95)
at is.hail.utils.package$.using(package.scala:587)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:285)
at is.hail.io.bgen.LoadBgen$.readState(LoadBgen.scala:95)
at is.hail.io.bgen.LoadBgen$$anonfun$getFileHeaders$1.apply(LoadBgen.scala:247)
at is.hail.io.bgen.LoadBgen$$anonfun$getFileHeaders$1.apply(LoadBgen.scala:247)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at is.hail.io.bgen.LoadBgen$.getFileHeaders(LoadBgen.scala:247)
at is.hail.io.bgen.IndexBgen$.apply(IndexBgen.scala:51)
at is.hail.HailContext.indexBgen(HailContext.scala:384)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.7-18bcc2aacbde
Error summary: AssertionError: assertion failed


#2

Hi Dave,

Thank you for letting us know. I’ve created an issue for the error: https://github.com/hail-is/hail/issues/5144

Best,
Jackie


#3

Hi Dave,

Naive question, but have you tried increasing memory using the --driver-memory spark-submit argument?


#4

The issue was resolved with an update from the Hail team (resolution of https://github.com/hail-is/hail/issues/5144).

But yes, our standard Hail cluster configuration already uses increased driver and executor memory options.
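For anyone landing here with a memory question, a minimal sketch of how those options are passed on the command line follows. `--driver-memory` and `--executor-memory` are standard spark-submit flags; the function name `spark_submit_command` and the `8g` sizes are hypothetical placeholders, not the values used on this cluster, and a real Hail submission would also need the Hail JAR and Python zip on the command line.

```python
import shlex
from typing import List


def spark_submit_command(
    script: str,
    driver_memory: str = "8g",    # placeholder size, an assumption
    executor_memory: str = "8g",  # placeholder size, an assumption
) -> List[str]:
    """Build a spark-submit argument list with explicit memory settings."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        script,
    ]


def format_command(args: List[str]) -> str:
    """Quote each argument so the command can be pasted into a shell."""
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(format_command(spark_submit_command("/home/dficenec/hail-bgen-index.py")))
```

Keeping the command as a list rather than a single string avoids quoting mistakes if it is later handed to `subprocess.run`.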

Thanks!