Import_vcf failure on multiple inputs

We’re experiencing very strange behaviour when trying to import more than one block-compressed VCF file from either S3 or local disk. The behaviour is the same whether we run in a Jupyter notebook, use pyspark at the command line, or submit Python scripts with spark-submit.

# This works fine
hail.import_vcf([vcf_1], force_bgz=True)

# This fails
hail.import_vcf([vcf_1, vcf_2], force_bgz=True)

Note that the failing case above also manifests if you try to import [vcf_1, vcf_1]. There’s nothing wrong with importing [vcf_2] on its own. Our first assumption, therefore, was that we were crossing an aggregate file-size threshold with multiple VCFs, but experimentation has shown this not to be the case: regardless of how big or small the input VCFs are, the import fails if there’s more than one of them.

The errors themselves differ between S3 and local disk. The S3 error is:

Error summary: ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304

…while the local error is particularly weird:

Error summary: FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist

It’s weird because this happens even if you try importing the same VCF twice: the file is clearly there, and is successfully imported once, but Hail still says it can’t be found the second time! I’ll post the full stack traces separately, because they’re pretty long.

This is running on an in-development Hail/Spark cluster (Hail 0.2.14-8dcb6722c72a, Spark 2.4.3, Hadoop 2.7.7) that we assume to be configured incorrectly, as we’ve run this type of workflow successfully before. (The correctly configured cluster no longer exists, unfortunately.)

(N.B. This is related to “SSLException: connection reset during matrixtable.write”, in the sense that it’s the same cluster and, in the S3 case, the same error, but for importing rather than writing.)

what’s the full stack trace?

S3 stack trace:

Traceback (most recent call last):
  File "/opt/sanger.ac.uk/hgi/src/hail-import-issue/import.py", line 25, in <module>
    mt = hl.import_vcf([f"{prefix_s3}/{vcf_100kB}", f"{prefix_s3}/{vcf_5MB}"], force_bgz=True)
  File "</opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/decorator.py:decorator-gen-1140>", line 2, in import_vcf
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/typecheck/check.py", line 561, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/methods/impex.py", line 1935, in import_vcf
    return MatrixTable(MatrixRead(reader, drop_cols=drop_samples))
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/matrixtable.py", line 558, in __init__
    self._type = self._mir.typ
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/base_ir.py", line 158, in typ
    self._compute_type()
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/matrix_ir.py", line 40, in _compute_type
    self._type = Env.backend().matrix_type(self)
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 121, in matrix_type
    jir = self._to_java_ir(mir)
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 102, in _to_java_ir
    ir._jir = ir.parse(r(ir), ir_map=r.jirs)
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/base_ir.py", line 163, in parse
    return Env.hail().expr.ir.IRParser.parse_matrix_ir(code, ref_map, ir_map)
  File "/opt/sanger.ac.uk/hgi/spark-2.4.3-bin-hgi-hadoop2.7.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/utils/java.py", line 240, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304

Java stack trace:
java.lang.reflect.InvocationTargetException: null
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.json4s.reflect.Executable.invoke(Executable.scala:52)
	at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:554)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:597)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$ClassInstanceBuilder.mkWithTypeHint(Extraction.scala:587)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:596)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$.extract(Extraction.scala:39)
	at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
	at org.json4s.jackson.Serialization$.read(Serialization.scala:50)
	at org.json4s.Serialization$class.read(Serialization.scala:30)
	at org.json4s.jackson.Serialization$.read(Serialization.scala:17)
	at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1117)
	at is.hail.expr.ir.IRParser$.matrix_ir(Parser.scala:1053)
	at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$.parse(Parser.scala:1253)
	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1268)
	at is.hail.expr.ir.IRParser.parse_matrix_ir(Parser.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.226.83, executor 21): org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
	at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
	at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:166)
	at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
	at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
	at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
	at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
	at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
	at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
	at com.amazonaws.services.s3.model.S3ObjectInputStream.abort(S3ObjectInputStream.java:90)
	at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:199)
	at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
	at org.apache.hadoop.io.compress.CompressionInputStream.close(CompressionInputStream.java:63)
	at org.apache.hadoop.io.compress.DecompressorStream.close(DecompressorStream.java:206)
	at is.hail.utils.package$.using(package.scala:596)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:293)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.readLines$extension(RichHadoopConfiguration.scala:300)
	at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1190)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1442)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1439)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
	at is.hail.io.vcf.MatrixVCFReader.<init>(LoadVCF.scala:1439)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.json4s.reflect.Executable.invoke(Executable.scala:52)
	at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:554)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:597)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$ClassInstanceBuilder.mkWithTypeHint(Extraction.scala:587)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:596)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$.extract(Extraction.scala:39)
	at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
	at org.json4s.jackson.Serialization$.read(Serialization.scala:50)
	at org.json4s.Serialization$class.read(Serialization.scala:30)
	at org.json4s.jackson.Serialization$.read(Serialization.scala:17)
	at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1117)
	at is.hail.expr.ir.IRParser$.matrix_ir(Parser.scala:1053)
	at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$.parse(Parser.scala:1253)
	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1268)
	at is.hail.expr.ir.IRParser.parse_matrix_ir(Parser.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
	at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
	at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:166)
	at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
	at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
	at java.io.FilterInputStream.close(FilterInputStream.java:180)
	at java.io.FilterInputStream.close(FilterInputStream.java:180)
	at java.io.FilterInputStream.close(FilterInputStream.java:180)
	at java.io.FilterInputStream.close(FilterInputStream.java:180)
	at com.amazonaws.services.s3.model.S3ObjectInputStream.abort(S3ObjectInputStream.java:90)
	at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:199)
	at java.io.FilterInputStream.close(FilterInputStream.java:180)
	at org.apache.hadoop.io.compress.CompressionInputStream.close(CompressionInputStream.java:63)
	at org.apache.hadoop.io.compress.DecompressorStream.close(DecompressorStream.java:206)
	at is.hail.utils.package$.using(package.scala:596)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:293)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.readLines$extension(RichHadoopConfiguration.scala:300)
	at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1190)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1442)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1439)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(Thread.java:834)





Hail version: 0.2.14-8dcb6722c72a
Error summary: ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304

Local stack trace:

Traceback (most recent call last):
  File "/opt/sanger.ac.uk/hgi/src/hail-import-issue/import.py", line 25, in <module>
    mt = hl.import_vcf([f"{prefix_local}/{vcf_100kB}", f"{prefix_local}/{vcf_5MB}"], force_bgz=True)
  File "</opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/decorator.py:decorator-gen-1140>", line 2, in import_vcf
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/typecheck/check.py", line 561, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/methods/impex.py", line 1935, in import_vcf
    return MatrixTable(MatrixRead(reader, drop_cols=drop_samples))
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/matrixtable.py", line 558, in __init__
    self._type = self._mir.typ
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/base_ir.py", line 158, in typ
    self._compute_type()
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/matrix_ir.py", line 40, in _compute_type
    self._type = Env.backend().matrix_type(self)
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 121, in matrix_type
    jir = self._to_java_ir(mir)
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 102, in _to_java_ir
    ir._jir = ir.parse(r(ir), ir_map=r.jirs)
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/base_ir.py", line 163, in parse
    return Env.hail().expr.ir.IRParser.parse_matrix_ir(code, ref_map, ir_map)
  File "/opt/sanger.ac.uk/hgi/spark-2.4.3-bin-hgi-hadoop2.7.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/utils/java.py", line 240, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist

Java stack trace:
java.lang.reflect.InvocationTargetException: null
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.json4s.reflect.Executable.invoke(Executable.scala:52)
	at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:554)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:597)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$ClassInstanceBuilder.mkWithTypeHint(Extraction.scala:587)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:596)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$.extract(Extraction.scala:39)
	at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
	at org.json4s.jackson.Serialization$.read(Serialization.scala:50)
	at org.json4s.Serialization$class.read(Serialization.scala:30)
	at org.json4s.jackson.Serialization$.read(Serialization.scala:17)
	at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1117)
	at is.hail.expr.ir.IRParser$.matrix_ir(Parser.scala:1053)
	at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$.parse(Parser.scala:1253)
	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1268)
	at is.hail.expr.ir.IRParser.parse_matrix_ir(Parser.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.226.143, executor 14): java.io.FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.is$hail$utils$richUtils$RichHadoopConfiguration$$open$extension(RichHadoopConfiguration.scala:38)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:293)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.readLines$extension(RichHadoopConfiguration.scala:300)
	at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1190)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1442)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1439)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
	at is.hail.io.vcf.MatrixVCFReader.<init>(LoadVCF.scala:1439)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.json4s.reflect.Executable.invoke(Executable.scala:52)
	at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:554)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:597)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$ClassInstanceBuilder.mkWithTypeHint(Extraction.scala:587)
	at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:596)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
	at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$.extract(Extraction.scala:39)
	at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
	at org.json4s.jackson.Serialization$.read(Serialization.scala:50)
	at org.json4s.Serialization$class.read(Serialization.scala:30)
	at org.json4s.jackson.Serialization$.read(Serialization.scala:17)
	at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1117)
	at is.hail.expr.ir.IRParser$.matrix_ir(Parser.scala:1053)
	at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$.parse(Parser.scala:1253)
	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1269)
	at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1268)
	at is.hail.expr.ir.IRParser.parse_matrix_ir(Parser.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java.io.FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.is$hail$utils$richUtils$RichHadoopConfiguration$$open$extension(RichHadoopConfiguration.scala:38)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:293)
	at is.hail.utils.richUtils.RichHadoopConfiguration$.readLines$extension(RichHadoopConfiguration.scala:300)
	at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1190)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1442)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1439)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(Thread.java:834)





Hail version: 0.2.14-8dcb6722c72a
Error summary: FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist

Also, if you run:

mt = hail.import_vcf([vcf_1], force_bgz=True)
mt._force_count_rows()

does it succeed?

I think the problem here is that the files are on a file system visible to the driver machine, not the workers. The header info for the first file is read locally; the rest are parallelized. Then all computations on the full data are done in parallel, so I’d expect the above to fail.
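
A quick way to sanity-check this is to ask the executors directly whether they can see the file. A minimal sketch, assuming Hail/Spark is already initialised (the path is the one from your trace, and this only samples whichever executors happen to pick up the tasks):

import os
import hail as hl

sc = hl.spark_context()
path = "/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz"

# Run one trivial task per default partition and report whether the file is visible there
visible = sc.parallelize(range(sc.defaultParallelism)).map(lambda _: os.path.exists(path)).collect()
print(f"{sum(visible)}/{len(visible)} tasks could see {path}")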

That does indeed fail for the local file, but it works fine with the VCF loaded from S3. So, yes, I suspect you’re right about the local file not being distributed to the workers – so that’s a red herring – but the S3 issue is odd…

Ah, the S3 error. I haven’t seen anything like that before.

Can you import and force-count the files individually? If they’re not actually bgzipped, that might be one source of errors.
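
For what it’s worth, one way to check that a .gz file really is BGZF-compressed is to look for the ‘BC’ extra subfield in the first gzip block header. A minimal sketch; it needs a local copy of the file, and the path below is hypothetical:

def looks_like_bgzf(path):
    # BGZF blocks are gzip members with FLG.FEXTRA set and a 'BC' extra
    # subfield (see the SAM/BAM spec). Checking only the first block is
    # enough for files written by bgzip/htslib.
    with open(path, "rb") as f:
        header = f.read(16)
    return (len(header) == 16
            and header[:2] == b"\x1f\x8b"        # gzip magic
            and (header[3] & 0x04) != 0          # FEXTRA flag set
            and header[12:14] == b"BC")          # BGZF subfield identifier

print(looks_like_bgzf("/tmp/chr2_example.g.vcf.gz"))  # hypothetical local copy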

The VCFs are genuinely block-compressed, but they’re not appropriately named. Regarding local storage: to test your theory, I tried using cluster-shared local storage and it did indeed work, so the files not being visible to the workers was the problem in the local case… On to the more troublesome S3 case!

I tried it sequentially, per your suggestion:

mt1 = hail.import_vcf([s3_vcf_1], force_bgz=True)
mt1._force_count_rows()

mt2 = hail.import_vcf([s3_vcf_2], force_bgz=True)
mt2._force_count_rows()

…and it worked fine. If I change either of the lists above to contain more than one element, it again fails with the same error.

aha - it looks like maybe force_bgz is broken for multiple files…

wait, no – bgzip files are valid gzip files, so my idea isn’t correct.

…also, force_bgz is working fine with multiple local files and has been demonstrated to work with multiple S3 files on our previous cluster.

We are getting our systems people to look at this problem too (it is not sporadic: it happens systematically any time we import >1 vcf.gz file from S3 in parallel). They are asking if they can see the source code: they’re wondering whether there’s a branch in it for >1 file that could yield insight. I can see the code starts in Python and presumably goes into Scala at some point. Can anyone point at the modules we should look at first, or is this barking up the wrong tree?

Well (answering my own question), this:

methods/impex.py: _cached_importvcfs = Env.hail().io.vcf.ImportVCFs

…leads to this:

scala/is/hail/io/vcf/LoadVCF.scala

…but no real gain of insight there so far.

This piece is the bit that’s different for file 1 and files 2…N:

We parallelize files.tail (all but the first) to scan for headers. The difference between this code and the regular data path is that these files will be read as .gz files, not .bgz, which is usually fine. But if your Hadoop installation somehow has an invalid gzip reader, this could throw strange errors?
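
To make the shape of that concrete, here’s a rough Python-flavoured sketch of the flow (the paths are hypothetical, and header_lines is a toy stand-in for the Scala getHeaderLines, not the real implementation):

import gzip
import hail as hl

def header_lines(path):
    # Toy stand-in: read the '#' header lines of a (b)gzipped VCF.
    with gzip.open(path, "rt") as f:
        return [line for line in f if line.startswith("#")]

sc = hl.spark_context()
files = ["/shared/chr2_example.g.vcf.gz", "/shared/chr7_example.g.vcf.gz"]  # hypothetical

first_header = header_lines(files[0])                                         # read on the driver
sc.parallelize(files[1:], len(files) - 1).foreach(lambda f: header_lines(f))  # read on the executors
# (the real code also checks each header for compatibility with files[0]'s)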

If you rename to .bgz, does it work?

If you rename the files to end in .bgz, the failure mode changes.

With a non-.bgz extension and force_bgz=True, import_vcf consistently fails for lists of more than one S3 file (as documented above). The failure occurs whether the list of files is homogeneous (e.g., [file, file]) or heterogeneous (e.g., [file1, file2]), regardless of chromosome.

With a .bgz extension and force_bgz left unspecified (so its default of False), the behaviour of import_vcf is as follows (see the sketch after this list):

  • A list of one, S3-hosted VCF: Works
  • Homogeneous list of S3-hosted VCFs:
    • From chromosome X: Fails
    • Without chrX: Works
  • Heterogeneous list of S3-hosted VCFs:
    • Including chrX: Fails
    • Without chrX: Works
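
For clarity, the two invocation styles being compared are along these lines (the bucket prefix and file names are placeholders):

import hail

prefix_s3 = "s3a://our-bucket/vcfs"  # placeholder

# Original naming: .gz extension, so force_bgz=True is required
mt = hail.import_vcf([f"{prefix_s3}/chr2_example.g.vcf.gz", f"{prefix_s3}/chrX_example.g.vcf.gz"], force_bgz=True)

# Renamed copies: .bgz extension, force_bgz left at its default (False)
mt = hail.import_vcf([f"{prefix_s3}/chr2_example.g.vcf.bgz", f"{prefix_s3}/chrX_example.g.vcf.bgz"])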

The condition on chromosome X may be a red herring, as my test data set is quite limited; e.g., I don’t know whether it affects other chromosomes (I’m testing with intervals from chr2, 7, 10, 14, 16, 18, 21 and X) or different intervals in chrX. If I import a list of just one chrX VCF, it works fine.

Unless there is something obvious – I’m not a geneticist, but perhaps there’s a biological reason why chrX might stand out – I will try other chromosomes and different intervals from chrX to see if I can gather more evidence of a pattern…

what’s the error message for the chrX failure? The same one?

This is totally baffling.

I think the chrX failure is just a red herring…

I’ve just tried importing 10 VCFs per chromosome, with a .bgz extension, from S3. More than half the chromosomes work, but the VCFs from chr1, 10, 14, 15, 16, 19, 21, 22, X and Y fail in the same way as described above. As far as I know, there’s nothing special about that set of chromosomes, so I suspect it’s more likely to be something to do with those particular files. I will try random subsets of them to see if I can narrow it down…

That said, I should note that these very same VCFs have correctly been imported into Hail on a previous cluster, so I’m doubtful that it’s got anything to do with the files themselves. Another thing I can try is importing more VCFs for the chromosomes that did work, to see if I can force it to fail.

It’s baffling to all of us! I don’t see any deterministic failure pattern :confused:

What Hail version was the previous cluster using? Same version?

Could be something related to the OpenStack file abstractions, I guess…

I’m not 100% convinced about this – I was certain the previous cluster was running Hail 0.2 – but the Git repo suggests 0.1-0bd1988e, on Spark 2.1.2 with Hadoop 2.8.2. Our new provisioning is Hail 0.2.14-8dcb6722c72a, on Spark 2.4.3 with Hadoop 2.7.7.

Take that with a grain of salt, because our previous provisioning is so complex that no one can really follow it. However, if it was 0.1, have the internals of the import changed significantly enough between 0.1 and 0.2 to account for this?