S3 stack trace:

```
Traceback (most recent call last):
File "/opt/sanger.ac.uk/hgi/src/hail-import-issue/import.py", line 25, in <module>
mt = hl.import_vcf([f"{prefix_s3}/{vcf_100kB}", f"{prefix_s3}/{vcf_5MB}"], force_bgz=True)
File "</opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/decorator.py:decorator-gen-1140>", line 2, in import_vcf
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/typecheck/check.py", line 561, in wrapper
return __original_func(*args_, **kwargs_)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/methods/impex.py", line 1935, in import_vcf
return MatrixTable(MatrixRead(reader, drop_cols=drop_samples))
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/matrixtable.py", line 558, in __init__
self._type = self._mir.typ
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/base_ir.py", line 158, in typ
self._compute_type()
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/matrix_ir.py", line 40, in _compute_type
self._type = Env.backend().matrix_type(self)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 121, in matrix_type
jir = self._to_java_ir(mir)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 102, in _to_java_ir
ir._jir = ir.parse(r(ir), ir_map=r.jirs)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/base_ir.py", line 163, in parse
return Env.hail().expr.ir.IRParser.parse_matrix_ir(code, ref_map, ir_map)
File "/opt/sanger.ac.uk/hgi/spark-2.4.3-bin-hgi-hadoop2.7.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/utils/java.py", line 240, in deco
'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304
Java stack trace:
java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.json4s.reflect.Executable.invoke(Executable.scala:52)
at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:554)
at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:597)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$ClassInstanceBuilder.mkWithTypeHint(Extraction.scala:587)
at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:596)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$.extract(Extraction.scala:39)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
at org.json4s.jackson.Serialization$.read(Serialization.scala:50)
at org.json4s.Serialization$class.read(Serialization.scala:30)
at org.json4s.jackson.Serialization$.read(Serialization.scala:17)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1117)
at is.hail.expr.ir.IRParser$.matrix_ir(Parser.scala:1053)
at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1253)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1269)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1268)
at is.hail.expr.ir.IRParser.parse_matrix_ir(Parser.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.226.83, executor 21): org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:166)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
at com.amazonaws.services.s3.model.S3ObjectInputStream.abort(S3ObjectInputStream.java:90)
at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:199)
at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
at org.apache.hadoop.io.compress.CompressionInputStream.close(CompressionInputStream.java:63)
at org.apache.hadoop.io.compress.DecompressorStream.close(DecompressorStream.java:206)
at is.hail.utils.package$.using(package.scala:596)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:293)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readLines$extension(RichHadoopConfiguration.scala:300)
at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1190)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1442)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1439)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
at is.hail.io.vcf.MatrixVCFReader.<init>(LoadVCF.scala:1439)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.json4s.reflect.Executable.invoke(Executable.scala:52)
at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:554)
at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:597)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$ClassInstanceBuilder.mkWithTypeHint(Extraction.scala:587)
at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:596)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$.extract(Extraction.scala:39)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
at org.json4s.jackson.Serialization$.read(Serialization.scala:50)
at org.json4s.Serialization$class.read(Serialization.scala:30)
at org.json4s.jackson.Serialization$.read(Serialization.scala:17)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1117)
at is.hail.expr.ir.IRParser$.matrix_ir(Parser.scala:1053)
at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1253)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1269)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1268)
at is.hail.expr.ir.IRParser.parse_matrix_ir(Parser.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:166)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
at java.io.FilterInputStream.close(FilterInputStream.java:180)
at java.io.FilterInputStream.close(FilterInputStream.java:180)
at java.io.FilterInputStream.close(FilterInputStream.java:180)
at java.io.FilterInputStream.close(FilterInputStream.java:180)
at com.amazonaws.services.s3.model.S3ObjectInputStream.abort(S3ObjectInputStream.java:90)
at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:199)
at java.io.FilterInputStream.close(FilterInputStream.java:180)
at org.apache.hadoop.io.compress.CompressionInputStream.close(CompressionInputStream.java:63)
at org.apache.hadoop.io.compress.DecompressorStream.close(DecompressorStream.java:206)
at is.hail.utils.package$.using(package.scala:596)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:293)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readLines$extension(RichHadoopConfiguration.scala:300)
at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1190)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1442)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1439)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
Hail version: 0.2.14-8dcb6722c72a
Error summary: ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 4736144; received: 98304
```
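The Java trace above shows the failure surfacing from `close()`, not from a read of the data itself: `LoadVCF.getHeaderLines` consumes only the `#`-prefixed header lines and then closes the stream, at which point the HTTP client drains the rest of the Content-Length delimited body and discovers the connection was dropped after 98,304 of the expected 4,736,144 bytes. Below is a minimal sketch of that same access pattern outside Hail, assuming `boto3` is available; the bucket and key are hypothetical, since the real `prefix_s3` value never appears in the traceback:

```python
import gzip

import boto3  # assumed available; bucket and key below are hypothetical

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="my-bucket",
    Key="hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz",
)
body = obj["Body"]

# Mirror LoadVCF.getHeaderLines: read '#'-prefixed lines, then stop early.
# bgzip files are valid multi-member gzip, so gzip.open can stream them.
header_lines = []
with gzip.open(body, mode="rt") as fh:
    for line in fh:
        if not line.startswith("#"):
            break
        header_lines.append(line.rstrip("\n"))

# gzip.open does not close a passed-in file object, so close the HTTP
# stream explicitly with most of the body unread. This is the same
# partial-read-then-close pattern the Java trace shows; there, draining
# the remaining Content-Length bytes at close time is what raises
# "Premature end of Content-Length delimited message body".
body.close()
print(len(header_lines), "header lines")
```

If the same truncation reproduces outside Hail, the object store (rather than `import_vcf`) is the component cutting the response short.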
Local stack trace:

```
Traceback (most recent call last):
File "/opt/sanger.ac.uk/hgi/src/hail-import-issue/import.py", line 25, in <module>
mt = hl.import_vcf([f"{prefix_local}/{vcf_100kB}", f"{prefix_local}/{vcf_5MB}"], force_bgz=True)
File "</opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/decorator.py:decorator-gen-1140>", line 2, in import_vcf
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/typecheck/check.py", line 561, in wrapper
return __original_func(*args_, **kwargs_)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/methods/impex.py", line 1935, in import_vcf
return MatrixTable(MatrixRead(reader, drop_cols=drop_samples))
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/matrixtable.py", line 558, in __init__
self._type = self._mir.typ
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/base_ir.py", line 158, in typ
self._compute_type()
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/matrix_ir.py", line 40, in _compute_type
self._type = Env.backend().matrix_type(self)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 121, in matrix_type
jir = self._to_java_ir(mir)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/backend/backend.py", line 102, in _to_java_ir
ir._jir = ir.parse(r(ir), ir_map=r.jirs)
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/ir/base_ir.py", line 163, in parse
return Env.hail().expr.ir.IRParser.parse_matrix_ir(code, ref_map, ir_map)
File "/opt/sanger.ac.uk/hgi/spark-2.4.3-bin-hgi-hadoop2.7.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/sanger.ac.uk/hgi/anaconda3/lib/python3.7/site-packages/hail/utils/java.py", line 240, in deco
'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist
Java stack trace:
java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.json4s.reflect.Executable.invoke(Executable.scala:52)
at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:554)
at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:597)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$ClassInstanceBuilder.mkWithTypeHint(Extraction.scala:587)
at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:596)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$.extract(Extraction.scala:39)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
at org.json4s.jackson.Serialization$.read(Serialization.scala:50)
at org.json4s.Serialization$class.read(Serialization.scala:30)
at org.json4s.jackson.Serialization$.read(Serialization.scala:17)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1117)
at is.hail.expr.ir.IRParser$.matrix_ir(Parser.scala:1053)
at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1253)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1269)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1268)
at is.hail.expr.ir.IRParser.parse_matrix_ir(Parser.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.226.143, executor 14): java.io.FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
at is.hail.utils.richUtils.RichHadoopConfiguration$.is$hail$utils$richUtils$RichHadoopConfiguration$$open$extension(RichHadoopConfiguration.scala:38)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:293)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readLines$extension(RichHadoopConfiguration.scala:300)
at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1190)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1442)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1439)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
at is.hail.io.vcf.MatrixVCFReader.<init>(LoadVCF.scala:1439)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.json4s.reflect.Executable.invoke(Executable.scala:52)
at org.json4s.Extraction$ClassInstanceBuilder.instantiate(Extraction.scala:554)
at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:597)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$ClassInstanceBuilder.mkWithTypeHint(Extraction.scala:587)
at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:596)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:400)
at org.json4s.Extraction$$anonfun$extract$6.apply(Extraction.scala:392)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$.extract(Extraction.scala:39)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
at org.json4s.jackson.Serialization$.read(Serialization.scala:50)
at org.json4s.Serialization$class.read(Serialization.scala:30)
at org.json4s.jackson.Serialization$.read(Serialization.scala:17)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1117)
at is.hail.expr.ir.IRParser$.matrix_ir(Parser.scala:1053)
at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
at is.hail.expr.ir.IRParser$$anonfun$parse_matrix_ir$2.apply(Parser.scala:1269)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1253)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1269)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1268)
at is.hail.expr.ir.IRParser.parse_matrix_ir(Parser.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
java.io.FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
at is.hail.utils.richUtils.RichHadoopConfiguration$.is$hail$utils$richUtils$RichHadoopConfiguration$$open$extension(RichHadoopConfiguration.scala:38)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readFile$extension(RichHadoopConfiguration.scala:293)
at is.hail.utils.richUtils.RichHadoopConfiguration$.readLines$extension(RichHadoopConfiguration.scala:300)
at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1190)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1442)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$13.apply(LoadVCF.scala:1439)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
Hail version: 0.2.14-8dcb6722c72a
Error summary: FileNotFoundException: File file:/opt/sanger.ac.uk/hgi/src/hail-import-issue/chrY_10316945-10544039_all.g.vcf.gz does not exist
```
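The local trace fails for a different reason: the `FileNotFoundException` is raised on an executor (TID 3 ran on 192.168.226.143, executor 14), not on the driver, because `import_vcf` parses the VCF headers inside a Spark job. A `file:` path therefore has to be readable on every worker node, not only on the machine running `import.py`. Here is a small probe, as a sketch; the path is taken from the traceback, and the `parallelize`/`map` check is illustrative:

```python
import os

import hail as hl

hl.init()

path = ("/opt/sanger.ac.uk/hgi/src/hail-import-issue/"
        "chrY_10316945-10544039_all.g.vcf.gz")

# On the driver the file exists...
print("driver:", os.path.exists(path))

# ...but the same check inside a Spark job runs on the executors, where a
# purely local path may be absent. Each of the four tasks reports its own
# node's view of the filesystem.
sc = hl.spark_context()
print("executors:", sc.parallelize(range(4), 4)
                      .map(lambda _: os.path.exists(path))
                      .collect())
```

If the executor probes return `False`, the usual fixes are staging the files on storage mounted at the same path on every node, or pointing `import_vcf` at `hdfs://` or `s3a://` URLs instead of `file:` paths.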