import_vcf() reports an error

Hi,

I used import_vcf() with the header_file parameter and force=True to import over 20,000 vcf.gz files into one big MatrixTable on Google Dataproc.
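
For reference, this is the call I ran (the import line is added here for completeness; the full paths appear in the stack trace below):

import hail as hl

hl.import_vcf('gs://path/WGS*.vcf.gz',
              header_file='gs://path/WGS_header.txt',
              force=True).write('gs://path/shuang/step1/wgs.mt', overwrite=True)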

I got this error:

Hail version: 0.2.96-39909e0a396f
Error summary: IOException: Gzip-compressed data is corrupt.

Could a GCP or Hail setting cause this? Or is it only because my input data is corrupt, in which case there is basically nothing I can do about it?

Any help would be greatly appreciated, Shuang.

What’s the full stack trace?

You can verify gzip file integrity with gzip -t.
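With this many shards, a short loop can flag the bad ones. A rough sketch, assuming the files have been copied locally (e.g. with gsutil); it mirrors gzip -t by decompressing each file to EOF:

import glob
import gzip

for path in glob.glob('WGS_*.vcf.gz'):  # placeholder pattern
    try:
        with gzip.open(path, 'rb') as f:
            while f.read(1 << 20):  # decompress to EOF; corrupt data raises
                pass
    except (OSError, EOFError) as e:
        print(f'{path}: {e}')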

Hi @danking, thanks a lot for your reply, and here is my full stack trace:

Traceback (most recent call last):
  File "/tmp/c2f6a53decda46f7953b327d4fbc1b39/step1_07102022.py", line 4, in <module>
    hl.import_vcf('gs://path/WGS*.vcf.gz', header_file='gs://path/WGS_header.txt', force=True).write('gs://path/shuang/step1/wgs.mt', overwrite=True)
  File "<decorator-gen-1162>", line 2, in write
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/matrixtable.py", line 2558, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 104, in execute
    self._handle_fatal_error_from_backend(e, ir)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
    raise err
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: IOException: Gzip-compressed data is corrupt

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 615 in stage 0.0 failed 20 times, most recent failure: Lost task 615.19 in stage 0.0 (TID 18002) (hail-07102022-sw-v858.internal executor 3): java.io.IOException: Gzip-compressed data is corrupt
	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:316)
	at java.io.InputStream.read(InputStream.java:101)
	at is.hail.relocated.org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:79)
	at is.hail.expr.ir.GenericLines$$anon$1.loadBuffer(GenericLines.scala:72)
	at is.hail.expr.ir.GenericLines$$anon$1.readLine(GenericLines.scala:182)
	at is.hail.expr.ir.GenericLines$$anon$1.hasNext(GenericLines.scala:202)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at __C15collect_distributed_array.__m100split_StreamFor_region6_21(Unknown Source)
	at __C15collect_distributed_array.__m100split_StreamFor(Unknown Source)
	at __C15collect_distributed_array.__m98begin_group_0(Unknown Source)
	at __C15collect_distributed_array.__m26split_Let(Unknown Source)
	at __C15collect_distributed_array.apply(Unknown Source)
	at __C15collect_distributed_array.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$4(BackendUtils.scala:40)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$3(BackendUtils.scala:39)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:764)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2204)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2225)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2244)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2269)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:321)
	at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:37)
	at __C3Compiled.apply(Emit.scala)
	at is.hail.expr.ir.LoweredTableReader$.makeCoercer(TableIR.scala:318)
	at is.hail.expr.ir.GenericTableValue.getLTVCoercer(GenericTableValue.scala:134)
	at is.hail.expr.ir.GenericTableValue.toTableStage(GenericTableValue.scala:159)
	at is.hail.io.vcf.MatrixVCFReader.lower(LoadVCF.scala:1798)
	at is.hail.expr.ir.lowering.LowerTableIR$.applyTable(LowerTableIR.scala:713)
	at is.hail.expr.ir.lowering.LowerTableIR$.lower$1(LowerTableIR.scala:465)
	at is.hail.expr.ir.lowering.LowerTableIR$.apply(LowerTableIR.scala:679)
	at is.hail.expr.ir.lowering.LowerToCDA$.lower(LowerToCDA.scala:73)
	at is.hail.expr.ir.lowering.LowerToCDA$.apply(LowerToCDA.scala:18)
	at is.hail.expr.ir.lowering.LowerToDistributedArrayPass.transform(LoweringPass.scala:77)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:27)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:416)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:452)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:310)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:449)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:448)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java.io.IOException: Gzip-compressed data is corrupt
	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:316)
	at java.io.InputStream.read(InputStream.java:101)
	at is.hail.relocated.org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:79)
	at is.hail.expr.ir.GenericLines$$anon$1.loadBuffer(GenericLines.scala:72)
	at is.hail.expr.ir.GenericLines$$anon$1.readLine(GenericLines.scala:182)
	at is.hail.expr.ir.GenericLines$$anon$1.hasNext(GenericLines.scala:202)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at __C15collect_distributed_array.__m100split_StreamFor_region6_21(Unknown Source)
	at __C15collect_distributed_array.__m100split_StreamFor(Unknown Source)
	at __C15collect_distributed_array.__m98begin_group_0(Unknown Source)
	at __C15collect_distributed_array.__m26split_Let(Unknown Source)
	at __C15collect_distributed_array.apply(Unknown Source)
	at __C15collect_distributed_array.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$4(BackendUtils.scala:40)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$3(BackendUtils.scala:39)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:764)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.96-39909e0a396f
Error summary: IOException: Gzip-compressed data is corrupt

I checked my input shards and noticed that around 50 of the vcf.gz shards contain only meta-information lines and a header line; there are no data lines at all. Each is basically just a header file. Could that cause Hail to report the “IOException: Gzip-compressed data is corrupt” error?
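
This is roughly how I found them (a minimal sketch; the path pattern is a placeholder):

import glob
import gzip

for path in glob.glob('WGS_*.vcf.gz'):
    with gzip.open(path, 'rt') as f:
        # A shard with no line lacking a leading '#' has no variant rows.
        if not any(not line.startswith('#') for line in f):
            print(f'{path}: header only, no data lines')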

No, this error will only happen if the file is not GZIP compressed. Are you absolutely certain every file is a valid GZIP file? I strongly recommend verifying that for every file with gzip -t.

Hi @danking , thanks a lot for your reply.

I tried gzip -t, and one shard is broken; now I am trying to find a tool to repair it:

gzip: WGS_10554.vcf.gz: invalid compressed data--crc error
gzip: WGS_10554.vcf.gz: invalid compressed data--length error
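
One option I am considering is to salvage the readable prefix with a short script. A rough sketch, not a proper repair: anything after the corrupt block is lost, and a truncated final line is dropped.

import gzip

recovered = []
try:
    with gzip.open('WGS_10554.vcf.gz', 'rt') as f:
        for line in f:
            recovered.append(line)
except (OSError, EOFError):
    pass  # reading stops at the corrupt block

# Drop a possibly incomplete final line before re-writing.
if recovered and not recovered[-1].endswith('\n'):
    recovered.pop()

with gzip.open('WGS_10554.recovered.vcf.gz', 'wt') as out:
    out.writelines(recovered)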

Since this shard affects only one chromosome, I have already successfully imported the other chromosomes’ shards into one MT and exported them to vcf.bgz files.

Thanks a lot!
