Corrupt GZIP trailer

Hi,

I am trying to read some VCF files, but for several of them I get an error about a "Corrupt GZIP trailer". I am reading them all with force_bgz=True. What could be the reason?
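
For context, the import looks roughly like this (the path is a placeholder):

```python
import hail as hl

hl.init()

# The real files end in .vcf.gz, hence force_bgz=True.
mt = hl.import_vcf('data/myfile.vcf.gz', force_bgz=True)
mt = mt.cache()
```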

Can we see the full stack trace?

[Stage 0:=============>                                       (286 + 90) / 1107]
Traceback (most recent call last):

File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/hail/matrixtable.py", line 3345, in cache
return self.persist('MEMORY_ONLY')
File "", line 2, in persist
File "/usr/local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
return original_func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/hail/matrixtable.py", line 3384, in persist
return Env.backend().persist_matrix_table(self, storage_level)
File "/usr/local/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 322, in persist_matrix_table
return MatrixTable._from_java(self._jbackend.pyPersistMatrix(storage_level, self._to_java_matrix_ir(mt._mir)))
File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 42, in deco
'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: ZipException: Corrupt GZIP trailer

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 370 in stage 0.0 failed 1 times, most recent failure: Lost task 370.0 in stage 0.0 (TID 370, localhost, executor driver): java.util.zip.ZipException: Corrupt GZIP trailer
at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
at is.hail.io.compress.BGzipInputStream.decompressNextBlock(BGzipInputStream.java:169)
at is.hail.io.compress.BGzipInputStream.read(BGzipInputStream.java:216)
at java.io.InputStream.read(InputStream.java:101)
at is.hail.relocated.org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:63)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.loadBuffer(GenericLines.scala:74)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.readLine(GenericLines.scala:172)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.hasNext(GenericLines.scala:192)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:66)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:38)
at is.hail.utils.package$.using(package.scala:609)
at is.hail.rvd.RVDPartitionInfo$.apply(RVDPartitionInfo.scala:38)
at is.hail.rvd.RVD$$anonfun$32.apply(RVD.scala:1223)
at is.hail.rvd.RVD$$anonfun$32.apply(RVD.scala:1221)
at is.hail.sparkextras.ContextRDD$$anonfun$crunJobWithIndex$1.apply(ContextRDD.scala:232)
at is.hail.sparkextras.ContextRDD$$anonfun$crunJobWithIndex$1.apply(ContextRDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at is.hail.sparkextras.ContextRDD.crunJobWithIndex(ContextRDD.scala:228)
at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1221)
at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1296)
at is.hail.expr.ir.GenericTableValue.getRVDCoercer(GenericTableValue.scala:161)
at is.hail.expr.ir.GenericTableValue.toTableValue(GenericTableValue.scala:187)
at is.hail.io.vcf.MatrixVCFReader.apply(LoadVCF.scala:1771)
at is.hail.expr.ir.TableRead.execute(TableIR.scala:752)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:28)
at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistMatrix$1.apply(SparkBackend.scala:418)
at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistMatrix$1.apply(SparkBackend.scala:417)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
at is.hail.utils.package$.using(package.scala:609)
at is.hail.annotations.Region$.scoped(Region.scala:18)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:230)
at is.hail.backend.spark.SparkBackend.pyPersistMatrix(SparkBackend.scala:417)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

java.util.zip.ZipException: Corrupt GZIP trailer
at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
at is.hail.io.compress.BGzipInputStream.decompressNextBlock(BGzipInputStream.java:169)
at is.hail.io.compress.BGzipInputStream.read(BGzipInputStream.java:216)
at java.io.InputStream.read(InputStream.java:101)
at is.hail.relocated.org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:63)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.loadBuffer(GenericLines.scala:74)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.readLine(GenericLines.scala:172)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.hasNext(GenericLines.scala:192)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:66)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:38)
at is.hail.utils.package$.using(package.scala:609)
at is.hail.rvd.RVDPartitionInfo$.apply(RVDPartitionInfo.scala:38)
at is.hail.rvd.RVD$$anonfun$32.apply(RVD.scala:1223)
at is.hail.rvd.RVD$$anonfun$32.apply(RVD.scala:1221)
at is.hail.sparkextras.ContextRDD$$anonfun$crunJobWithIndex$1.apply(ContextRDD.scala:232)
at is.hail.sparkextras.ContextRDD$$anonfun$crunJobWithIndex$1.apply(ContextRDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Is this file actually block gzipped, or is it just gzipped?

I don't know, since I only receive these files. But I have about 1,000 files, and only 5 or 6 produce this error, so I don't think they are compressed differently from the others.

The Hail docs mention:
Ensure that the VCF file is correctly prepared for import: VCFs should either be uncompressed (.vcf) or block compressed (.vcf.bgz). If you have a large compressed VCF that ends in .vcf.gz, it is likely that the file is actually block-compressed, and you should rename the file to .vcf.bgz accordingly. If you actually have a standard gzipped file, it is possible to import it to Hail using the force parameter. However, this is not recommended – all parsing will have to take place on one node because gzip decompression is not parallelizable. In this case, import will take significantly longer.
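
To make those three options concrete, the imports would look roughly like this (file names are placeholders):

```python
import hail as hl

# Block-compressed VCF with the recommended extension:
mt = hl.import_vcf('data/cohort.vcf.bgz')

# Block-compressed file that happens to be named .vcf.gz:
mt = hl.import_vcf('data/cohort.vcf.gz', force_bgz=True)

# Truly gzipped file: decompression is not parallelizable, so the
# whole file is parsed on one node. Works, but much slower.
mt = hl.import_vcf('data/cohort.vcf.gz', force=True)
```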

My files are produced by bgzip and indexed with tabix. When I run file myfile, the output includes "extra field", so I suppose they are block compressed. But I still get the error.
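
A more direct check than file is to inspect the gzip header bytes: BGZF files set the FEXTRA flag and carry a "BC" extra subfield. A minimal sketch, with a placeholder path and assuming the "BC" subfield comes first in the extra field (as bgzip writes it):

```python
def looks_like_bgzf(path):
    """Heuristic: does the first gzip member carry the BGZF 'BC' subfield?"""
    with open(path, 'rb') as f:
        header = f.read(18)
    return (len(header) == 18
            and header[:2] == b'\x1f\x8b'   # gzip magic bytes
            and (header[3] & 0x04) != 0     # FLG.FEXTRA is set
            and header[12:14] == b'BC')     # BGZF subfield identifier

print(looks_like_bgzf('myfile.vcf.gz'))  # placeholder path
```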

I haven’t seen this error before. Could you try tabix indexing these problematic files yourself after receiving them, to see if that succeeds? If it doesn’t, then the files are probably corrupted.
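
Another quick sanity check: a complete BGZF file ends with a fixed 28-byte empty-block EOF marker (defined in the SAM spec), so a truncated upload, which could produce exactly this kind of trailer error, is easy to spot. A minimal sketch with a placeholder path:

```python
import os

# The standard 28-byte BGZF end-of-file block from the SAM/BAM spec.
BGZF_EOF = bytes.fromhex(
    '1f8b08040000000000ff0600424302001b0003000000000000000000')

def has_bgzf_eof(path):
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, 'rb') as f:
        f.seek(-len(BGZF_EOF), os.SEEK_END)
        return f.read() == BGZF_EOF

print(has_bgzf_eof('myfile.vcf.gz'))  # placeholder path
```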

How can I catch this exception? And a related question: the code is in Python but uses Java underneath. Is it possible to use Hail's functions from Java or Scala?

You can catch hail.utils.java.FatalError.
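
For example, to find the bad files in a batch (a minimal sketch; the paths are placeholders, and count_rows() is just one way to force evaluation, since import_vcf alone is lazy):

```python
import hail as hl
from hail.utils.java import FatalError

paths = ['data/chr1.vcf.gz', 'data/chr2.vcf.gz']  # placeholder paths

bad = []
for p in paths:
    try:
        mt = hl.import_vcf(p, force_bgz=True)
        mt.count_rows()  # force a read so decompression errors surface
    except FatalError as e:
        bad.append(p)
        print(f'failed on {p}: {e}')
```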

The JVM API of Hail is not a public, stable API. We recommend against using it, though, of course, there is nothing to stop you from doing so.

To close the loop on this issue: the files were indeed corrupted. Once they were re-uploaded to the server, the Hail import worked fine.