Corrupt GZIP trailer

Hi,

I am trying to read some VCF files, but for several of them I get an error about a "Corrupt GZIP trailer". I am reading them all with force_bgz=True. What could be the reason?
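
For context, the import looks roughly like this (the path is a placeholder):

```python
import hail as hl

hl.init()

# The real files end in .vcf.gz, hence force_bgz=True.
mt = hl.import_vcf('data/myfile.vcf.gz', force_bgz=True)
mt = mt.cache()
```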

Can we see the full stack trace?

[Stage 0:=============>                                       (286 + 90) / 1107]
Traceback (most recent call last):

File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/hail/matrixtable.py", line 3345, in cache
return self.persist('MEMORY_ONLY')
File "", line 2, in persist
File "/usr/local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
return original_func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/hail/matrixtable.py", line 3384, in persist
return Env.backend().persist_matrix_table(self, storage_level)
File "/usr/local/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 322, in persist_matrix_table
return MatrixTable._from_java(self._jbackend.pyPersistMatrix(storage_level, self._to_java_matrix_ir(mt._mir)))
File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 42, in deco
'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: ZipException: Corrupt GZIP trailer

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 370 in stage 0.0 failed 1 times, most recent failure: Lost task 370.0 in stage 0.0 (TID 370, localhost, executor driver): java.util.zip.ZipException: Corrupt GZIP trailer
at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
at is.hail.io.compress.BGzipInputStream.decompressNextBlock(BGzipInputStream.java:169)
at is.hail.io.compress.BGzipInputStream.read(BGzipInputStream.java:216)
at java.io.InputStream.read(InputStream.java:101)
at is.hail.relocated.org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:63)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.loadBuffer(GenericLines.scala:74)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.readLine(GenericLines.scala:172)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.hasNext(GenericLines.scala:192)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:66)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:38)
at is.hail.utils.package$.using(package.scala:609)
at is.hail.rvd.RVDPartitionInfo$.apply(RVDPartitionInfo.scala:38)
at is.hail.rvd.RVD$$anonfun$32.apply(RVD.scala:1223)
at is.hail.rvd.RVD$$anonfun$32.apply(RVD.scala:1221)
at is.hail.sparkextras.ContextRDD$$anonfun$crunJobWithIndex$1.apply(ContextRDD.scala:232)
at is.hail.sparkextras.ContextRDD$$anonfun$crunJobWithIndex$1.apply(ContextRDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at is.hail.sparkextras.ContextRDD.crunJobWithIndex(ContextRDD.scala:228)
at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1221)
at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1296)
at is.hail.expr.ir.GenericTableValue.getRVDCoercer(GenericTableValue.scala:161)
at is.hail.expr.ir.GenericTableValue.toTableValue(GenericTableValue.scala:187)
at is.hail.io.vcf.MatrixVCFReader.apply(LoadVCF.scala:1771)
at is.hail.expr.ir.TableRead.execute(TableIR.scala:752)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:28)
at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistMatrix$1.apply(SparkBackend.scala:418)
at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistMatrix$1.apply(SparkBackend.scala:417)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
at is.hail.utils.package$.using(package.scala:609)
at is.hail.annotations.Region$.scoped(Region.scala:18)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:230)
at is.hail.backend.spark.SparkBackend.pyPersistMatrix(SparkBackend.scala:417)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

java.util.zip.ZipException: Corrupt GZIP trailer
at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:225)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:119)
at is.hail.io.compress.BGzipInputStream.decompressNextBlock(BGzipInputStream.java:169)
at is.hail.io.compress.BGzipInputStream.read(BGzipInputStream.java:216)
at java.io.InputStream.read(InputStream.java:101)
at is.hail.relocated.org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:63)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.loadBuffer(GenericLines.scala:74)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.readLine(GenericLines.scala:172)
at is.hail.expr.ir.GenericLines$$anonfun$3$$anon$3.hasNext(GenericLines.scala:192)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:66)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:38)
at is.hail.utils.package$.using(package.scala:609)
at is.hail.rvd.RVDPartitionInfo$.apply(RVDPartitionInfo.scala:38)
at is.hail.rvd.RVD$$anonfun$32.apply(RVD.scala:1223)
at is.hail.rvd.RVD$$anonfun$32.apply(RVD.scala:1221)
at is.hail.sparkextras.ContextRDD$$anonfun$crunJobWithIndex$1.apply(ContextRDD.scala:232)
at is.hail.sparkextras.ContextRDD$$anonfun$crunJobWithIndex$1.apply(ContextRDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Is this file actually block gzipped, or is it just gzipped?

I don't know, since I only receive these files. But I have about 1,000 files, and only 5 or 6 produce this error, so I don't think they are compressed differently from the others.

The Hail docs mention:
Ensure that the VCF file is correctly prepared for import: VCFs should either be uncompressed (.vcf) or block compressed (.vcf.bgz). If you have a large compressed VCF that ends in .vcf.gz, it is likely that the file is actually block-compressed, and you should rename the file to .vcf.bgz accordingly. If you actually have a standard gzipped file, it is possible to import it to Hail using the force parameter. However, this is not recommended – all parsing will have to take place on one node because gzip decompression is not parallelizable. In this case, import will take significantly longer.
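
To make those three options concrete, the imports would look roughly like this (file names are placeholders):

```python
import hail as hl

# Block-compressed VCF with the recommended extension:
mt = hl.import_vcf('data/cohort.vcf.bgz')

# Block-compressed file that happens to be named .vcf.gz:
mt = hl.import_vcf('data/cohort.vcf.gz', force_bgz=True)

# Truly gzipped file: decompression is not parallelizable, so the
# whole file is parsed on one node. Works, but much slower.
mt = hl.import_vcf('data/cohort.vcf.gz', force=True)
```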

My files are produced by bgzip and indexed with tabix. When I run file myfile, the output includes "extra field", so I suppose they are block compressed. But I still get the error.
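
A more direct check than file is to inspect the gzip header bytes: BGZF files set the FEXTRA flag and carry a "BC" extra subfield. A minimal sketch, with a placeholder path and assuming the "BC" subfield comes first in the extra field (as bgzip writes it):

```python
def looks_like_bgzf(path):
    """Heuristic: does the first gzip member carry the BGZF 'BC' subfield?"""
    with open(path, 'rb') as f:
        header = f.read(18)
    return (len(header) == 18
            and header[:2] == b'\x1f\x8b'   # gzip magic bytes
            and (header[3] & 0x04) != 0     # FLG.FEXTRA is set
            and header[12:14] == b'BC')     # BGZF subfield identifier

print(looks_like_bgzf('myfile.vcf.gz'))  # placeholder path
```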

I haven’t seen this error before. Could you try tabix indexing these problematic files yourself after receiving them, to see if that succeeds? If it doesn’t, then the files are probably corrupted.
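
Another quick sanity check: a complete BGZF file ends with a fixed 28-byte empty-block EOF marker (defined in the SAM spec), so a truncated upload, which could produce exactly this kind of trailer error, is easy to spot. A minimal sketch with a placeholder path:

```python
import os

# The standard 28-byte BGZF end-of-file block from the SAM/BAM spec.
BGZF_EOF = bytes.fromhex(
    '1f8b08040000000000ff0600424302001b0003000000000000000000')

def has_bgzf_eof(path):
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, 'rb') as f:
        f.seek(-len(BGZF_EOF), os.SEEK_END)
        return f.read() == BGZF_EOF

print(has_bgzf_eof('myfile.vcf.gz'))  # placeholder path
```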

How can I catch this exception? And a related question: the code is in Python but uses Java underneath. Is it possible to use Hail's functions from Java or Scala?

You can catch hail.utils.java.FatalError.
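
For example, to find the bad files in a batch (a minimal sketch; the paths are placeholders, and count_rows() is just one way to force evaluation, since import_vcf alone is lazy):

```python
import hail as hl
from hail.utils.java import FatalError

paths = ['data/chr1.vcf.gz', 'data/chr2.vcf.gz']  # placeholder paths

bad = []
for p in paths:
    try:
        mt = hl.import_vcf(p, force_bgz=True)
        mt.count_rows()  # force a read so decompression errors surface
    except FatalError as e:
        bad.append(p)
        print(f'failed on {p}: {e}')
```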

The JVM API of Hail is not a public, stable API. We recommend against using it, though, of course, there is nothing to stop you from doing so.

To close the loop on this issue: the files were indeed corrupted. Once they were re-uploaded to the server, the Hail import worked fine.