File doesn't conform to block zip format


I have already referred to this post.

I have a set of files like the ones shown below


Based on the Hail documentation for import_vcf shown here, I changed the file extension from .gz to .bgz using a Unix command. My files now look like this:

    Test_t5.version9.dose.vcf.bgz     #see the change here

Next, I tried to read the files:

    mt = hl.import_vcf('/home/usr/test/*vcf.bgz')  # no error; import finished in a couple of seconds

However, when I ran mt.count(), I got the error below:

Hail version: 0.2.55-2802af64de39
Error summary: ZipException: File does not conform to block gzip format.

All my source files are in vcf.gz format along with vcf.gz.tbi and info.gz.

To speed up the import_vcf operation, I followed the instructions in the docs and renamed .gz to .bgz.

Did I interpret the instruction incorrectly?

How can I convert my .gz files to .bgz?

Sorry, I found out how to convert .gz to .bgz using the command below:

    gunzip -c file.vcf.gz | bgzip > file.vcf.bgz
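For readers curious what that pipeline does under the hood, here is a minimal pure-Python sketch of the same recompression, assuming the BGZF block layout from the SAM/BAM specification. The names `bgzf_block`, `gz_to_bgz` and `CHUNK` are hypothetical helpers made up for this illustration; in practice the bgzip tool is the thing to use.

```python
import gzip
import struct
import zlib

# Uncompressed bytes per block. Kept below 64 KiB so that a whole block,
# header and trailer included, fits the 16-bit BSIZE field.
CHUNK = 0xFF00


def bgzf_block(data: bytes) -> bytes:
    """Wrap one chunk as a BGZF block: a gzip member with a 'BC' extra field."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no wrapper
    cdata = comp.compress(data) + comp.flush()
    bsize = len(cdata) + 25  # total block length (18 + cdata + 8) minus 1
    if bsize > 0xFFFF:
        raise ValueError("block too large for the BSIZE field")
    header = struct.pack(
        "<BBBBIBBHBBHH",
        0x1F, 0x8B,   # gzip magic
        8,            # CM: deflate
        4,            # FLG: FEXTRA set
        0,            # MTIME
        0,            # XFL
        0xFF,         # OS: unknown
        6,            # XLEN: one 6-byte extra subfield
        0x42, 0x43,   # subfield id 'B', 'C'
        2,            # SLEN
        bsize,        # BSIZE
    )
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + cdata + trailer


def gz_to_bgz(src: str, dst: str) -> None:
    """Decompress a .gz file and rewrite it as a series of BGZF blocks."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(CHUNK)
            if not chunk:
                break
            fout.write(bgzf_block(chunk))
        fout.write(bgzf_block(b""))  # an empty block marks end-of-file
```

Each block is an ordinary gzip member, which is why the output can still be read by regular gzip tools.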

You can’t just change gz to bgz if the file is not actually block gzipped. A “block gzipped” file is just a gzipped file that is generated in a certain way. If you generate a gz file using Tabix, you’ll get one that is “block gzipped”, but it still just ends with gz. Not all gz files are block gzipped though, so you can’t just do a rename. Hail made the choice to use the extension bgz to identify files that are block gzipped, but as far as I know we are the only ones who do that.


@johnc1231 - can the above command be used to block-gzip the files? I am new to this domain and had not heard of tabix before. Can plain gzip work as well? I will learn to use tabix too.

Yeah, the above is what you want. I think John was referring to the fact that sometimes the bgzip tools bundled with tabix produce a default extension of .gz, which is really confusing for tools that require block gzipped files.

@tpoterba @johnc1231 - I found two approaches online for checking whether a file is bgzipped.

Approach - 1

Run the command: file *vcf.gz

If the output mentions "extra field", that indicates the file is bgzipped. When I ran it on my vcf files, I got the output below:

Test.chr12.vcf.gz: gzip compressed data, extra field

So all my files are already bgzipped, aren't they?

Moreover, each of my input vcf.gz files was accompanied by .tbi and .info.gz files as well.

Approach - 2

In addition, I tried another approach using hexdump, taken from Biostars:

$ hexdump -s 0 -n 2 -e '8/1 "%02x""\n"' some_file.gz
$ hexdump -s 3 -n 1 -e '8/1 "%d""\n"' some_file.gz | awk 'and($0,0x04){ print "extra header"; }'
extra header
$ hexdump -s 12 -n 2 -e '8/1 "%c""\n"' some_file.gz

My files produced the same output as above, indicating that they are bgzipped.
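The same header checks can be sketched in Python. This assumes the gzip header layout from RFC 1952 and the 'BC' extra subfield that the SAM/BAM specification defines for BGZF; `looks_like_bgzf` is a hypothetical helper name. Note that it only looks at the first block of the file, and it is a little stricter than the "extra field" hint from the file command, since it also requires the 'BC' subfield.

```python
import struct


def looks_like_bgzf(path: str) -> bool:
    """Return True if the first gzip member carries the BGZF 'BC' subfield."""
    with open(path, "rb") as f:
        head = f.read(12)
        if len(head) < 12:
            return False
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack(
            "<BBBBIBBH", head
        )
        # gzip magic, deflate method, and the FEXTRA flag (bit 2) must be set
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            return False
        extra = f.read(xlen)
    # Walk the extra subfields looking for id 'B','C' with a 2-byte payload
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (0x42, 0x43, 2):
            return True
        pos += 4 + slen
    return False
```

For example, `looks_like_bgzf("some_file.gz")` should return True for a file written by bgzip and False for one written by plain gzip.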

However, Hail threw an error saying the file doesn't conform to block gzip format, even though both approaches indicate that it is bgzipped. Based on the error message thrown by Hail, I also used force_bgz=True to treat .gz as .bgz.

Could anything else have caused this error?

One unusual thing is that I did not encounter this error when I read the file using import_vcf. It happened only when I executed mt.count().

If it's a file format issue, shouldn't the error occur during the file import stage?

Can you help us?

If your files had tabix indices (.tbi) then they’re almost certainly block gzipped. Could you paste the stack trace for the exception?

If it's a file format issue, shouldn't the error occur during the file import stage?
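If you want to check more than the first header, a sketch like the one below walks every block in the file and reports the offset of the first one that does not look like BGZF. It is offered as a diagnostic idea, not as a diagnosis of this particular error. It assumes the fixed 18-byte header that bgzip writes (the specification permits additional extra subfields, which this stricter check would flag), and `first_bad_block` is a hypothetical helper name.

```python
import os
import struct

# id1, id2, cm, flg, xlen, si1, si2, slen as written by bgzip
EXPECTED = (0x1F, 0x8B, 8, 4, 6, 0x42, 0x43, 2)


def first_bad_block(path: str):
    """Return the byte offset of the first non-BGZF block, or None if all pass."""
    size = os.path.getsize(path)
    offset = 0
    with open(path, "rb") as f:
        while offset < size:
            f.seek(offset)
            head = f.read(18)
            if len(head) < 18:
                return offset  # file ends in the middle of a header
            (id1, id2, cm, flg, _mtime, _xfl, _os, xlen,
             si1, si2, slen, bsize) = struct.unpack("<BBBBIBBHBBHH", head)
            if (id1, id2, cm, flg, xlen, si1, si2, slen) != EXPECTED:
                return offset
            block_len = bsize + 1  # BSIZE stores the total block length minus 1
            if block_len < 26 or offset + block_len > size:
                return offset  # impossible length, or block is truncated
            offset += block_len
    return None
```

It only inspects headers and lengths; it does not decompress the data or verify checksums, and an empty file trivially returns None.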

Hail is a lazy execution engine, so the file isn’t actually parsed until you compute a result, like count() or export/write or aggregate.
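As a toy analogy for why the error shows up late, here is a small sketch of lazy evaluation in plain Python. This is not Hail's implementation; `LazyTable` and `import_lazy` are hypothetical names invented for the illustration. The "import" step only remembers the path, and the file is first opened when a result is requested.

```python
import gzip


class LazyTable:
    """Hypothetical stand-in for a lazily evaluated dataset."""

    def __init__(self, path: str):
        self.path = path  # only remembered here; nothing is read yet

    def count(self) -> int:
        # The file is opened and decompressed only now, so a bad file
        # raises here rather than at import time.
        with gzip.open(self.path, "rt") as f:
            return sum(1 for line in f if not line.startswith("#"))


def import_lazy(path: str) -> LazyTable:
    """Return immediately without touching the file contents."""
    return LazyTable(path)
```

With this sketch, calling `import_lazy` on a file that is not valid gzip succeeds, and the failure only appears once `count()` is called.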

Sorry, I am no longer able to reproduce that error, and I don’t see any error message now. I will test further and let you know.

Hi, was your issue resolved?