Based on the Hail documentation for import_vcf, I changed the file extension from .gz to .bgz using a Unix command. My files now look like this:
```
Test_t5.version9.info.gz
Test_t5.version9.dose.vcf.gz.tbi
Test_t5.version9.dose.vcf.bgz #see the change here
```
Then I tried to read the file:
mt = hl.import_vcf('/home/usr/test/*vcf.bgz') # no error. import done in couple of seconds
However, when I ran mt.count(), I got the error below:
Hail version: 0.2.55-2802af64de39
Error summary: ZipException: File does not conform to block gzip format.
All my source files are in vcf.gz format, each accompanied by vcf.gz.tbi and info.gz files.
To speed up the import_vcf operation, I renamed .gz to .bgz, following the instructions in the docs.
You can’t just change gz to bgz if the file is not actually block gzipped. A “block gzipped” file is just a gzipped file that is generated in a certain way. If you generate a gz file using Tabix, you’ll get one that is “block gzipped”, but it still just ends with gz. Not all gz files are block gzipped though, so you can’t just do a rename. Hail made the choice to use the extension bgz to identify files that are block gzipped, but as far as I know we are the only ones who do that.
@johnc1231 - can the above command block-gzip the file? I haven’t used or heard of tabix; I’m new to this domain. Can gzip work as well? I will learn to use tabix too.
Yeah, the above is what you want. I think John was referring to the fact that sometimes the bgzip tools bundled with tabix produce a default extension of .gz, which is really confusing for tools that require block gzipped files.
@tpoterba @johnc1231 - I found two approaches online that can tell us whether a file is bgzipped.
Approach - 1
Run the command file *vcf.gz
If the output mentions “extra field”, that is an indication that the file is bgzipped. When I ran it on my VCF files, I got the output below:
Test.chr12.vcf.gz: gzip compressed data, extra field
So all my files are already bgzipped, aren’t they?
Moreover, each of my input vcf.gz files was accompanied by .tbi and .info.gz files.
Approach - 2
In addition, I tried another approach, using hexdump, that I found on Biostars.
My files also produced the same output as above, indicating that they are bgzipped.
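For reference, here is a minimal Python sketch of a stricter check than either approach. It does not just look for the gzip "extra field" flag; it looks inside that field for the `BC` subfield that the BGZF format uses to store the block size. The function names `looks_like_bgzf` and `file_looks_like_bgzf` are made up for illustration, and this only inspects the first block of a file, so treat it as a quick sanity check rather than full validation.

```python
import struct

# The 28-byte empty block that BGZF writers append as an end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "42430200" "1b00" "0300"
    "00000000" "00000000"
)


def looks_like_bgzf(header: bytes) -> bool:
    """Return True if `header` starts with a BGZF-style gzip block header."""
    # Fixed gzip fields: magic 1f 8b, CM=8 (deflate), FLG with FEXTRA (0x04) set.
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    # XLEN (little-endian uint16 at offset 10) is the length of the extra field.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    if len(extra) < xlen:
        return False  # not enough bytes supplied to inspect the extra field
    # Scan the subfields for SI1='B' (66), SI2='C' (67), SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == 66 and si2 == 67 and slen == 2:
            return True
        pos += 4 + slen
    return False


def file_looks_like_bgzf(path: str) -> bool:
    """Check only the first block header of the file at `path`."""
    with open(path, "rb") as fh:
        head = fh.read(12)
        if len(head) < 12:
            return False
        (xlen,) = struct.unpack_from("<H", head, 10)
        return looks_like_bgzf(head + fh.read(xlen))
```

Calling `file_looks_like_bgzf` on one of the VCF files returns `True` only when the first block carries the `BC` subfield. A plain gzip file with some other extra field would pass the `file` check but fail this one.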
However, Hail threw an error saying the file doesn’t conform to block gzip format, even though both approaches I tried indicate that it is bgzipped. I also used force_bgz=True to treat gz as bgz, based on the error message thrown by Hail.
Any other reason which could have caused this error?
One unusual thing is that I did not encounter this error when I read the file using import_vcf. It happened only when I executed mt.count().
If it’s a file format issue, shouldn’t it fail during the file import stage?
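Since the error shows up only at mt.count(), one possibility (an assumption, not something confirmed in this thread) is that the first block is fine but a later part of the file is not. The Python sketch below follows the BGZF block layout: it walks the file block by block using the BSIZE value stored in each header and reports the byte offset of the first block that does not parse. The name `first_bad_block_offset` is hypothetical, and the function checks headers only; it does not decompress the payloads.

```python
import io
import struct

# 28-byte empty BGZF block (the standard end-of-file marker), used in the demo.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "42430200" "1b00" "0300"
    "00000000" "00000000"
)


def _find_bsize(extra: bytes):
    """Return BSIZE (total block size minus 1) from the 'BC' subfield, or None."""
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == 66 and si2 == 67 and slen == 2 and pos + 6 <= len(extra):
            (bsize,) = struct.unpack_from("<H", extra, pos + 4)
            return bsize
        pos += 4 + slen
    return None


def first_bad_block_offset(fh):
    """Walk a seekable binary file object block by block.

    Returns the byte offset of the first block whose header is not valid
    BGZF, or None if every block header parses and the file ends cleanly.
    """
    offset = 0
    while True:
        fh.seek(offset)
        head = fh.read(12)
        if not head:
            return None  # reached end of file on a block boundary
        # BGZF blocks start with 1f 8b 08 04 (gzip, deflate, FEXTRA set).
        if len(head) < 12 or head[:4] != b"\x1f\x8b\x08\x04":
            return offset
        (xlen,) = struct.unpack_from("<H", head, 10)
        extra = fh.read(xlen)
        if len(extra) < xlen:
            return offset
        bsize = _find_bsize(extra)
        if bsize is None:
            return offset
        block_len = bsize + 1
        # A block must hold its header, extra field, and 8-byte CRC/size trailer.
        if block_len < 12 + xlen + 8:
            return offset
        # Confirm the last byte of the block exists (detects truncation).
        fh.seek(offset + block_len - 1)
        if len(fh.read(1)) != 1:
            return offset
        offset += block_len


if __name__ == "__main__":
    # In-memory demo: two empty blocks followed by a plain gzip header.
    demo = BGZF_EOF * 2 + b"\x1f\x8b\x08\x00" + b"\x00" * 20
    print(first_bad_block_offset(io.BytesIO(demo)))
```

To run it on a real file, open the file in binary mode (`open(path, "rb")`) and pass the file object in. A result of `None` means every block header looked like BGZF; any other value is the offset where the file stops looking block gzipped.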