I am trying to read a gzipped VCF from GCS with hl.import_vcf('gs://…gz', reference_genome='GRCh38').
This raises an exception:
.gz cannot be loaded in parallel. Is the file actually block gzipped?
If the file is actually block gzipped (even though its extension is .gz),
use the 'force_bgz' argument to treat all .gz file extensions as .bgz.
If you are sure that you want to load a non-block-gzipped file serially
on one core, use the 'force' argument.
The file I am trying to read is a plain VCF file and is easily viewable with zcat.
There are two, partially compatible, compression formats which, confusingly, both use the .gz file extension: block-gzip and gzip. A block-gzip file is a valid gzip file; however, a gzip file is not a valid block-gzip file. Block-gzip files can be read in parallel; however, normal gzip files cannot be read in parallel.
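One way to tell the two formats apart is to look at the first few bytes of the file. The sketch below is not part of Hail; the function name is_block_gzipped is made up for illustration. It relies on the BGZF convention that every block is a gzip member whose header carries an extra subfield tagged 'BC'.

```python
import gzip
import struct


def is_block_gzipped(header: bytes) -> bool:
    """Return True if `header` looks like the start of a BGZF block.

    `header` should hold at least the first 18 bytes of the file.
    (Hypothetical helper for illustration, not a Hail function.)
    """
    # Every gzip member starts with the magic bytes 1f 8b and method 8 (deflate).
    if len(header) < 12 or header[0:2] != b"\x1f\x8b" or header[2] != 8:
        return False
    # BGZF requires the FEXTRA flag (bit 2 of the FLG byte).
    if not header[3] & 0x04:
        return False
    # XLEN is a little-endian uint16 at offset 10; the extra field follows it.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for the 'BC' subfield of length 2.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False


# A plain gzip stream produced by Python has no extra field.
plain = gzip.compress(b"##fileformat=VCFv4.2\n")
# The standard 28-byte empty BGZF block that terminates block-gzipped files.
bgzf_eof = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

print(is_block_gzipped(plain))
print(is_block_gzipped(bgzf_eof))
```

If the first bytes of your file pass this check, passing force_bgz=True to hl.import_vcf, as the error message suggests, should be appropriate. If they do not, force=True loads the file serially on one core, or you can recompress the file with bgzip.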
What is the full stack trace? The error is a NumberFormatException: this string appears in a field marked as numeric. You might try running a tool like vcf-validator prior to importing.
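As a rough illustration of what such a check looks for, here is a minimal sketch that scans INFO fields declared as Integer or Float in the header and reports values that do not parse as numbers. It assumes the offending value is in an INFO field (it does not look at FORMAT fields), the function name find_bad_numeric_info is made up, and it is not a substitute for a real validator.

```python
import re

# Matches header lines such as ##INFO=<ID=DP,Number=1,Type=Integer,...>
INFO_HEADER = re.compile(r"##INFO=<ID=([^,]+),Number=[^,]+,Type=(Integer|Float)")


def find_bad_numeric_info(lines):
    """Return (line_number, key, value) for non-numeric values in numeric INFO fields.

    Hypothetical helper for illustration; checks INFO fields only.
    """
    numeric = {}   # INFO key -> parser (int or float)
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##"):
            match = INFO_HEADER.match(line)
            if match:
                numeric[match.group(1)] = int if match.group(2) == "Integer" else float
            continue
        if line.startswith("#") or not line:
            continue  # column header line or blank line
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete record
        for entry in fields[7].split(";"):
            key, sep, value = entry.partition("=")
            if not sep or key not in numeric:
                continue  # flag field or field not declared numeric
            for item in value.split(","):
                if item == ".":
                    continue  # missing value is allowed
                try:
                    numeric[key](item)
                except ValueError:
                    problems.append((lineno, key, item))
    return problems


example = [
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\tDP=12",
    "chr1\t200\t.\tG\tC\t.\tPASS\tDP=abc",
]
print(find_bad_numeric_info(example))
```

For a gzipped file, the lines can be supplied with gzip.open(path, "rt"), which reads both plain gzip and block-gzip files.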
The VCF that gave the error was the output of a GATK-SV workflow that I helped test. I sent the error information to the developers (Broad), and the final VCF format is being updated.
In the meantime, I used a temporary code fix, which removed the error.