I am trying to read a gzipped VCF from GCS with hl.import_vcf('gs://…gz', reference_genome='GRCh38').
This encounters an exception:
.gz cannot be loaded in parallel. Is the file actually block gzipped?
If the file is actually block gzipped (even though its extension is .gz),
use the 'force_bgz' argument to treat all .gz file extensions as .bgz.
If you are sure that you want to load a non-block-gzipped file serially
on one core, use the 'force' argument.
The file I am trying to read is a plain VCF file and is easily viewable with zcat.
There are two partially compatible compression formats which, confusingly, both use the .gz file extension: block-gzip (BGZF) and plain gzip. A block-gzip file is a valid gzip file; however, a gzip file is not a valid block-gzip file. Block-gzip files can be read in parallel; normal gzip files cannot.
I suggest you try force_bgz=True.
If the file is block-gzipped, everything will work properly. If it is not block gzipped, you will eventually encounter an error.
I’ll look into the possibility of automatically detecting that a file is block-gzipped so that we need not print this warning.
Thank you for the crisp response.
Setting the force_bgz argument to True worked. Thank you.
I get another error when I try to write to native format. I do this:
vcf = hl.import_vcf('gs://…gz', reference_genome='GRCh38', force_bgz=True)
vcf.write('gs://….mt')
The write step returns the exception
NumberFormatException: For input string: "gnomAD-SV_v2.1_DEL_1_2"
That's a string in the VCF file. What does Hail not like?
What is the full stack trace? The problem is that this is a NumberFormatException: the string appears in a field marked as numeric. You might try running a tool like vcf-validator prior to importing.
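Such mismatches can also be found with a quick pre-import scan that compares each INFO value against the type declared in the header. A sketch, assuming the offending string sits in an INFO field declared Integer or Float (the field names in the example are illustrative):

```python
import re

def find_type_mismatches(vcf_lines):
    """Report INFO values that fail to parse under the header-declared type.

    Returns a list of (line_number, field, value) tuples for values that
    are not valid under a declared Integer or Float type.
    """
    declared = {}  # INFO field ID -> declared Type
    problems = []
    for lineno, line in enumerate(vcf_lines, 1):
        m = re.match(r'##INFO=<ID=([^,]+),[^>]*Type=(\w+)', line)
        if m:
            declared[m.group(1)] = m.group(2)
            continue
        if line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 8:
            continue
        for entry in fields[7].split(';'):
            if '=' not in entry:
                continue  # flag-style INFO entry, nothing to parse
            key, value = entry.split('=', 1)
            vtype = declared.get(key)
            if vtype not in ('Integer', 'Float'):
                continue
            for v in value.split(','):
                try:
                    int(v) if vtype == 'Integer' else float(v)
                except ValueError:
                    problems.append((lineno, key, v))
    return problems
```

Running this over the file before hl.import_vcf pinpoints exactly which declared-numeric field carries the stray string.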
That surprises me. The file is the output of GATK-SV from the Broad.
What is the full error trace?
Could you share the headers of the VCF and at least one line containing the string “gnomAD-SV_v2.1_DEL_1_2”?
The VCF that gave the error was the output of a GATK-SV workflow that I helped test. I sent the error information to the developers at the Broad, and the final VCF format is being updated.
I used a temporary correction script, which removed the error.
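For others hitting the same issue before the upstream fix lands, one possible stopgap (not necessarily the correction used above; the function and the choice of fields are hypothetical) is to drop INFO entries whose value is not numeric in fields declared numeric:

```python
import re

def patch_numeric_info(in_lines, numeric_fields=('END',)):
    """Drop INFO entries with non-numeric values in the given numeric fields.

    A stopgap only, until the upstream VCF is fixed; removed entries are
    lost, so keep the original file around.
    """
    out = []
    for line in in_lines:
        if line.startswith('#'):
            out.append(line)
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 8:
            out.append(line)
            continue
        kept = []
        for entry in fields[7].split(';'):
            key, _, value = entry.partition('=')
            if (key in numeric_fields and value
                    and not re.fullmatch(r'-?\d+(\.\d+)?', value)):
                continue  # drop the offending non-numeric entry
            kept.append(entry)
        fields[7] = ';'.join(kept) or '.'
        out.append('\t'.join(fields))
    return out
```

After patching, the cleaned lines can be written back out and imported with hl.import_vcf as before.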