Reading from GCS

I am trying to read a gzipped VCF from GCS with hl.import_vcf('gs://…gz', reference_genome='GRCh38').
This encounters an exception:

.gz cannot be loaded in parallel. Is the file actually block gzipped?
If the file is actually block gzipped (even though its extension is .gz),
use the 'force_bgz' argument to treat all .gz file extensions as .bgz.
If you are sure that you want to load a non-block-gzipped file serially
on one core, use the 'force' argument.

The file I am trying to read is a simple VCF file and is easily viewable with zcat.

Hey @jrs!

Two partially compatible compression formats confusingly share the .gz file extension: block-gzip and gzip. A block-gzip file is a valid gzip file, but a gzip file is not a valid block-gzip file. Block-gzip files can be read in parallel; ordinary gzip files cannot.

I suggest you try

hl.import_vcf('gs://..../file.vcf.gz', 
              force_bgz=True,
              reference_genome='GRCh38')

If the file is block-gzipped, everything will work properly. If it is not block-gzipped, you will eventually encounter an error.
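If you would rather check before importing, here is a minimal sketch in plain Python (this is not part of Hail, and the function name looks_like_bgzf is made up for illustration). It relies on the BGZF convention that each block starts with a gzip header whose extra field carries a "BC" subfield of length 2; a plain gzip file normally has no such subfield.

```python
import gzip
import struct


def looks_like_bgzf(header: bytes) -> bool:
    """Return True if these leading bytes look like a BGZF (block-gzip) header."""
    # A gzip header is 10 fixed bytes; a 2-byte XLEN follows when FEXTRA is set.
    if len(header) < 12:
        return False
    if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
        return False  # not a gzip/deflate stream at all
    if not header[3] & 0x04:
        return False  # FEXTRA flag not set, so there is no extra field
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields: SI1, SI2, 2-byte length, then the payload.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


# In-memory demonstration: an ordinary gzip stream versus a hand-built BGZF header.
plain = gzip.compress(b"##fileformat=VCFv4.2\n")
bgzf_header = bytes.fromhex("1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00")
print(looks_like_bgzf(plain), looks_like_bgzf(bgzf_header))
```

To use it on a real file, pass in the first few dozen bytes of the object, read with whatever tool you normally use to access the bucket. It only inspects the first block, so treat a positive result as a strong hint and not a guarantee.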

I’ll look into the possibility of automatically detecting that a file is block-gzipped so that we need not print this warning.


Hi @danking,

Thank you for the crisp response.

Setting the force_bgz argument to True worked. Thank you.

I get another error when I try to write to Hail's native format. This is what I run:

vcf = hl.import_vcf()
vcf.write('vcf_mt')

The write step returns the exception

NumberFormatException: For input string: "gnomAD-SV_v2.1_DEL_1_2"

That’s a string in the vcf file. What does hail not like?

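What is the full stack trace? A NumberFormatException means that Hail tried to parse this string as a number, so the string appears in a field that is marked as numeric. The error only shows up at the write step because import_vcf is lazy: the file is not actually parsed until the write runs. You might try running a tool like vcf-validator prior to importing.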
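As a rough way to locate the offending field, here is a small sketch in plain Python (not a Hail API and not a substitute for vcf-validator). It collects the INFO fields that the header declares as Integer or Float and reports any value that does not parse as that type. The function name and the sample records are made up for illustration, it assumes header entries are written in the usual ID, Number, Type order, and it does not check FORMAT fields or QUAL.

```python
import re

# Matches header lines such as ##INFO=<ID=DP,Number=1,Type=Integer,...>
INFO_DECL = re.compile(r"##INFO=<ID=([^,]+),Number=[^,]+,Type=([^,>]+)")


def find_non_numeric_info(lines):
    """Yield (line_number, key, value) for INFO values that fail to parse as declared."""
    numeric = {}  # INFO key -> int or float, taken from the header
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        decl = INFO_DECL.match(line)
        if decl and decl.group(2) in ("Integer", "Float"):
            numeric[decl.group(1)] = int if decl.group(2) == "Integer" else float
        if not line or line.startswith("#"):
            continue  # header or blank line, nothing to check
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # malformed record, no INFO column
        for entry in fields[7].split(";"):
            key, sep, value = entry.partition("=")
            if not sep or key not in numeric:
                continue  # flag without a value, or not declared numeric
            for item in value.split(","):
                if item == ".":
                    continue  # missing value
                try:
                    numeric[key](item)
                except ValueError:
                    yield n, key, item


# Hypothetical records; the real offending field in the file may well be a different one.
sample = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="example">',
    '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="example">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\ta\tN\t<DEL>\t.\tPASS\tSVLEN=500;SVTYPE=DEL",
    "chr1\t200\tb\tN\t<DEL>\t.\tPASS\tSVLEN=gnomAD-SV_v2.1_DEL_1_2;SVTYPE=DEL",
]
problems = list(find_non_numeric_info(sample))
print(problems)
```

For a real file you could feed it the lines from gzip.open(path, "rt"), which reads both gzip and block-gzip files serially.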

That surprises me. The file is the output of the GATK-SV from Broad.

What is the full error trace?