Reading from GCS

jrs · January 13, 2022, 4:24pm

I am trying to read a gzipped VCF from GCS with hl.import_vcf('gs://…gz', reference_genome='GRCh38').
This raises an exception:

.gz cannot be loaded in parallel. Is the file actually block gzipped?
If the file is actually block gzipped (even though its extension is .gz),
use the 'force_bgz' argument to treat all .gz file extensions as .bgz.
If you are sure that you want to load a non-block-gzipped file serially
on one core, use the 'force' argument.

The file I am trying to read is a plain VCF file and is easily viewable with zcat.

danking · January 13, 2022, 5:24pm

Hey @jrs!

There are two, partially compatible, compression formats which, confusingly, both use the .gz file extension: block-gzip and gzip. A block-gzip file is a valid gzip file; however, a gzip file is not a valid block-gzip file. Block-gzip files can be read in parallel; however, normal gzip files cannot be read in parallel.

I suggest you try

hl.import_vcf('gs://..../file.vcf.gz', 
              force_bgz=True,
              reference_genome='GRCh38')

If the file is block-gzipped, everything will work properly. If it is not block gzipped, you will eventually encounter an error.

I’ll look into the possibility of automatically detecting that a file is block-gzipped so that we need not print this warning.
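
In the meantime, here is a rough sketch of how one might check a local copy of the file by hand. It relies on the BGZF layout described in the SAM/BAM specification: every block is a gzip member whose extra field carries a "BC" subfield of length 2. The function name looks_like_bgzf is made up for this example and is not part of Hail, and it only inspects the first block, so treat a positive result as a strong hint rather than proof.

```python
import struct


def looks_like_bgzf(header: bytes) -> bool:
    """Return True if `header` starts with what looks like a BGZF block header.

    Hypothetical helper for illustration only; not part of Hail.
    """
    # A BGZF block begins with the gzip magic (1f 8b), the deflate method (08),
    # and a flag byte with the FEXTRA bit (0x04) set.
    if len(header) < 12 or header[0:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False

    # XLEN, the total size of the extra field, is a little-endian uint16
    # stored at offset 10 of the gzip header.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]

    # Walk the extra subfields: SI1, SI2, SLEN (uint16), then SLEN data bytes.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False
```

To use it, read the first kilobyte or so of the file in binary mode (for example, open(path, 'rb').read(1024) on a local copy, where path is whatever your file is called) and pass those bytes in. A file written by plain gzip normally has no extra field at all, so the check returns False for it.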

jrs · January 13, 2022, 6:09pm

Hi @danking,

Thank you for the crisp response.

Setting the force_bgz argument to True worked. Thank you.

I get another error when I try to write to native format. I do this:

vcf = hl.import_vcf()
vcf.write('vcf_mt')

The write step raises the exception:

NumberFormatException: For input string: "gnomAD-SV_v2.1_DEL_1_2"

That’s a string in the vcf file. What does hail not like?

tpoterba · January 13, 2022, 6:10pm

What is the full stack trace? The problem is that this is a NumberFormatException – this string appears in a field marked as numeric. You might try running a tool like vcf-validator prior to importing.
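
If running a validator is not convenient, a quick scan along these lines can narrow things down. This is only a sketch under some assumptions: it looks at INFO fields alone (the offending value could just as well sit in a FORMAT field or another column), it trusts the ##INFO header lines for the declared types, and the function names are made up for this example rather than taken from Hail or any validator.

```python
import re

# Captures the ID and Type of an INFO header line, e.g.
# ##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="...">
INFO_HEADER = re.compile(r"^##INFO=<ID=([^,]+),.*?Type=([A-Za-z]+)")


def is_numeric(value, vcf_type):
    """Return True if `value` parses as the declared type ('.' means missing).

    Python's int()/float() are slightly more lenient than a strict VCF parser.
    """
    if value == ".":
        return True
    try:
        int(value) if vcf_type == "Integer" else float(value)
        return True
    except ValueError:
        return False


def find_bad_info_values(lines):
    """Yield (line_number, key, value) for INFO values that contradict the header."""
    numeric_types = {}  # INFO ID -> "Integer" or "Float"
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##"):
            match = INFO_HEADER.match(line)
            if match and match.group(2) in ("Integer", "Float"):
                numeric_types[match.group(1)] = match.group(2)
            continue
        if line.startswith("#") or not line:
            continue  # column header line or blank line
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        for entry in fields[7].split(";"):
            key, sep, value = entry.partition("=")
            if not sep or key not in numeric_types:
                continue  # flag field, or a field not declared numeric
            for item in value.split(","):
                if not is_numeric(item, numeric_types[key]):
                    yield line_number, key, item
```

Since a block-gzipped file is also a valid gzip file, you can feed it a handle from gzip.open(path, 'rt') on a local copy and print whatever it yields. Each hit gives the line number, the INFO key, and the value that failed to parse.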

jrs · January 13, 2022, 7:08pm

That surprises me. The file is the output of the GATK-SV from Broad.

tpoterba · January 13, 2022, 9:22pm

What is the full error trace?

danking · January 19, 2022, 6:53pm

@jrs ,

Could you share the headers of the VCF and at least one line containing the string “gnomAD-SV_v2.1_DEL_1_2”?

jrs · January 19, 2022, 7:29pm

The VCF that gave the error was the output of a GATK-SV workflow that I helped test. I sent the error information to the developers (Broad), and the final VCF format is being updated.
In the meantime, I used some temporary correction code, which removed the error.

Juerg
