Export_vcf path with .bgz extension creates .bgz directory when using parallel

mwilson · March 12, 2020, 8:06pm

Hello,

I’m currently exporting a MatrixTable to VCF using parallel='header_per_shard' and noticed that the output path must end in .bgz for the shards to be block gzipped. This leads to misleading file paths such as gs://gnomad-mwilson/subsets/test_sharding.bgz/part-0000*.bgz, where the output directory itself carries the .bgz extension. Would it be possible to add a flag that turns on block gzip compression instead of inferring it from the file path? Or some other approach that allows cleaner file paths such as gs://gnomad-mwilson/subsets/test_sharding/part-0000*.bgz? This isn’t a huge issue, but getting clean paths currently requires a gsutil mv, which can be expensive.
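For reference, here is a rough sketch of the export call and of the path cleanup described above. The function names export_sharded and clean_shard_path are hypothetical helpers made up for illustration, not part of Hail, and the shard name part-00000.bgz is only an example of what the part-0000* pattern matches.

```python
import posixpath


def export_sharded(mt, output_dir):
    """Export a MatrixTable as block-gzipped VCF shards.

    Hypothetical wrapper. The output path has to end in .bgz for the
    shards to be block gzipped, so the directory gets the extension too.
    """
    import hail as hl  # imported here so the path helper below works without Hail

    hl.export_vcf(mt, output_dir, parallel='header_per_shard')


def clean_shard_path(shard_path):
    """Map .../name.bgz/part-XXXXX.bgz to .../name/part-XXXXX.bgz.

    Hypothetical helper: only the parent directory loses its .bgz suffix,
    the shard file keeps its own extension.
    """
    directory, shard = posixpath.split(shard_path)
    if directory.endswith('.bgz'):
        directory = directory[:-len('.bgz')]
    return posixpath.join(directory, shard)


# Illustrative shard path following the pattern from this post.
src = 'gs://gnomad-mwilson/subsets/test_sharding.bgz/part-00000.bgz'
dst = clean_shard_path(src)
```

Each (src, dst) pair would then go to gsutil mv. On GCS a move is a copy followed by a delete, which is why doing this for every shard gets expensive.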

Thank you!

tpoterba · March 13, 2020, 2:10pm

This is something we’ve wanted for a while, but it is hard to implement in the current system for technical reasons related to core Spark infrastructure. I think it will be easy for us to make this change in ~3-6 months.

mwilson · March 13, 2020, 2:36pm

Cool, thanks Tim!
