export_vcf path with .bgz extension creates a .bgz directory when using parallel


I’m currently exporting an MT to VCF using parallel=header_per_shard and noticed that the output path must end in .bgz for the shards to be block gzipped. Because that path becomes the output directory, the result is misleading file paths such as gs://gnomad-mwilson/subsets/test_sharding.bgz/part-0000*.bgz. Would it be possible to add a flag that turns on block gzipping instead of inferring it from the file path? Or some other approach that allows cleaner file paths such as gs://gnomad-mwilson/subsets/test_sharding/part-0000*.bgz? This isn’t a huge issue, but getting cleaner paths currently requires a gsutil mv step, which can be expensive.
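To make the rename concrete, here is a minimal sketch of the path rewrite involved. It uses only the Python standard library; the helper name clean_shard_path and the shard name part-00000.bgz are hypothetical and chosen for illustration. The export call shown in the comment is the one described above.

```python
import posixpath

# The export that produces the ".bgz" directory looks roughly like:
#   hl.export_vcf(mt, "gs://gnomad-mwilson/subsets/test_sharding.bgz",
#                 parallel="header_per_shard")


def clean_shard_path(shard_path: str, ext: str = ".bgz") -> str:
    """Drop `ext` from the shard's parent directory name; keep the file name."""
    directory, filename = posixpath.split(shard_path)
    if directory.endswith(ext):
        directory = directory[: -len(ext)]
    return posixpath.join(directory, filename)


# Hypothetical shard name, for illustration only.
src = "gs://gnomad-mwilson/subsets/test_sharding.bgz/part-00000.bgz"
dst = clean_shard_path(src)
print(src, "->", dst)
```

Each (src, dst) pair would still have to go through gsutil mv, so this only describes the rename; it does not avoid the cost.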

Thank you!

This is something we’ve wanted for a while, but is hard to implement in the current system for technical reasons related to core Spark infrastructure. I think it will be easy for us to make this change in ~3-6 months.


Cool, thanks Tim!