Hail export_vcf() extremely slow and stalls

Hello team,

I am running hl.export_vcf() to export a Hail table of approximately 17,000 variants to a compressed vcf.bgz file. The job runs on a Hail Dataproc cluster (configuration below), and the export step takes approximately 30 minutes. However, when I look at my cluster details under ‘Monitoring’, CPU Utilization, Network Bytes, and all the other graphs drop to zero for about the last 25 minutes of the 30-minute export step. The output file in the GCP bucket also stops growing for those last 25 minutes. Any idea why it seems to stall and take so long?

Cluster:
hailctl dataproc start [name] --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --project broad-mpg-gnomad --num-secondary-workers=50 --max-age=5h --requester-pays-allow-all --pkgs="git+https://github.com/broadinstitute/gnomad_methods.git@main"

Hail Init:
hl.init(
    default_reference="GRCh38",
    global_seed=args.hail_rand_seed,
    tmp_dir=TMP_DIR,
    quiet=True,
    spark_conf={
        "spark.hadoop.fs.gs.requester.pays.mode": "AUTO",
        "spark.hadoop.fs.gs.requester.pays.project.id": args.google_cloud_project,
    },
)

Export step:
if args.export_vcf:
    hl.export_vcf(
        ht_final,
        f"{args.final_path}/testing-vcf-export.vcf.bgz",
        append_to_header=args.header_fix_path,
    )
    logger.info("VCF as Zipped VCF.BGZ written to final path")

If you export a single VCF (rather than using one of the parallel modes), there’s a single-threaded concatenation step at the end of the execution that stitches the parallel shards into a single file. That’s probably what you’re seeing.
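For reference, here is a minimal sketch of what a parallel-mode export might look like. It assumes the `parallel` argument of hl.export_vcf with the value "header_per_shard", which writes one VCF file per partition instead of concatenating them serially. The function name export_vcf_shards and its parameters are hypothetical, and the output path is then treated as a directory of shards rather than a single file.

```python
def export_vcf_shards(ht, out_path, header_path=None):
    """Export one VCF shard per partition, skipping the serial concatenation.

    `ht` is a Hail Table or MatrixTable; `out_path` becomes a directory of
    part files. This helper is illustrative, not part of Hail.
    """
    # Imported lazily so the module can be loaded without a Hail install.
    import hail as hl

    hl.export_vcf(
        ht,
        out_path,
        append_to_header=header_path,
        # Each shard carries its own header, so every part file is a valid VCF.
        parallel="header_per_shard",
    )
```

This only helps if downstream tools can consume a sharded VCF. If a single file is required, the concatenation step is unavoidable and the number of partitions is what matters.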

Where does ht_final come from? How many partitions does it have?

Thanks for getting back to me so soon! It’s just a heavily filtered version of the gnomAD v3.1.2 release table, so my ht_final contains 9800 partitions. Is that high number what’s doing it?

Yeah, almost certainly. The concatenation step has to merge together 9800 tiny files, and each one has some latency to open/read/close.
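A rough back-of-the-envelope check of that explanation, using the numbers from this thread. It assumes the entire 25-minute stall is per-file overhead and that the overhead per file is constant; the target of 100 partitions is just an illustrative value.

```python
# Numbers reported in the thread.
stall_seconds = 25 * 60      # time spent with the cluster idle
partitions_before = 9800     # shards the concatenation step has to merge

# Assumption: the stall is entirely per-file open/read/close latency.
per_file_seconds = stall_seconds / partitions_before

# Hypothetical target partition count after coalescing.
partitions_after = 100
projected_seconds = per_file_seconds * partitions_after

print(f"latency per shard: {per_file_seconds:.3f} s")
print(f"projected concatenation time at {partitions_after} shards: {projected_seconds:.1f} s")
```

Under those assumptions each shard costs roughly 0.15 seconds, so the concatenation would shrink from about 25 minutes to well under a minute with 100 shards.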

I ran ht_final.naive_coalesce(100) and reduced my export time from 30 minutes to 3 minutes. Thank you for the pointer!
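For anyone landing here later, a minimal sketch of that fix. It assumes a Hail Table with the n_partitions() and naive_coalesce() methods; the helper names coalesce_for_export and export_single_vcf are hypothetical. Note that naive_coalesce returns a new table rather than modifying the original, so the result has to be the thing that gets exported.

```python
def coalesce_for_export(ht, max_partitions=100):
    """Return `ht` with at most `max_partitions` partitions.

    naive_coalesce merges adjacent partitions without a shuffle, which is
    cheap and enough when the goal is fewer output shards.
    """
    if ht.n_partitions() > max_partitions:
        ht = ht.naive_coalesce(max_partitions)
    return ht


def export_single_vcf(ht, path, header_path=None, max_partitions=100):
    """Coalesce, then export one VCF file (illustrative helper, not Hail API)."""
    # Imported lazily so coalesce_for_export has no hard dependency on Hail.
    import hail as hl

    hl.export_vcf(
        coalesce_for_export(ht, max_partitions),
        path,
        append_to_header=header_path,
    )
```

With the table from this thread, the call would look something like export_single_vcf(ht_final, f"{args.final_path}/testing-vcf-export.vcf.bgz", args.header_fix_path). The right partition count depends on the data size; 100 is simply the value that worked here for about 17,000 variants.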