Hail export_vcf() extremely slow and stalls

darn_matren · February 2, 2023, 8:53pm

Hello team,

I am running hl.export_vcf() to export a Hail table of approximately 17,000 variants to a bgzipped vcf.bgz file. I am running the job on a Hail Dataproc cluster (configuration below), and the export step takes approximately 30 minutes. However, when I look at my cluster details under ‘Monitoring’, CPU Utilization, Network Bytes, and all the other graphs drop to zero for about the last 25 minutes of the 30-minute export step. The size of the output file in the GCP bucket also stops growing for those last 25 minutes. Any idea why it seems to be stalling out and taking so long? Let me know if you have any ideas!

Cluster:
hailctl dataproc start [name] --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --project broad-mpg-gnomad --num-secondary-workers=50 --max-age=5h --requester-pays-allow-all --pkgs="git+https://github.com/broadinstitute/gnomad_methods.git@main"

Hail Init:
hl.init(
    default_reference="GRCh38",
    global_seed=args.hail_rand_seed,
    tmp_dir=TMP_DIR,
    quiet=True,
    spark_conf={
        "spark.hadoop.fs.gs.requester.pays.mode": "AUTO",
        "spark.hadoop.fs.gs.requester.pays.project.id": args.google_cloud_project,
    },
)

Export step:
if args.export_vcf:
    hl.export_vcf(
        ht_final,
        f"{args.final_path}/testing-vcf-export.vcf.bgz",
        append_to_header=args.header_fix_path,
    )
    logger.info("VCF as Zipped VCF.BGZ written to final path")

tpoterba · February 2, 2023, 8:56pm

If you export a single VCF (rather than using one of the parallel modes), there’s a single-threaded concatenation step at the end of the execution that stitches the parallel shards into a single file. That’s probably what you’re seeing.

Where does ht_final come from? How many partitions does it have?
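
For reference, a minimal sketch of what one of the parallel modes might look like. The helper name export_vcf_shards and its arguments are hypothetical; the only Hail call used is hl.export_vcf with its parallel parameter set to "header_per_shard". In that mode the output path becomes a directory of per-partition VCF files, each with its own header, so there is no final concatenation step.

```python
def export_vcf_shards(ht, output_path, header_path=None):
    """Write one VCF shard per partition instead of a single merged file.

    `ht` is assumed to be a Hail Table or MatrixTable that is already
    suitable for VCF export. The helper name and arguments are illustrative.
    """
    # Imported lazily so this helper can be defined without a running cluster.
    import hail as hl

    hl.export_vcf(
        ht,
        output_path,                  # becomes a directory of part files
        append_to_header=header_path,
        parallel="header_per_shard",  # skip the serial concatenation step
    )
```

This only helps if downstream tools can consume a directory of shards; if a single file is required, the concatenation step still has to happen somewhere.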

darn_matren · February 2, 2023, 9:05pm

Thanks for getting back to me so soon! It’s just a heavily filtered version of the gnomAD v3.1.2 release table, so my ht_final contains 9800 partitions. Is that high number what’s doing it?

tpoterba · February 2, 2023, 9:21pm

Yeah, almost certainly. The concatenation step has to merge together 9800 tiny files, and each one has some latency to open/read/close.
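
A rough back-of-the-envelope check of that explanation, using only the numbers reported in this thread (about 25 minutes of stall, 9800 partitions). It assumes the stall is entirely the concatenation step and that cost scales linearly with the number of shards; both are simplifications, and the function names are made up for illustration.

```python
def per_shard_latency_seconds(stall_minutes, n_shards):
    """Average time spent per shard if the whole stall is concatenation."""
    return stall_minutes * 60 / n_shards


def estimated_concat_seconds(n_shards, latency_seconds):
    """Estimated concatenation time, assuming a fixed cost per shard."""
    return n_shards * latency_seconds


latency = per_shard_latency_seconds(25, 9800)
print(f"{latency:.3f} s per shard")  # roughly 0.15 s per shard
print(f"{estimated_concat_seconds(100, latency):.1f} s for 100 shards")
```

A per-file cost on the order of a tenth of a second is plausible for opening, reading, and closing a small object in a cloud bucket, which is consistent with the stall being dominated by the number of shards rather than the amount of data.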

darn_matren · February 2, 2023, 9:35pm

I ran ht_final.naive_coalesce(100) and I reduced my export time down from 30 minutes to 3 minutes. Thank you for the pointer!
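
A minimal sketch of that fix, for anyone landing here later. The helper name coalesce_for_export and the default of 100 are illustrative; it relies only on Table.n_partitions() and Table.naive_coalesce(). Note that naive_coalesce returns a new table rather than modifying ht_final in place, so the result has to be the thing that gets exported.

```python
def coalesce_for_export(ht, max_partitions=100):
    """Reduce the partition count before a single-file export.

    `ht` is assumed to be a Hail Table (or MatrixTable). A heavily filtered
    table keeps the partitioning of its source, so it can end up with
    thousands of nearly empty partitions.
    """
    if ht.n_partitions() > max_partitions:
        # naive_coalesce merges adjacent partitions without a shuffle
        # and returns a new table; the original is left unchanged.
        ht = ht.naive_coalesce(max_partitions)
    return ht
```

The returned table can then be passed to hl.export_vcf() in place of ht_final, with the same output path and append_to_header arguments as in the original export step.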
