How to speed up export_vcf?

I have a workflow that essentially annotates a large VCF with gnomAD popmax values, filters by an allele-frequency threshold, and exports back to VCF. The process so far has been prohibitively slow (about 15 minutes for just chr21, and about 1.5 hours for chr1).

For my latest run, I’m attempting to do all the chromosomes at once. I ran the script late last night, and it’s still running now, roughly 10 hours later.

Can anyone suggest ways to speed up my pipeline? (I’ll include the code in a separate post to avoid the spam filter).

-Thanks in advance

import hail as hl
import os

rg = hl.get_reference('GRCh38')
# mt is the MatrixTable imported from the input VCF; ht is the gnomAD sites Table
ht = ht.filter(hl.is_defined(ht.popmax.AF))  # remove all popmax=NA rows
n = mt.annotate_rows(popmax=ht[mt.locus, mt.alleles].popmax.AF)  # add popmax values
n = n.filter_rows(n.popmax < 0.01, keep=True)  # keep variants with popmax AF < 1%

exp_vcf = f"/path/to/WGS/{os.path.basename(vcf).replace('hard', 'gnomad')}"
hl.export_vcf(n, exp_vcf, parallel='separate_header')

I’d recommend removing the repartition. That’s a very expensive operation.

I removed it, but I didn’t notice any improvement in runtime (however, definitely good to know that the partitioning isn’t necessary).

My latest idea is to reduce the size of the gnomad hail table by selecting the popmax column only. Then I’ll use that to annotate and filter my vcfs.

Question: In my script, after I do`'popmax')

should I checkpoint the resulting ht object? Would that improve my performance at all, versus just using the ht directly for annotation?

(downstream would look something like this:)

n = mt.annotate_rows(popmax=ht[mt.locus, mt.alleles].popmax.AF)
n = n.filter_rows(n.popmax < 0.01, keep=True)
hl.export_vcf(n, exp_vcf)

I think we need a bit more information about these inputs and your system.

  1. What are you running this pipeline on? A single server, or a cluster?
  2. How big is the VCF? How many data partitions does it have in MatrixTable form (mt.n_partitions())?
  3. How big is the gnomAD table in bytes and how many partitions?

There’s a known performance issue with executing joins where the partitioning of the two sides differs greatly.

Got my answer. I essentially reduced the gnomAD table down to only the necessary information. Annotation and filtering were done in about 15 minutes (previously 1.5 hours).

For any future users who look this up: eliminate all unnecessary fields from the gnomAD table (, then ht.checkpoint()) and you'll see a big improvement in runtime.

Thanks for the brainstorming @tpoterba, you guys have built a first-class package here :ok_hand:

Our compiler should be doing most of this automatically. I think what you’ve done is essentially work around the performance issues with the join that I’ve described, though – glad it works!