How to speed up export_vcf?

I have a workflow that essentially annotates a large vcf with gnomAD popmax values, filters by a threshold and exports back to vcf. The process so far has been prohibitively slow (about 15 minutes for just chr21, and about 1.5 hours for chr1).

For my latest run, I’m attempting to do all the chromosomes at once. I ran the script late last night, and it’s still running now, roughly 10 hours later.

Can anyone suggest ways to speed up my pipeline? (I’ll include the code in a separate post to avoid the spam filter).

-Thanks in advance

import hail as hl
import os

rg = hl.get_reference('GRCh38')
vcf="/path/to/3225171004313.GTAGAGCA.hard-filtered.vcf.gz"
check=f"/path/to/WGS/{os.path.basename(vcf)[:-2]}mt"
mt=hl.read_matrix_table(check)
ht=hl.read_table("/path/to/GNOMAD/gnomad.genomes.v3.1.1.sites.ht")
ht=ht.filter(~hl.is_missing(ht.popmax.AF)) #remove all popmax=NA
n=mt.annotate_rows(popmax=ht[mt.locus, mt.alleles].popmax.AF) #add popmax values
n=n.filter_rows(n.popmax < 0.01, keep=True) #filter
n=n.repartition(25)

exp_vcf=f"/path/to/WGS/{os.path.basename(vcf).replace('hard', 'gnomad')}"
hl.export_vcf(n, exp_vcf, parallel='separate_header')

I’d recommend removing the repartition. That’s a very expensive operation.

I removed it, but I didn’t notice any improvement in runtime (however, definitely good to know that the partitioning isn’t necessary).

My latest idea is to reduce the size of the gnomad hail table by selecting the popmax column only. Then I’ll use that to annotate and filter my vcfs.

Question: In my script, after I do

ht=ht.select("popmax")

should I checkpoint the resulting ht object? Would that improve my performance at all- versus just using the ht directly to annotate?

(downstream would look something like this:)

n=mt.annotate_rows(popmax=ht[mt.locus, mt.alleles].popmax.AF)
n=n.filter_rows(n.popmax < 0.01, keep=True)
hl.export_vcf(n, exp_vcf)

I think we need a bit more information about these inputs and your system.

  1. What are you running this pipeline on? A single server, or a cluster?
  2. How big is the VCF? How many data partitions does it have in matrixtable form (mt.n_partitions())?
  3. How big is the gnomAD table in bytes and how many partitions?

There’s a known performance issue with executing joins where the partitioning of the two sides differs greatly.

Got my answer. I essentially reduced the size of the gnomAD table down to only the necessary information. Annotation and filtering was done in about 15 minutes (previously took 1.5 hours).

For any future users that look this up: eliminate all unnecessary columns from gnomAD table (ht.select(), ht.checkpoint()), you’ll see a big improvement in runtime.

Thanks for the brainstorming @tpoterba you guys have built a first class package here :ok_hand:

Our compiler should be doing most of this automatically. I think what you’ve done is essentially work around the performance issues with the join that I’ve described, though – glad it works!