I have a workflow that essentially annotates a large VCF with gnomAD popmax values, filters by a threshold, and exports back to VCF. The process so far has been prohibitively slow (about 15 minutes for just chr21, and about 1.5 hours for chr1).
For my latest run, I’m attempting to do all the chromosomes at once. I ran the script late last night, and it’s still running now, roughly 10 hours later.
Can anyone suggest ways to speed up my pipeline? (I’ll include the code in a separate post to avoid the spam filter).
Got my answer. I essentially reduced the gnomAD table down to only the necessary information. Annotation and filtering were done in about 15 minutes (previously about 1.5 hours).
For any future users who look this up: drop every column you don’t need from the gnomAD table with ht.select(), then write the slimmed table out with ht.checkpoint() before annotating. You’ll see a big improvement in runtime.
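In case it helps, here is a rough sketch of what that looks like. It is not my exact script. The function names, the paths, the reference genome, and the popmax_af field name are placeholders, and the popmax[0].AF lookup assumes the gnomAD v2.1-style layout where popmax is an array of structs, so adjust it to your table's schema.

```python
def slim_gnomad_table(gnomad_ht, checkpoint_path):
    """Keep only the popmax allele frequency, then persist the small table."""
    # ASSUMPTION: popmax is an array of structs with an AF field
    # (gnomAD v2.1-style schema). Key fields (locus, alleles) are kept
    # automatically by select().
    slim_ht = gnomad_ht.select(popmax_af=gnomad_ht.popmax[0].AF)
    # checkpoint() writes the table and reads it back, so later joins
    # scan the small file instead of the full gnomAD table.
    return slim_ht.checkpoint(checkpoint_path, overwrite=True)


def annotate_and_filter(mt, slim_ht, max_af):
    """Annotate rows with popmax AF and keep variants below the threshold."""
    mt = mt.annotate_rows(popmax_af=slim_ht[mt.row_key].popmax_af)
    # Note: filter_rows drops rows where the condition is missing, so
    # variants absent from gnomAD are removed by this expression.
    return mt.filter_rows(mt.popmax_af < max_af)


def run_pipeline(vcf_path, gnomad_path, slim_path, out_path, max_af=0.01):
    """Hypothetical end-to-end driver; all paths and max_af are placeholders."""
    import hail as hl  # imported here so the helpers above load without Hail

    gnomad_ht = hl.read_table(gnomad_path)
    slim_ht = slim_gnomad_table(gnomad_ht, slim_path)
    mt = hl.import_vcf(vcf_path, reference_genome="GRCh37")
    mt = annotate_and_filter(mt, slim_ht, max_af)
    hl.export_vcf(mt, out_path)
```

If you want to keep variants that are not in gnomAD at all, change the filter condition to also accept a missing popmax_af (for example with hl.is_missing).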
Thanks for the brainstorming, @tpoterba. You guys have built a first-class package here.
Our compiler should be doing most of this automatically. I think what you’ve done is essentially work around the performance issues with the join that I’ve described, though – glad it works!