Speeding up gnomAD annotation

With help from this forum, I was able to figure out how to add gnomAD MAFs to my variant calls from a recent project - YAY!

The process runs smoothly, but writing the result table with my variant calls and the gnomAD MAFs takes a LONG time (maybe 20-30 minutes).

I got the rather brilliant idea that I could speed things up by restricting the gnomAD database using the bed file we use in our analysis pipeline.
import hail as hl

##read in the hail table
gnomAD_raw = hl.read_table('/Users/smcnulty/Desktop/COMPARATIVE_DBs/GNOMAD/gnomad.exomes.r2.1.1.sites.ht')
##read in the bed file
bed = hl.import_bed('/Users/smcnulty/Desktop/COMPARATIVE_DBs/GNOMAD/VariantPlex_Myeloid_TargetROI.withGene.noChr.bed', reference_genome='GRCh37')
##keep only those variants that overlap with my bed file
gnomAD_filtered = gnomAD_raw.filter(hl.is_defined(bed[gnomAD_raw.locus]))

Nothing else in my script changed, except that I'm now intersecting my variant MatrixTable with gnomAD_filtered instead of gnomAD_raw. With this change, though, the process runs out of memory and dies!
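For reference, the downstream annotation boils down to something like the sketch below; the MatrixTable name mt, the field name gnomad_AF, the freq[0].AF lookup, and the output path are simplified placeholders rather than my exact script.

##annotate my variant MatrixTable with the matching gnomAD record (sketch, placeholder names)
mt = mt.annotate_rows(gnomad = gnomAD_filtered[mt.locus, mt.alleles])
##pull out an allele frequency field for the result table
mt = mt.annotate_rows(gnomad_AF = mt.gnomad.freq[0].AF)
##none of the above runs until the result is actually written out
mt.rows().select('gnomad_AF').export('variants_with_gnomad_AF.tsv')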

I have absolutely no idea why:
- the intersection of the two datasets (gnomAD and my variant calls) seems so fast, yet writing the result table is so slow, or
- the whole thing runs to completion when I use the full gnomAD dataset but dies when I use a relatively small subset of it.

Thanks in advance for any hints/advice. :slight_smile:

Thought it might be helpful to see the exact message:

Hi @smcnulty,

I’m sorry you’re having trouble with Hail!

Regarding speed, Hail is lazy, so it only executes your pipeline when you observe the output, for example with write, show, or collect. All of the dataset annotation work is actually done at the write step.
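As a tiny illustration (the file paths and the freq field below are placeholders), every call except the last just builds up the pipeline; only the write actually executes it:

import hail as hl

ht = hl.read_table('gnomad_table.ht')     # lazy: nothing is read yet
ht = ht.filter(ht.locus.contig == '1')    # lazy: just records the filter
ht = ht.annotate(af = ht.freq[0].AF)      # lazy: just records the annotation
ht.write('annotated.ht')                  # the whole pipeline runs here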

Hail already reads as little of the gnomAD data as possible, so filtering against another table won't improve on what Hail is already doing. I'm not sure why that change causes you to run out of memory; hopefully someone from the compiler team can comment on that.

Regarding running this faster, what is your compute environment? It appears that you’re using a laptop. Have you tried setting PYSPARK_SUBMIT_ARGS to use all the memory on your laptop?
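For example, something like the following (the 8g figure is a placeholder; use however much memory your machine can spare). The variable has to be set before Hail/Spark starts up:

import os

# must be set before hail is imported, since Spark reads it at startup;
# 8g is a placeholder value, not a recommendation
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 8g pyspark-shell'

import hail as hl
hl.init()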

Thanks for the reply! The timing makes sense if everything is actually processed at the write step. I'd run a few show commands to make sure everything looked correct, so I assumed all the matching had already been done before that point.

Yes, I'm just running this on my laptop. I'll look into the setting you recommended and see if it helps. I mainly wanted to make sure I wasn't doing something totally wrong at the write step that was making it inefficient.