With help from this forum, I was able to figure out how to add gnomAD MAFs to my variant calls from a recent project - YAY!
The process runs smoothly, but writing the result table with my variant calls and the gnomAD MAFs takes a LONG time (maybe 20-30 minutes).
I got the rather brilliant idea that I could speed things up by restricting the gnomAD database using the bed file we use in our analysis pipeline.
##read in the hail table
gnomAD_raw = hl.read_table('/Users/smcnulty/Desktop/COMPARATIVE_DBs/GNOMAD/gnomad.exomes.r2.1.1.sites.ht')
##read in the bed file
bed = hl.import_bed('/Users/smcnulty/Desktop/COMPARATIVE_DBs/GNOMAD/VariantPlex_Myeloid_TargetROI.withGene.noChr.bed', reference_genome='GRCh37')
##keep only those variants that overlap with my bed file
gnomAD_filtered = gnomAD_raw.filter(hl.is_defined(bed[gnomAD_raw.locus]))
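(For context, one thing I considered but haven't tried: persisting the filtered table so the bed-file join is evaluated once instead of being recomputed by every downstream action. A minimal sketch, assuming Hail's `Table.checkpoint` and a made-up output path:)

```python
# Hypothetical: write the filtered table to disk and read it back,
# so the interval filter is computed once rather than re-run lazily
# by each later step. The output path here is an assumption.
gnomAD_filtered = gnomAD_filtered.checkpoint(
    '/Users/smcnulty/Desktop/COMPARATIVE_DBs/GNOMAD/gnomad.filtered.ht',
    overwrite=True)
```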
Nothing else in my script was changed, with the exception that I’m intersecting my variant matrix table with gnomAD_filtered instead of gnomAD_raw. However, with this change, the process runs out of memory and dies!
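(The intersection step I mean is roughly the following, a sketch assuming my variant calls are in a matrix table `mt` keyed by (locus, alleles), like the gnomAD sites table; the field name `gnomad` is just illustrative:)

```python
# Sketch: annotate each variant row with the matching gnomAD record,
# by joining on the shared (locus, alleles) key. Rows with no match
# get a missing annotation.
mt = mt.annotate_rows(gnomad=gnomAD_filtered[mt.locus, mt.alleles])
```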
I have absolutely no idea:
- Why the intersection of the two datasets (gnomAD and my variant calls) seems so fast, but writing the stupid result table is sooo slow.
- Why everything runs to completion when I'm using the full gnomAD dataset, but dies when I'm using a relatively small subset.
Thanks in advance for any hints/advice.