VEP memoization caching

Our pipeline has a VEP step that takes a long time to run, and we VEP a fairly standard set of variants. We were wondering if it’s possible to memoize these annotations and use those first (i.e. pre-memoize the common ones to a cache, run VEP only on the variants not in the cache, and update the cache accordingly).

Please let us know if this is feasible and recommended. If so, what is the best way to store the cache?

  • HT: It seems like when we update the cache, we’d need to rewrite everything?
  • AnnotationsDB: I’ve heard this mentioned, but it’s not available in v02 (yet)?
  • An append-only file with a variant-to-VEP-struct mapping, read in as a Hail Table?

Thanks!

We at one point had exactly this design: store the VEPed whole-genome SNPs (9B) as a table, join against that, and run VEP on the rest. But we stopped doing that when VEP got faster. Maybe it’s gotten slower again in recent versions?
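
For illustration only, here is a minimal plain-Python sketch of that "look up the cache, annotate only the misses, then update the cache" flow. It is not the code we used: a dict stands in for the cached table, and `annotate_with_cache`, `fake_vep`, and the `(contig, position, ref, alt)` variant key are hypothetical names made up for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical variant key: (contig, position, ref, alt).
Variant = Tuple[str, int, str, str]


def annotate_with_cache(
    variants: Iterable[Variant],
    cache: Dict[Variant, dict],
    annotate: Callable[[List[Variant]], Dict[Variant, dict]],
) -> Dict[Variant, dict]:
    """Return annotations for `variants`, calling `annotate` only on cache misses.

    Assumes `annotate` returns one entry for every variant it is given.
    """
    unique = list(dict.fromkeys(variants))  # de-duplicate, keep input order
    missing = [v for v in unique if v not in cache]
    if missing:
        cache.update(annotate(missing))  # the cache grows in place
    return {v: cache[v] for v in unique}


# Usage with a stand-in annotator that records what it was asked to annotate.
calls: List[List[Variant]] = []


def fake_vep(batch: List[Variant]) -> Dict[Variant, dict]:
    calls.append(list(batch))
    return {v: {"consequence": "placeholder"} for v in batch}


cache = {("1", 100, "A", "T"): {"consequence": "cached"}}
result = annotate_with_cache(
    [("1", 100, "A", "T"), ("2", 200, "G", "C")], cache, fake_vep
)
```

In Hail, the same split would presumably be a join of the variant table against the cached table, with `hl.vep` run only on the rows that did not match.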

Thanks, Tim. We haven’t tried the new version of VEP yet; we’ll check it out and see whether it’s still a problem. Good to know that the table idea is a good candidate solution.