VEP annotation taking forever with nodes sitting idle in DNAnexus

Hi,

Context:
I am currently processing the UKBB 450K WES data in the DNAnexus environment. My end goal is to generate a gene burden matrix table for the entire dataset. Since this can be costly and time-consuming, I have split the data by chromosome and am processing the chromosomes one by one.

Challenge:
I have been trying to annotate one chromosome's Hail matrix table with VEP, but it is taking forever. Looking deeper, I found that the VEP annotation task initially uses all available cores, but later, at the collect step, it uses only one or two. The other nodes sit idle and just add to the computational cost.

Approaches taken:

  • At first I thought the cause was unequal partition sizes, so I tried repartitioning, but that didn’t help.
  • I also tried cutting my VEP JSON schema down to the bare minimum (in case on-the-spot computation of frequencies was taking the time), but that didn’t help either.
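
As a rough sketch of the repartition-then-annotate step described above (not the exact code that was run): the function names, the paths, the config file name and the "4 tasks per core" rule of thumb are all placeholder assumptions, and the 96 cores per node is read off the x96 in the instance name. This variant runs VEP on the row keys only, so no genotype data is shuffled, and writes the result as a separate table:

```python
def min_partitions(n_nodes, cores_per_node, tasks_per_core=4):
    """Partition count that gives every core several tasks.

    tasks_per_core=4 is a rule-of-thumb assumption, not a Hail requirement.
    """
    return n_nodes * cores_per_node * tasks_per_core


def vep_annotate_rows(mt_path, vep_ht_path, vep_config, n_partitions):
    """Run VEP on the variant keys only and write the result as a Hail Table."""
    # Imported lazily: this needs a Hail/Spark cluster with VEP installed.
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    # rows() drops the entries; select() with no arguments keeps only the
    # key fields (locus, alleles), so the shuffle below stays small.
    ht = mt.rows().select()
    ht = ht.repartition(n_partitions, shuffle=True)
    ht = hl.vep(ht, config=vep_config)
    ht.write(vep_ht_path, overwrite=True)
    return vep_ht_path


# Hypothetical call for the cluster described in this post (placeholder paths):
# vep_annotate_rows("chr1.mt", "chr1_vep.ht", "vep_config.json",
#                   min_partitions(7, 96))
```

The written table could then be joined back onto the matrix table by row key in a later step. Whether this avoids the one-or-two-core tail seen at the collect step is exactly what I am unsure about.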

Screenshots showing this behaviour: VEP annotation for chromosomes - Album on Imgur

Cluster specification:
No. of nodes: 7
Type of instance: mem2_ssd1_v2_x96
Cloud provider: DNAnexus on AWS
Size of input data: 382.9 GiB

Data point: Annotating the chromosome took 22 hours and cost 413.6 pounds; the data grew from 382.9 GiB to 2969.2 GiB.
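
To put those numbers in perspective, here is a back-of-the-envelope calculation. It assumes that all 7 nodes were billed at a flat rate for the full 22 hours, which is a simplification; the variable names are just for illustration.

```python
# Figures taken from the cluster specification and data point above.
nodes = 7
hours = 22
total_cost_gbp = 413.6
input_gib = 382.9
output_gib = 2969.2

# Assumption: every node is billed for the whole run, idle or not.
node_hours = nodes * hours
cost_per_node_hour = total_cost_gbp / node_hours
expansion = output_gib / input_gib

print(f"node-hours: {node_hours}")
print(f"cost per node-hour: {cost_per_node_hour:.2f} GBP")
print(f"output/input size ratio: {expansion:.2f}x")
```

Under that assumption the run used 154 node-hours at roughly 2.69 pounds each, and the annotated output is about 7.75 times the size of the input. Any hour in which only one or two cores are busy still costs the full rate for all 7 nodes.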

Questions:

  1. What can I do to speed up my VEP annotation task? How can I get the partitions processed in parallel throughout the job?
  2. If that is not possible, would it be more economical to use a smaller cluster and let it run for 2-3 days?
  3. Are there any other ways of annotating with Hail?

PS: Our VEP cache data sits in the cluster's HDFS.