I am currently processing the UKBB 450K WES data in the DNAnexus environment. My end goal is to generate a gene burden matrix table for the entire dataset. Since this can be costly and time-consuming, I have split the data by chromosome and am processing the chromosomes one by one.
I have been trying to annotate the Hail matrix table for one of the chromosomes with VEP, but it is taking forever. Looking deeper into the issue, I found that the VEP annotation task initially uses all the available cores, but later on, at the collect step, it uses only one or two cores, so the other nodes sit idle and just increase the computational cost.
- At first I thought it was because the partitions were unequal, so I tried repartitioning, but that didn’t work.
- I also tried cutting my VEP JSON schema down to the bare minimum (in the hope that on-the-spot computation of frequencies was what took the time), but that didn’t work either.
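As a rough sketch, the repartition-then-annotate attempt described above might look like the following. The function names, the paths passed in, and the "a few tasks per core" rule of thumb for picking the partition count are illustrative assumptions, not the exact code that was run.

```python
def partitions_for(total_cores: int, tasks_per_core: int = 3) -> int:
    """Pick a partition count as a small multiple of the cluster's cores.

    The default of 3 tasks per core is an assumed rule of thumb,
    not a value taken from the run described in this post.
    """
    if total_cores <= 0 or tasks_per_core <= 0:
        raise ValueError("total_cores and tasks_per_core must be positive")
    return total_cores * tasks_per_core


def annotate_chromosome(mt_path: str, out_path: str,
                        vep_config: str, total_cores: int) -> None:
    """Read one chromosome's matrix table, repartition it, run VEP, write it out.

    All arguments are hypothetical placeholders supplied by the caller.
    """
    # Imported here so the helper above can be used without Hail installed.
    import hail as hl

    mt = hl.read_matrix_table(mt_path)

    # Even out the partitions before the expensive VEP step.
    mt = mt.repartition(partitions_for(total_cores))

    # vep_config is the path to the VEP JSON configuration file.
    mt = hl.vep(mt, config=vep_config)

    # Writing the result is what triggers the actual computation.
    mt.write(out_path, overwrite=True)
```

On the cluster, this would be called once per chromosome, for example `annotate_chromosome(mt_path, out_path, vep_config, total_cores)` with that chromosome's own paths.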
Here are screenshots showing the issue: VEP annotation for chromosomes - Album on Imgur
No. of nodes:
Type of instance:
DNAnexus on AWS
Size of input data:
Data point: We ended up annotating the chromosome from 382.9 GiB to 2969.2 GiB and paying a cost of 413.6 pounds in a time span of
- What can I do to speed up my VEP annotation task? How can I get the partitions processed in parallel?
- If that is not possible, should I use a smaller cluster and let it run for 2-3 days as a more economical approach?
- Are there any other ways of annotating with Hail?
PS: Our VEP cache data sits in the HDFS cluster.