We wrote this code to do a decent number of reference data HT outer joins for our seqr loading pipeline back when Hail 0.2.12 was the latest and greatest. It took just over 7 hours to run on Dataproc using 20 n1-standard-8 workers.
We are revisiting how we store our reference data, and we're wondering whether there have been optimizations in the join method or whether we should change how the joins happen in the linked code. Either way, we want to be as efficient as possible, with the goal of doing this on the fly rather than storing the fully joined HT. This comes from an effort to unify our public reference data with gnomAD's, and from the possibility of groups needing different versions of certain resources.
We can definitely make this faster – the key is reading all the tables with the same partitions. I can’t help tonight, but give me a poke if I don’t get back to you by the end of tomorrow!
OK, so the gist of what we want to do is to use the same partitions to read each table. You should take the table you want to annotate and do the following to compute partition intervals that can be used to read everything:
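Something along these lines (a rough sketch, assuming Hail's experimental `Table._calculate_new_partitions` helper and the `_intervals` argument to `hl.read_table`; the paths, field names, and partition count are placeholders, not your actual pipeline):

```python
import hail as hl

hl.init()

# Hypothetical paths -- substitute the HTs from your pipeline.
base_path = 'gs://my-bucket/base_annotation.ht'
ref_paths = ['gs://my-bucket/ref_data_1.ht', 'gs://my-bucket/ref_data_2.ht']

# Compute partition intervals from the table you want to annotate.
# _calculate_new_partitions is an underscored/experimental method, so the
# exact name may vary across Hail versions.
base = hl.read_table(base_path)
intervals = base._calculate_new_partitions(5000)  # target partition count

# Re-read every table with the same intervals so they are co-partitioned.
base = hl.read_table(base_path, _intervals=intervals)
refs = [hl.read_table(p, _intervals=intervals) for p in ref_paths]

# Annotate on the fly; with identical partitioning these joins should be
# cheap zip joins rather than shuffles.
for i, ref in enumerate(refs):
    base = base.annotate(**{f'ref_{i}': ref[base.key]})
```

The idea is that once every table is read with the same partition intervals, each join can proceed partition-by-partition without a shuffle, which is where most of the time was going before.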
Sorry Tim, small follow-up question! Since we are joining a number of HTs, should I choose the largest HT to calculate the partition intervals? Or does the size not matter, with partition equality being the most important factor?
Sorry to bring this thread back again! I was getting a `HailException: attempted to read unindexed data as indexed` when I tried to read in the publicly available gnomAD exomes sites HT. I added a small conditional to read that specific table without the passed intervals, which allowed the script to continue until the write-out, but then I see the unindexed error again. Any ideas on how to solve this?
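For reference, the conditional I added looks roughly like this (`gnomad_exomes_path` is just a placeholder for the public gnomAD exomes sites HT path):

```python
# Workaround sketch: fall back to a plain read for the one table that
# raises the unindexed-data error when read with _intervals.
def read_ref_table(path, intervals):
    if path == gnomad_exomes_path:
        # Older file, reading with _intervals fails -- read it as-is.
        return hl.read_table(path)
    return hl.read_table(path, _intervals=intervals)
```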
Hi Mike,
So sorry we didn’t get back to you on this.
I think the exome sites must be a super old file that was written before indices were introduced. This should succeed though – what’s the stack trace on the unindexed error?