Hi Hail team,
We wrote this code to perform a fair number of outer joins across reference data HTs for our seqr loading pipeline, back when Hail 0.2.12 was the latest and greatest. It took just over 7 hours to run on Dataproc using 20 n1-standard-8s.
We are revisiting how we store our reference data, and we’re wondering whether there have been optimizations to the join method since then, or whether we should restructure the joins in the linked code. Either way, we want to be as efficient as possible, with the goal of doing the joins on the fly rather than storing the fully joined HT. This comes out of an effort to unify our public reference data with gnomAD’s, plus the possibility that different groups will need different versions of certain resources.
I appreciate any insight you can give.
We can definitely make this faster – the key is reading all the tables with the same partitions. I can’t help tonight, but give me a poke if I don’t get back to you by the end of tomorrow!
OK, so the gist of what we want to do is to use the same partitions to read each table. You should take the table you want to annotate and do the following to compute partition intervals that can be used to read everything:
partition_intervals = ht._calculate_new_partitions(N_PARTITIONS)
In this case, N_PARTITIONS should be a bit more than the existing number of partitions of ht above (more because we’re adding data).
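As a rough sketch of that step (the 20% bump and the helper names here are just illustrative choices, not tuned recommendations):

```python
def choose_n_partitions(current_n_partitions, growth_factor=1.2):
    """Pick a partition count a bit above the current one, since the
    joined table will carry more data per row. The 20% bump is an
    arbitrary illustrative choice."""
    return max(1, int(current_n_partitions * growth_factor))


def compute_partition_intervals(path, growth_factor=1.2):
    """Read the table we're annotating onto and compute partition
    intervals sized for the join. Hail is imported inside the function
    so the sketch can be defined without Hail installed."""
    import hail as hl

    ht = hl.read_table(path)
    n_partitions = choose_n_partitions(ht.n_partitions(), growth_factor)
    return ht._calculate_new_partitions(n_partitions)
```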
Then you should reread the table that was used to calculate the partitions using that partitioning:
ht = hl.read_table(...path..., _intervals=partition_intervals)
Finally, you should thread the _intervals argument through get_ht so that you can call get_ht using this partitioning:
hts = [get_ht(dataset, reference_genome, _intervals=partition_intervals) for dataset in datasets]
You should also read the coverage table with this partitioning. The rest of the script should run great after this!
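Putting the whole recipe together, something like the sketch below should work (the paths, the `dataset_paths` dict, and the join-by-key annotation are placeholders standing in for your script’s own `datasets`/`get_ht` machinery; `_intervals` and `_calculate_new_partitions` are the private Hail hooks mentioned above):

```python
def join_with_shared_partitions(anchor_path, dataset_paths, n_partitions):
    """Sketch of the recipe above: compute partition intervals from the
    anchor table, then read every table with those same intervals so the
    joins line up partition-for-partition instead of shuffling.
    Hail is imported inside the function so the sketch can be defined
    without Hail installed."""
    import hail as hl

    # 1. Compute partition intervals from the table we annotate onto.
    anchor = hl.read_table(anchor_path)
    intervals = anchor._calculate_new_partitions(n_partitions)

    # 2. Re-read the anchor table using that partitioning.
    ht = hl.read_table(anchor_path, _intervals=intervals)

    # 3. Read each reference table with the same intervals and join.
    for name, path in dataset_paths.items():
        ref_ht = hl.read_table(path, _intervals=intervals)
        ht = ht.annotate(**{name: ref_ht[ht.key]})

    return ht
```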
Sorry Tim, small follow-up question! Since we are joining a number of HTs, should I choose the largest HT to calculate the partition intervals, or does size not matter and partition equality is the most important factor?
Using the largest is a good idea, that should work fine.
Sorry to bring this thread back again! I was getting a HailException: attempted to read unindexed data as indexed when I tried to read in the publicly available gnomAD exomes sites HT. I added a small conditional to read that specific table without the passed intervals, which allowed the script to continue up until the write-out, but then I see the unindexed error again. Any ideas on how to solve this?
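For context, the conditional I added is roughly this (illustrative, not the exact code; the broad except is just to catch the indexing failure):

```python
def read_with_fallback(path, intervals):
    """Read a table with the shared partition intervals, falling back to
    a plain read for tables whose files predate indices. Illustrative
    workaround only; Hail is imported inside so the sketch can be
    defined without Hail installed."""
    import hail as hl

    try:
        return hl.read_table(path, _intervals=intervals)
    except Exception:
        # Older, unindexed files can't be read with _intervals; read
        # them normally and accept the shuffle on the join.
        return hl.read_table(path)
```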
So sorry we didn’t get back to you on this.
I think the exome sites must be a super old file that was written before indices were introduced. This should succeed though – what’s the stack trace on the unindexed error?