Room for improvement when joining multiple HTs?

Hi Hail team,

We wrote this code for a decent number of reference data HT outer joins for our seqr loading pipeline back when Hail 0.2.12 was the latest and greatest. It took just over 7 hours to run on Dataproc using 20 n1-standard-8s.

We are revisiting how we store our reference data, and we’re wondering whether there have been optimizations to the join method or whether we should change how the joins happen in the linked code. Either way, we want to be as efficient as possible, with the goal of doing this on the fly rather than storing the fully joined HT. This comes out of an effort to unify our public reference data with gnomAD’s, plus the possibility that groups will need different versions of certain resources.

I appreciate any insight you can give.

Thank you!
Mike

We can definitely make this faster – the key is reading all the tables with the same partitions. I can’t help tonight, but give me a poke if I don’t get back to you by the end of tomorrow!

OK, so the gist of what we want to do is to use the same partitions to read each table. You should take the table you want to annotate and do the following to compute partition intervals that can be used to read everything:

partition_intervals = ht._calculate_new_partitions(N_PARTITIONS)

In this case, N_PARTITIONS should be a bit more than the existing number of partitions of ht above (more because we’re adding data).
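To make that concrete, here’s a minimal sketch of this step (the path and the 1.5x factor are hypothetical; pick a count that suits your data):

    import hail as hl

    ht = hl.read_table('gs://my-bucket/base_table.ht')  # hypothetical path

    # Ask for somewhat more partitions than the table currently has, since the
    # joined table will carry more data per row. The 1.5x factor is illustrative.
    N_PARTITIONS = int(ht.n_partitions() * 1.5)
    partition_intervals = ht._calculate_new_partitions(N_PARTITIONS)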

Then you should re-read the table that was used to calculate the partitions, using those intervals:

ht = hl.read_table(...path..., _intervals=partition_intervals)

Finally, you should thread the _intervals argument through get_ht so that you can call get_ht with this partitioning:

    hts = [get_ht(dataset, reference_genome, _intervals=partition_intervals) for dataset in datasets]
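For what it’s worth, a sketch of what the threaded helper could look like (the path layout and dataset names are made up; the real get_ht lives in your linked script):

    import hail as hl

    def get_ht(dataset, reference_genome, _intervals=None):
        # Read each reference dataset's HT with the shared partition intervals
        # so the downstream outer joins line up without a shuffle.
        path = f'gs://my-reference-data/{dataset}.{reference_genome}.ht'  # hypothetical layout
        return hl.read_table(path, _intervals=_intervals)

    datasets = ['cadd', 'clinvar']  # hypothetical dataset list
    reference_genome = 'GRCh38'
    # partition_intervals computed from the base table as above
    hts = [get_ht(dataset, reference_genome, _intervals=partition_intervals) for dataset in datasets]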

You should also read the coverage table with this partitioning (see the sketch below). The rest of the script should run great after this!
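The coverage read would follow the same pattern, something like (path hypothetical):

    # partition_intervals is the same list computed from the base table above
    coverage_ht = hl.read_table('gs://my-reference-data/coverage.ht',  # hypothetical path
                                _intervals=partition_intervals)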


Thank you Tim!