Hi Hail team,
We wrote this code to perform a fair number of outer joins across reference data HTs for our seqr loading pipeline, back when Hail 0.2.12 was the latest and greatest. It took just over 7 hours to run on Dataproc using 20 n1-standard-8s.
We are revisiting how we store our reference data, and we’re wondering whether there have been optimizations to the join method since then, or whether we should restructure the joins in the linked code. Either way, we want to be as efficient as possible, with the goal of doing the joins on the fly rather than storing the fully joined HT. This comes out of an effort to unify our public reference data with gnomAD’s, plus the possibility that different groups will need different versions of certain resources.
I appreciate any insight you can give.
We can definitely make this faster – the key is reading all the tables with the same partitions. I can’t help tonight, but give me a poke if I don’t get back to you by the end of tomorrow!
OK, so the gist of what we want to do is to use the same partitions to read each table. You should take the table you want to annotate and do the following to compute partition intervals that can be used to read everything:
partition_intervals = ht._calculate_new_partitions(N_PARTITIONS)
In this case, N_PARTITIONS should be a bit more than the existing number of partitions of ht above (more because we’re adding data).
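As a rough sketch of that step (the 20% bump and the helper names here are just illustrative choices, not tuned recommendations):

```python
def choose_n_partitions(current_n_partitions, growth_factor=1.2):
    """Pick a partition count a bit above the current one, since the
    joined table will carry more data per row. The 20% bump is an
    arbitrary illustrative choice."""
    return max(1, int(current_n_partitions * growth_factor))


def compute_partition_intervals(path, growth_factor=1.2):
    """Read the table we're annotating onto and compute partition
    intervals sized for the join. Hail is imported inside the function
    so the sketch can be defined without Hail installed."""
    import hail as hl

    ht = hl.read_table(path)
    n_partitions = choose_n_partitions(ht.n_partitions(), growth_factor)
    return ht._calculate_new_partitions(n_partitions)
```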
Then you should reread the table that was used to calculate the partitions using that partitioning:
ht = hl.read_table(...path..., _intervals=partition_intervals)
Finally, you should thread the _intervals argument through get_ht so that you can call get_ht using this partitioning:
hts = [get_ht(dataset, reference_genome, _intervals=partition_intervals) for dataset in datasets]
You should also read the coverage table with this partitioning. The rest of the script should run great after this!
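Putting the whole recipe together, something like the sketch below should work (the paths, the `dataset_paths` dict, and the join-by-key annotation are placeholders standing in for your script’s own `datasets`/`get_ht` machinery; `_intervals` and `_calculate_new_partitions` are the private Hail hooks mentioned above):

```python
def join_with_shared_partitions(anchor_path, dataset_paths, n_partitions):
    """Sketch of the recipe above: compute partition intervals from the
    anchor table, then read every table with those same intervals so the
    joins line up partition-for-partition instead of shuffling.
    Hail is imported inside the function so the sketch can be defined
    without Hail installed."""
    import hail as hl

    # 1. Compute partition intervals from the table we annotate onto.
    anchor = hl.read_table(anchor_path)
    intervals = anchor._calculate_new_partitions(n_partitions)

    # 2. Re-read the anchor table using that partitioning.
    ht = hl.read_table(anchor_path, _intervals=intervals)

    # 3. Read each reference table with the same intervals and join.
    for name, path in dataset_paths.items():
        ref_ht = hl.read_table(path, _intervals=intervals)
        ht = ht.annotate(**{name: ref_ht[ht.key]})

    return ht
```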
Sorry Tim, small follow-up question! Since we are joining a number of HTs, should I choose the largest HT to calculate the partition intervals, or does size not matter and partition equality is the most important factor?
Using the largest is a good idea, that should work fine.
Sorry to bring this thread back again! I was getting a HailException: attempted to read unindexed data as indexed when I tried to read in the publicly available gnomAD exomes sites HT. I added a small conditional to read that specific table without the passed intervals, which allowed the script to continue up until the write-out, but then I see the unindexed error again. Any ideas on how to solve this?
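For context, the conditional I added is roughly this (illustrative, not the exact code; the broad except is just to catch the indexing failure):

```python
def read_with_fallback(path, intervals):
    """Read a table with the shared partition intervals, falling back to
    a plain read for tables whose files predate indices. Illustrative
    workaround only; Hail is imported inside so the sketch can be
    defined without Hail installed."""
    import hail as hl

    try:
        return hl.read_table(path, _intervals=intervals)
    except Exception:
        # Older, unindexed files can't be read with _intervals; read
        # them normally and accept the shuffle on the join.
        return hl.read_table(path)
```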
So sorry we didn’t get back to you on this.
I think the exome sites must be a super old file that was written before indices were introduced. This should succeed though – what’s the stack trace on the unindexed error?