Hi Hail team,
We wrote this code to run a decent number of outer joins across reference data HTs for our seqr loading pipeline, back when Hail 0.2.12 was the latest and greatest. On that version, it took just over 7 hours to run on Dataproc using 20 n1-standard-8s.
We are revisiting how we store our reference data, and we’re wondering whether the join method has been optimized since then or whether we should change how the joins happen in the linked code. Either way, we want the joins to be as efficient as possible, with the goal of doing them on the fly rather than storing the fully joined HT. This comes from an effort to unify our public reference data with gnomAD’s, and from the possibility that groups will need different versions of certain resources.
I appreciate any insight you can give.