Best approach to join a large number of tables to a matrix table?


I have a large number of tables (6k files, each 17 mil rows, 6 non-key cols) that share the common key (locus, alelles) that I would like to join in a large matrix table to allow for querying across collumns.

At the moment I am transforming the tables (tsvs) into individual hail tables, then transforming those by to_matrix_table_row_major, and finally union_cols.

The issue I was seeing when pulling multiple of these together was StackOverflow. I suspect the execution plan was just too big to fit in memory. Reducing the number of union_cols executed before persisting the data fixed the issue.

At the moment I am batching the union_cols in a for loop and persisting the table at the end of each batch. I am not sure, however, if this will fail eventually.

I was wondering, is this a viable approach at all? The goal is to have single data source with column querying capability.


Hi Jarda, this is quite reasonable. You’ll get the best performance using the multi_way_zip_join method on Table, and using a hierarchical merge that executes no more than ~100 joins at a time.

Here’s a script that should do what you want – this assumes, however, that the inputs have already been converted to single-column Hail MatrixTables with import_table and to_matrix_table_row_major, key_cols_by(col_identifier=...some_unique_identifier_per_file) then write. It also assumes that the row keys are exactly the same for each input.

Untested, there may be errors when you try this.

(also, I cannibalized this from a script for another hierarchical merge task, so some parts have unnecessary complexity your problem doesn’t require)

Hi Tim,

Thanks for the reply and code snippet. It is a bit lower-level than I expected. However, I will dig deeper in the internals to make sure I can use this.

I understand, that the vcf_combiner is doing the same thing in principle. I think having a generic method perform joins over large number of tables to create a matrix table would benefit others too - e.g. union_cols with variadic params of matrix tables.

Anyway, thanks again!


Yes, I didn’t intend this as something you’d write yourself – this is something that can get you up and running right now until this is built in. It’s not 100% easy to add this as a feature, since I want to share infrastructure between the various combiner (GVCF/VCF/table) modules and need to figure out what design is the best one to do that.