Best approach to join a large number of tables to a matrix table?

Hello,

I have a large number of tables (6k files, each 17 mil rows, 6 non-key cols) that share the common key (locus, alelles) that I would like to join in a large matrix table to allow for querying across collumns.

At the moment I am transforming the tables (tsvs) into individual hail tables, then transforming those by to_matrix_table_row_major, and finally union_cols.

The issue I was seeing when pulling multiple of these together was StackOverflow. I suspect the execution plan was just too big to fit in memory. Reducing the number of union_cols executed before persisting the data fixed the issue.

At the moment I am batching the union_cols in a for loop and persisting the table at the end of each batch. I am not sure, however, if this will fail eventually.

I was wondering, is this a viable approach at all? The goal is to have single data source with column querying capability.

Thanks,
Jarda

Hi Jarda, this is quite reasonable. You’ll get the best performance using the multi_way_zip_join method on Table, and using a hierarchical merge that executes no more than ~100 joins at a time.

Here’s a script that should do what you want – this assumes, however, that the inputs have already been converted to single-column Hail MatrixTables with import_table and to_matrix_table_row_major, key_cols_by(col_identifier=...some_unique_identifier_per_file) then write. It also assumes that the row keys are exactly the same for each input.

Untested, there may be errors when you try this.

(also, I cannibalized this from a script for another hierarchical merge task, so some parts have unnecessary complexity your problem doesn’t require)

Hi Tim,

Thanks for the reply and code snippet. It is a bit lower-level than I expected. However, I will dig deeper in the internals to make sure I can use this.

I understand, that the vcf_combiner is doing the same thing in principle. I think having a generic method perform joins over large number of tables to create a matrix table would benefit others too - e.g. union_cols with variadic params of matrix tables.

Anyway, thanks again!

Cheers,
Jarda

Yes, I didn’t intend this as something you’d write yourself – this is something that can get you up and running right now until this is built in. It’s not 100% easy to add this as a feature, since I want to share infrastructure between the various combiner (GVCF/VCF/table) modules and need to figure out what design is the best one to do that.