I have a large number of tables (6k files, each with 17 million rows and 6 non-key columns) that share a common key (locus, alleles), and I would like to join them into one large matrix table so I can query across columns.
At the moment I am converting each TSV into an individual Hail table, transforming each of those with to_matrix_table_row_major, and finally combining them with union_cols.
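For reference, the per-file conversion looks roughly like this (a sketch, assuming Hail is installed; the field and column names here are illustrative, not taken from my actual schema):

```python
def tsv_to_matrix_table(path, value_cols):
    """Convert one TSV into a MatrixTable keyed by (locus, alleles).

    `value_cols` is the list of the 6 non-key column names; each becomes
    one column of the resulting MatrixTable.
    """
    import hail as hl  # imported lazily; requires a Hail installation

    ht = hl.import_table(path, impute=True)
    # In the real pipeline the key fields would be parsed into proper
    # locus/alleles types; keeping it simple here.
    ht = ht.key_by('locus', 'alleles')
    return ht.to_matrix_table_row_major(
        columns=value_cols,
        entry_field_name='value',
        col_field_name='sample',
    )
```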
The issue I was seeing when combining many of these together was a StackOverflowError. I suspect the execution plan was simply too large to fit in memory. Reducing the number of union_cols calls executed before persisting the data fixed the issue.
At the moment I am batching the union_cols calls in a for loop and persisting the intermediate result at the end of each batch. I am not sure, however, whether this will eventually fail as well.
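The batching pattern I'm using boils down to the following. To keep this snippet self-contained and runnable without a Spark cluster, `combine` and `materialize` are stand-ins for union_cols and persist/checkpoint respectively; the structure is the point:

```python
from functools import reduce


def union_in_batches(items, combine, materialize, batch_size=10):
    """Fold `combine` over `items`, materializing after every batch.

    Materializing periodically cuts the lazily built execution plan so it
    never grows past `batch_size` joins, which is what batching
    union_cols + persist does in the Hail pipeline.
    """
    result = None
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        if result is not None:
            # carry the previous batch's materialized result forward
            batch = [result] + batch
        # combine the batch pairwise, then cut the plan
        result = materialize(reduce(combine, batch))
    return result
```

With `combine` as list concatenation and `materialize` as the identity, `union_in_batches([[i] for i in range(25)], lambda a, b: a + b, lambda x: x, batch_size=7)` returns `list(range(25))`; in the real pipeline the same loop runs over MatrixTables.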
I was wondering: is this a viable approach at all? The goal is to have a single data source with column-querying capability.