Merging single sample MatrixTables into one big MatrixTable

Hi,

I have converted many single sample VCFs, each into separate MatrixTable. Now I would like to merge them all together into one big MatrixTable. What would be the best way to do so?

Should I iterate through all the samples (MatrixTables), reading them one by one, merging them sequentially with “union_cols” and storing the merged .mt along the way, or is there a more efficient way?

Thanks

Hey @igorm,

Do you have GVCFs or single-sample project VCFs? If you have GVCFs, I recommend Hail’s new VDS representation and VDS combiner: Hail | Variant Dataset.

If you have single-sample project VCFs, take a look at Importing many sample-specific VCFs. I should caution you: there’s no principled way to combine project VCFs. Project VCFs lack reference block information, as a result, you cannot confidently decide what genotype to assign to an individual at a new variant. You’ll need to make that decision when you combine two project VCFs with differing sets of variants.

If at all possible, I recommend retrieving the original GVCF files and using those with Hail’s VDS tools.

@danking, thanks for fast response (as always).

union_cols chaining of MatrixTables is not a viable option?

mts = [mt1, mt2, mt3,..]
database = mts[0].union_cols(mts[1]).union_cols(mts[2]).write('database.mt')

Thanks.

Hey @igorm!

This will produce the expected output, however, it is slower than Tim’s suggestion. Tim’s suggestion incorporates two key insights:

  1. Unioning a very wide matrix table with a very thin matrix table is bottlenecked on the time to read and merge the very wide matrix table (because it contains more bytes). In particular, consider that mtN is merged into an mt that contains N-1 sequences.
  2. There is a practical limit on the number of files from which Hail can read. In theory, there’s the open files limit, but Hail will likely fail before that due to Hail’s per-file memory overhead.

Tim’s approach addresses (1) by using the same machinery that the VDS combiner uses, multi way zip join, to join multiple matrix tables simultaneously (rather than just two). Issue (2) is addressed by associating the union tables into a tree rather than a list. In particular, what you’ve written looks like this (I use uc as short than for union cols):

uc(mt1, uc(mt2, uc(mt3, ... uc(mtN-1, mtN) ...)))

Tim’s approach looks, roughly, like this (in this picture uc writes and then reads the matrix table before returning it):

uc(uc(mt1, mt2, mt3, ... mtCHUNK_SIZE),
   uc(mtCHUNK_SIZE+1, ... mtCHUNK_SIZE*2))

EDIT: All that said, I can see why you’d be apprehensive about a pile of code on a forum. We should merge something like this into the mainline of Hail.