Merging single sample MatrixTables into one big MatrixTable

igorm · November 17, 2022, 1:58pm

Hi,

I have converted many single sample VCFs, each into separate MatrixTable. Now I would like to merge them all together into one big MatrixTable. What would be the best way to do so?

Should I iterate through all the samples (MatrixTables), reading them one by one, merging them sequentially with “union_cols” and storing the merged .mt along the way, or is there a more efficient way?

Thanks

danking · November 17, 2022, 4:38pm

Hey @igorm,

Do you have GVCFs or single-sample project VCFs? If you have GVCFs, I recommend Hail’s new VDS representation and VDS combiner: Hail | Variant Dataset.

If you have single-sample project VCFs, take a look at Importing many sample-specific VCFs. I should caution you: there’s no principled way to combine project VCFs. Project VCFs lack reference block information, as a result, you cannot confidently decide what genotype to assign to an individual at a new variant. You’ll need to make that decision when you combine two project VCFs with differing sets of variants.

If at all possible, I recommend retrieving the original GVCF files and using those with Hail’s VDS tools.

igorm · November 18, 2022, 1:48pm

@danking, thanks for fast response (as always).

union_cols chaining of MatrixTables is not a viable option?

mts = [mt1, mt2, mt3,..]
database = mts[0].union_cols(mts[1]).union_cols(mts[2]).write('database.mt')

Thanks.

danking · November 18, 2022, 3:32pm

Hey @igorm!

This will produce the expected output, however, it is slower than Tim’s suggestion. Tim’s suggestion incorporates two key insights:

Unioning a very wide matrix table with a very thin matrix table is bottlenecked on the time to read and merge the very wide matrix table (because it contains more bytes). In particular, consider that mtN is merged into an mt that contains N-1 sequences.
There is a practical limit on the number of files from which Hail can read. In theory, there’s the open files limit, but Hail will likely fail before that due to Hail’s per-file memory overhead.

Tim’s approach addresses (1) by using the same machinery that the VDS combiner uses, multi way zip join, to join multiple matrix tables simultaneously (rather than just two). Issue (2) is addressed by associating the union tables into a tree rather than a list. In particular, what you’ve written looks like this (I use uc as short than for union cols):

uc(mt1, uc(mt2, uc(mt3, ... uc(mtN-1, mtN) ...)))

Tim’s approach looks, roughly, like this (in this picture uc writes and then reads the matrix table before returning it):

uc(uc(mt1, mt2, mt3, ... mtCHUNK_SIZE),
   uc(mtCHUNK_SIZE+1, ... mtCHUNK_SIZE*2))

EDIT: All that said, I can see why you’d be apprehensive about a pile of code on a forum. We should merge something like this into the mainline of Hail.

Topic		Replies	Views
Store multiple vcfs into single MatrixTable Hail Query & hailctl	10	753	September 9, 2020
[Hail 0.2] Merge two MatrixTable Help [0.1]	11	2972	November 19, 2019
Merge vcf or matrix tables Hail Query & hailctl	4	565	January 14, 2021
Appending to an existing Matrixtable Hail Query & hailctl	8	437	March 16, 2021
Import multi sample vcf Hail Query & hailctl	6	502	October 24, 2022

Merging single sample MatrixTables into one big MatrixTable

Related topics