Avoiding duplicate rows after MatrixTable.union_rows

hhx037 · March 1, 2019, 1:46pm

I want to merge two genotype sets for the same cohort using MatrixTable.union_rows, but I want to avoid duplicate rows. I previously tried making a list of the variants in the first set and excluding them from the second set before merging, but the two sets have 13 million variants in common, and the job crashed during the massive shuffle triggered by the exclusion step.

Is there a cleaner way to merge only those variants from the second set that are not already in the first?

tpoterba · March 1, 2019, 1:53pm

something like this:

# Keep only the mt2 rows whose row key is absent from mt1, then append them to mt1
mt3 = mt1.union_rows(mt2.filter_rows(hl.is_missing(mt1.rows()[mt2.row_key])))
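
For readers who want to see what that one-liner is doing, here is a rough plain-Python sketch of the same logic. It is not Hail code: the dicts, the function name union_rows_without_duplicates, and the sample variants are all made up for illustration, with dict keys standing in for the row key (locus, alleles).

```python
# Plain-Python illustration only; these dicts stand in for MatrixTables.
# Keys play the role of the row key (locus, alleles); values stand in for a row's data.

def union_rows_without_duplicates(first, second):
    """Return `first` plus the rows of `second` whose key is not already in `first`."""
    # Anti-join step: analogue of mt2.filter_rows(hl.is_missing(mt1.rows()[mt2.row_key]))
    only_in_second = {key: row for key, row in second.items() if key not in first}
    # Union step: analogue of mt1.union_rows(...); the key sets are now disjoint
    merged = dict(first)
    merged.update(only_in_second)
    return merged


# Hypothetical toy data
mt1_rows = {
    ("1:1000", ("A", "T")): "genotypes from set 1",
    ("1:2000", ("G", "C")): "genotypes from set 1",
}
mt2_rows = {
    ("1:2000", ("G", "C")): "genotypes from set 2",  # shared variant, dropped
    ("1:3000", ("T", "G")): "genotypes from set 2",  # new variant, kept
}

merged_rows = union_rows_without_duplicates(mt1_rows, mt2_rows)
```

For shared variants the row from the first set wins, which matches the Hail expression above. Hail 0.2 also has MatrixTable.anti_join_rows, which as far as I know expresses the same filter (something like mt2.anti_join_rows(mt1.rows())), but check the docs for your version before relying on it.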

tpoterba · March 1, 2019, 1:53pm

union_rows shouldn’t require a shuffle

hhx037 · March 1, 2019, 3:15pm

Brilliant, thank you very much, I’ll try this
