Avoiding duplicate rows after MatrixTable.union_rows


#1

I want to merge two genotype sets for the same cohort using MatrixTable.union_rows without ending up with duplicate rows. I previously tried building a list of the variants in the first set and excluding them from the second set before merging, but the two sets have 13 million variants in common, and the job crashed during the massive shuffle triggered by the exclusion step.

Is there a cleaner way to merge in only those variants from the second set that are not already in the first set?


#2

Something like this:

mt3 = mt1.union_rows(mt2.filter_rows(hl.is_missing(mt1.rows()[mt2.row_key])))
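
The same idea can also be written with MatrixTable.anti_join_rows, which keeps only the rows whose key does not appear in another table. The helper below is a minimal sketch, not the only way to do it: the name union_new_rows is made up for illustration, and it assumes mt1 and mt2 share the same row key, column keys (in the same order) and row/entry schemas, which union_rows expects.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    import hail as hl


def union_new_rows(mt1: "hl.MatrixTable", mt2: "hl.MatrixTable") -> "hl.MatrixTable":
    """Append to mt1 only the rows of mt2 whose row key is absent from mt1.

    Hypothetical helper name. Assumes both datasets have matching row keys,
    column keys (same order) and row/entry schemas.
    """
    # Keep the rows of mt2 whose key is NOT present in mt1's rows table.
    mt2_new = mt2.anti_join_rows(mt1.rows())
    # Stack the filtered rows underneath mt1.
    return mt1.union_rows(mt2_new)
```

Usage would be mt3 = union_new_rows(mt1, mt2). This only removes keys that mt2 shares with mt1; if either input already contains repeated row keys on its own, mt3.distinct_by_row() can be applied afterwards to keep one row per key.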

#3

union_rows shouldn’t require a shuffle.


#4

Brilliant, thank you very much, I’ll try this :slight_smile: