Indel matching in semi_join/anti_join


I am using semi_join(), anti_join(), semi_join_rows() and anti_join_rows() to discover variants which are present in both of two datasets, or variants present in only one dataset.
I have a question about indel normalisation. Does Hail either left-align or normalise indels when running these functions?
For example, will it recognise chr1:119753410 AACAC>AAC and chr1:119753412 CAC>C as the same variant? Or, if it encounters two different representations of the same indel, will it treat them as different variants?

Many thanks in advance


Hail doesn’t automatically compute the minimal representation of indels when you join. It’s certainly possible to do this yourself:

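# Re-key the MatrixTable rows by each variant's minimal representation
mt = mt.key_rows_by(**hl.min_rep(mt.locus, mt.alleles))

# Do the same for the Table
ht = ht.key_by(**hl.min_rep(ht.locus, ht.alleles))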
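
To show what a minimal-representation step does to the two records from the question, here is a standalone pure-Python sketch. It is not Hail's implementation: the helper name `trim_shared_bases` and the suffix-then-prefix trimming order are assumptions made for illustration, and it only handles a single alternate allele. Under those assumptions, trimming alone leaves the two records at different positions, since moving a deletion within a repeat (left-alignment) requires the reference sequence.

```python
def trim_shared_bases(pos, ref, alt):
    """Trim bases shared by ref and alt: suffix first, then prefix.

    Hypothetical helper for illustration only; it handles one alternate
    allele and never trims an allele below a single base.
    """
    # Drop identical trailing bases.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop identical leading bases, advancing the position each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


first = trim_shared_bases(119753410, "AACAC", "AAC")
second = trim_shared_bases(119753412, "CAC", "C")
print(first)            # (119753410, 'AAC', 'A')
print(second)           # (119753412, 'CAC', 'C')
print(first == second)  # False
```

If the re-keyed datasets still fail to match on indels you expect to be shared, it may be worth spot-checking a few known variants, or left-aligning both inputs against the same reference before loading them.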

Thank you, that is very helpful :+1: