Calculate minimal representation

What function can we use to calculate the minimal representation of variant alleles? We know that split_multi does this behind the scenes, but we removed it and want to run only the minimal-representation step.

min_rep:

https://hail.is/docs/0.2/functions/genetics.html#hail.expr.functions.min_rep

How do I apply it to each row of a MatrixTable? I tried this, but it does not work:
mt.row_key = hl.expr.functions.min_rep(mt.row_key)

You can use annotate_rows to create a new field

I tried that:

mt = mt.annotate_rows(alleles=hl.min_rep(mt.locus))

But it's giving an error:

TypeError: min_rep() missing 1 required positional argument: 'alleles'

mt.locus is of type LocusExpression and it has both locus and alleles

min_rep takes two arguments and you’re only giving it one.

Yeah, I see. I am not sure how to split the LocusExpression into two parts and then assign alleles and locus correctly, since min_rep returns two values as well.

I tried doing this:

mt = mt.annotate_rows(alleles=hl.min_rep(mt.locus, mt.alleles)[1])

But it gave an error:

ExpressionException: ‘MatrixTable.annotate_rows’: cannot overwrite key field ‘alleles’ with annotate, select or drop; use key_by to modify keys.

That’s the right thing to do, but you can’t overwrite the alleles field since it’s a key field. You have to run mt = mt.key_rows_by(mt.locus) so the rows are keyed by locus only. Then you can overwrite alleles.

This seems to work when called as below. Is this the correct way to use it? Thanks!
mt = mt.key_rows_by(locus=mt.locus, alleles=hl.min_rep(mt.locus, mt.alleles)[1])

Do you also want the new locus if it changes? min_rep can compute a new locus too.

Essentially, we would like to replicate the behavior of split_multi_hts, so I think we should use the new locus as well. If I understand correctly, min_rep shifts the position when it removes bases from the left side of the alleles? We would also like to back up the originals as alleles_old and locus_old.

mt = mt.annotate_rows(alleles_old=mt.alleles, locus_old=mt.locus)
mt = mt.key_rows_by(locus=hl.min_rep(mt.locus, mt.alleles)[0], alleles=hl.min_rep(mt.locus, mt.alleles)[1])

mt = mt.key_rows_by(locus=hl.min_rep(mt.locus, mt.alleles)[0], alleles=hl.min_rep(mt.locus, mt.alleles)[1]) should work correctly.
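On the question of whether the position shifts: here is a rough pure-Python sketch of the usual minimal-representation trimming, in case it helps to see why the locus can change. It is not Hail's code. The function name min_rep_sketch is made up for illustration, it works on a plain integer position rather than a Locus, and it does not handle '*' alleles or any other special case that hl.min_rep may treat differently.

```python
from typing import List, Tuple


def min_rep_sketch(position: int, alleles: List[str]) -> Tuple[int, List[str]]:
    """Trim bases shared by all alleles; return (new_position, new_alleles).

    Illustrative only: a hypothetical helper, not part of Hail.
    """
    if not alleles:
        raise ValueError("expected at least a reference allele")
    ref, alts = alleles[0], alleles[1:]
    if not alts:
        return position, list(alleles)

    min_len = min(len(a) for a in alleles)

    # Trim the shared suffix first, always leaving at least one base per allele.
    n_end = 0
    while n_end < min_len - 1 and all(
        a[len(a) - n_end - 1] == ref[len(ref) - n_end - 1] for a in alts
    ):
        n_end += 1

    # Then trim the shared prefix. Each base removed here moves the position right by one.
    n_start = 0
    while n_start < min_len - n_end - 1 and all(
        a[n_start] == ref[n_start] for a in alts
    ):
        n_start += 1

    trimmed = [a[n_start:len(a) - n_end] for a in alleles]
    return position + n_start, trimmed


# A padded deletion: the trailing T and the leading C are shared by both alleles.
print(min_rep_sketch(100, ["CAGT", "CAT"]))
```

Only the prefix trimming changes the position. Trimming a shared suffix shortens the alleles but leaves the locus where it was, which is why both elements of the min_rep result are needed when re-keying.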