What function could we use to calculate the minimal representation of variant alleles? We know that split_multi does it behind the scenes, but we removed that step and want to run only the minimal representation computation.
How can I apply it to each row of a MatrixTable? I tried this, but it does not work:
mt.row_key = hl.expr.functions.min_rep(mt.row_key)
You can use annotate_rows to create a new field.
I tried that:
mt = mt.annotate_rows(alleles=hl.min_rep(mt.locus))
But it's giving an error:
TypeError: min_rep() missing 1 required positional argument: 'alleles'
mt.locus is of type LocusExpression, and the row key has both locus and alleles. min_rep takes two arguments (the locus and the alleles) and you’re only giving it one.
Yeah, I see. I am not sure how to pass the locus and the alleles separately and then assign them back correctly, since min_rep returns two values as well.
I tried doing this:
mt = mt.annotate_rows(alleles=hl.min_rep(mt.locus, mt.alleles)[1])
But it gave an error:
ExpressionException: 'MatrixTable.annotate_rows': cannot overwrite key field 'alleles' with annotate, select or drop; use key_by to modify keys.
That’s the right thing to do, but you can’t overwrite the alleles field since it’s a key field. You have to do mt = mt.key_rows_by(mt.locus) to change it to just be keyed by locus. Then you can overwrite alleles.
It seems to work when I call it as below. Is this the correct way to use it? Thanks!
mt = mt.key_rows_by(locus=mt.locus, alleles=hl.min_rep(mt.locus, mt.alleles)[1])
Do you also want the new locus if it changes? min_rep can compute a new locus too.
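To illustrate why the locus can change, here is a minimal pure-Python sketch of the usual trimming idea: drop bases shared at the right end of all alleles, then bases shared at the left end, moving the position forward for each base removed on the left. This is only an illustration under that assumption, not Hail's implementation; the function name min_rep_sketch is made up for this example, and it ignores symbolic alleles such as `*`.

```python
def min_rep_sketch(pos, alleles):
    """Trim bases shared by all alleles; return (new_pos, new_alleles).

    Illustrative only: not Hail's code, and symbolic alleles are not handled.
    """
    alleles = list(alleles)
    # Trim shared trailing bases, keeping at least one base in every allele.
    while min(len(a) for a in alleles) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Trim shared leading bases; each removed base shifts the position by one.
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


print(min_rep_sketch(100, ["GCAT", "GCAC"]))  # (103, ['T', 'C'])
print(min_rep_sketch(100, ["TAA", "TA"]))     # (100, ['TA', 'T'])
```

In the first call three shared leading bases are removed, so the position moves from 100 to 103; in the second only a trailing base is removed, so the position stays the same.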
We would pretty much like to replicate the behavior of split_multi_hts, so I think we should use the new locus as well. If I understand correctly, min_rep will shift the position if it removes bases from the left side of the alleles? We would also like to back up the originals as alleles_old and locus_old.
mt = mt.annotate_rows(alleles_old=mt.alleles, locus_old=mt.locus)
mt = mt.key_rows_by(locus=hl.min_rep(mt.locus, mt.alleles)[0], alleles=hl.min_rep(mt.locus, mt.alleles)[1])
mt = mt.key_rows_by(locus=hl.min_rep(mt.locus, mt.alleles)[0], alleles=hl.min_rep(mt.locus, mt.alleles)[1]) should work correctly.
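Since the call above builds the min_rep expression twice, one possible tidy-up is to bind it once and read both fields from it. This is a sketch, assuming min_rep returns a struct with locus and alleles fields (as the current Hail docs describe) and that mt is keyed by locus and alleles; the helper name min_rep_rows is made up for this example.

```python
def min_rep_rows(mt):
    """Re-key a MatrixTable by the minimal representation of each variant.

    Hypothetical helper; the original locus and alleles are kept as
    locus_old and alleles_old.
    """
    # Imported here so that defining the helper does not require Hail.
    import hail as hl

    # Back up the original values before the key fields are replaced.
    mt = mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles)
    # Build the min_rep expression once and reuse it for both key fields.
    minimal = hl.min_rep(mt.locus, mt.alleles)
    return mt.key_rows_by(locus=minimal.locus, alleles=minimal.alleles)
```

Usage would then be mt = min_rep_rows(mt). Because the locus can change, re-keying may have to reorder rows, so it is worth checking a few variants against locus_old and alleles_old afterwards.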