Variant representation normalization

dnafault · October 28, 2017, 2:04am

Based on the results of splitting multi-allelic variants in Hail, I see that it applies normalization. It looks like left-aligned normalization, but I can’t find any mention of the algorithm used or any discussion of the matter. Any pointers to documentation or discussion would be appreciated.

tpoterba · October 28, 2017, 9:26am

This definition looks pretty correct. We trim identical bases from each side of the reference and alternate allele, and adjust the position as necessary.

This usually isn’t a problem for VCFs created by recent GATK versions – out of ~30M multiallelic variants in the raw gnomAD callset, there wasn’t one case where Hail had to trim anything.
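The trimming described in this reply can be sketched in a few lines of Python. This is only an illustration of the idea, not Hail's Scala code: the function name `min_rep`, the choice to trim the shared suffix before the shared prefix, and the rule of keeping at least one base in each allele are assumptions made for the example.

```python
def min_rep(pos, ref, alt):
    """Trim bases shared by ref and alt from both ends, keeping >= 1 base each.

    Illustrative sketch only; `pos` is the coordinate of the first ref base.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")

    # Trim the shared suffix first; this never changes the position.
    n_end = 0
    while (n_end < len(ref) - 1 and n_end < len(alt) - 1
           and ref[-1 - n_end] == alt[-1 - n_end]):
        n_end += 1

    # Then trim the shared prefix; each trimmed base moves the position forward.
    n_start = 0
    while (n_start < len(ref) - n_end - 1 and n_start < len(alt) - n_end - 1
           and ref[n_start] == alt[n_start]):
        n_start += 1

    return (pos + n_start,
            ref[n_start:len(ref) - n_end],
            alt[n_start:len(alt) - n_end])


print(min_rep(100, "TAAC", "TGAC"))  # (101, 'A', 'G')
print(min_rep(5, "CTT", "CT"))       # (5, 'CT', 'C')
```

Note that the position can only stay the same or increase here, because no reference sequence is consulted. The order of trimming also matters in this sketch: trimming the prefix of `CTT`/`CT` first would give `TT`/`T` one base to the right instead of `CT`/`C`.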

dnafault · October 29, 2017, 4:51am

Well, precise details do matter here. I would be surprised if the above-mentioned algorithm was really implemented in Hail, as it requires a full reference genome. Besides, details like the order of trimming also matter. It would be nice to see the implementation in code.

Interesting. I would expect the opposite. Most of the multiallelic variants in our dataset require normalization: about 9M variants out of ~14M variants total in ~6M multiallelic sites (GATK 3.7).

tpoterba · October 29, 2017, 5:00am

Here’s the code: https://github.com/hail-is/hail/blob/0.1/src/main/scala/is/hail/variant/Variant.scala#L371

We don’t need the reference to compute the minimal representation, since we assume that the input variant is properly aligned. We’d be interested to see examples of multiallelic variants that do require position normalization – are you able to share a few examples?

dnafault · October 29, 2017, 10:00am

I see the reason for the discrepancy. I was talking about biallelics (i.e. whatever we get after splitting). You were talking about normalization of multiallelic variants as a whole. Once we break them down into biallelics, we get the proportion I mentioned of variants that require normalization.

Without left extension, and with a workaround to avoid empty alleles, Hail’s implementation is quite different from the one described here. It does not produce a normalized representation in the general case. See the first two examples.

About that assumption: are there any guarantees that GATK will never produce variants aligned like those first two in a joint-called VCF?
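The discrepancy described in this post, where a multiallelic site needs no trimming as a whole but its split biallelics do, can be illustrated with a small check. This is a sketch under stated assumptions: `split_multiallelic` and `shared_ends` are hypothetical helper names, splitting is modelled as simply pairing the ref with each alt, and the site used is the `ACG` example given later in this thread at an arbitrary position.

```python
def split_multiallelic(pos, ref, alts):
    """Pair the shared ref with each alt, with no further processing."""
    return [(pos, ref, alt) for alt in alts]


def shared_ends(alleles):
    """Report whether all alleles share their first and/or last base."""
    return {
        "leading": len({a[0] for a in alleles}) == 1,
        "trailing": len({a[-1] for a in alleles}) == 1,
    }


# Example site from this thread; position 100 is arbitrary.
pos, ref, alts = 100, "ACG", ["CG", "TAA"]

# Taken as a whole, the alleles share neither a first nor a last base.
print(shared_ends([ref, *alts]))

# After splitting, ACG/CG shares a trailing base, so it is no longer minimal.
for _, r, a in split_multiallelic(pos, ref, alts):
    print(r, a, shared_ends([r, a]))
```

A shared base at either end of a biallelic pair is a sign that some trimming (and possibly left extension) is still required, even though the original multiallelic record had none to do.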

tpoterba · October 29, 2017, 1:11pm

We’ve so far not seen an example where a left-aligned multiallelic variant can produce non-left-aligned split biallelics. Do you have an example of that?

We currently rely on input variants being left aligned, and the “minRep” function only trims unnecessary bases from the left+right.

dnafault · October 30, 2017, 4:31am

That wasn’t my point, Tim. It is not left alignment that we can lose in the transition from multiallelic to biallelics. It’s parsimony. Once we lose it, we can lose the normalization property for the biallelics produced by minRep (e.g. variants where ref ends with alt or vice versa).

Let’s see an example: “ref=ACG alt=CG,TAA”. It’s a left-aligned multiallelic that produces left-aligned biallelics. But since parsimony is lost for ACG:CG, minRep will give us AC:C, while the normalized form would be rA:r, where r is whatever is to the left of A in the reference genome if it’s not A (or we need to shift further to the left if it’s A).
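The difference can be made concrete with a reference-aware sketch. It assumes the usual trim-right, extend-left, trim-left formulation of left-aligned, parsimonious normalization, which may or may not match the definition linked earlier in the thread. The function name `normalize`, the 1-based `pos`, and the toy reference strings are all assumptions made for illustration.

```python
def normalize(pos, ref, alt, reference):
    """Left-align and trim a biallelic variant against a reference string.

    `pos` is the 1-based coordinate of the first ref base; `reference` is the
    chromosome sequence. Illustrative sketch, not Hail's implementation.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy references: ACG starts at position 4 in both.
print(normalize(4, "ACG", "CG", "GGTACGTT"))  # (3, 'TA', 'T')
print(normalize(4, "ACG", "CG", "GGAACGTT"))  # (2, 'GA', 'G')
```

In the first call the base to the left of A is T, giving the rA:r form with r = T. In the second the base to the left is another A, so the variant shifts one position further. Trimming without a reference cannot produce either result, because both require moving the coordinate backwards.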

tpoterba · October 30, 2017, 11:28am

Is that a real GATK-produced variant? In my (limited) experience with sequence data, that variant should have been represented starting from the previous base. If so, I’ll go chat with someone on that team today. I think it may also be useful to loop in Konrad or Laurent, who have much more experience with this data!

tpoterba · October 30, 2017, 7:04pm

We’ve discussed this internally and confirmed that Hail doesn’t apply any normalization when splitting multiallelics. We do apply left and right parsimony, which can move the genomic coordinate forward, but not backwards.

We’d still like to know if that’s a real GATK variant - it sounds like most people would represent that specific variant differently (perhaps as two separate polymorphisms), so this would be a good point of discussion with the GATK development team if so.

dnafault · October 31, 2017, 9:27am

No, it wasn’t from GATK. It was to demonstrate the idea. I don’t see how starting from the previous base would help. There is always a possibility that the reference allele ends with the whole alt, regardless of where we started. (Starting from the previous base would also break multiallelic normalization.)

Here is another example, a real variant from our dataset (GATK 3.7). It is a fully normalized multiallelic with a SNP, a complex (SNP+DEL, in truth it is a single deletion without a SNP) and 2 deletions.

19 603124 . TTGAGGTGTGGGTGCCCCTCGTCCCACAGGGAAGAGGGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGAATGCACCAGCCTGAGCTGTGGGCACCCTCGCCCCCCAGGGAGGAATGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGGAAGGCACCAGCCTGAGGTGTGGGCACCCTCGCCCCCCAGGGAGGAATGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGAAGGCACCAGCC CTGAGGTGTGGGTGCCCCTCGTCCCACAGGGAAGAGGGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGAATGCACCAGCCTGAGCTGTGGGCACCCTCGCCCCCCAGGGAGGAATGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGGAAGGCACCAGCCTGAGGTGTGGGCACCCTCGCCCCCCAGGGAGGAATGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGAAGGCACCAGCC,*,C,T

Splitting it leads to a loss of left alignment in the biallelics.

Hence, we have seen how, in the transition from multiallelic to biallelics, we can lose both properties in the biallelics: left alignment and parsimony.
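The condition raised in this post, that one allele ends with the whole of the other, is easy to test for after splitting. The sketch below uses a hypothetical helper name, and the long deletion is represented by a shortened stand-in allele (`TTGAGCC`) rather than the real sequence above.

```python
def right_trim_empties_an_allele(ref, alt):
    """True when one allele is a proper suffix of the other.

    In that case trimming shared trailing bases would leave an empty allele,
    so a fully normalized form needs bases from the reference to the left.
    """
    return ref != alt and (ref.endswith(alt) or alt.endswith(ref))


print(right_trim_empties_an_allele("ACG", "CG"))      # True
print(right_trim_empties_an_allele("ACG", "TAA"))     # False
# Shortened stand-in for a long ref that ends in C, split against alt C.
print(right_trim_empties_an_allele("TTGAGCC", "C"))   # True
# The same stand-in against alt T: the ref does not end with T.
print(right_trim_empties_an_allele("TTGAGCC", "T"))   # False
```

Biallelics flagged by such a check are the ones for which reference-free trimming cannot guarantee a left-aligned result.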
