Based on the results of splitting multi-allelic variants in Hail, I see that it applies normalization. It looks like left-aligned normalization, but I can’t find any mention of the algorithm used or any discussion of the matter. Any pointers to documentation or discussion would be appreciated.
This definition looks pretty correct. We trim identical bases from each side of the reference and alternate allele, and adjust the position as necessary.
This usually isn’t a problem for VCFs created by recent GATK versions – out of ~30M multiallelic variants in the raw gnomAD callset, there wasn’t one case where Hail had to trim anything.
Well, precise details do matter here. I would be surprised if the above-mentioned algorithm was really implemented in Hail, as it requires a full reference genome. Besides, details like the order of trimming also matter. It would be nice to see the implementation in code.
Interesting. I would expect the opposite. Most multiallelic variants in our dataset require normalization: about 9M out of ~14M variants in total, across ~6M multiallelic sites (GATK 3.7).
Here’s the code: https://github.com/hail-is/hail/blob/0.1/src/main/scala/is/hail/variant/Variant.scala#L371
We don’t need the reference to compute the minimal representation, since we assume that the input variant is properly aligned. We’d be interested to see examples of multiallelic variants that do require position normalization – are you able to share a few examples?
I see the reason for the discrepancy. I was talking about biallelics (i.e. whatever we get after splitting). You were talking about normalization of multiallelic variants as a whole. Once we break them down into biallelics, we get the mentioned proportion of variants that would require normalization.
Without left extension, and with the workaround to avoid empty alleles, Hail’s implementation is quite different from the one described here. It does not produce a normalized representation in the general case. See the first two examples.
About that assumption: are there any guarantees that GATK will never produce variants aligned like those first two in joint-called VCFs?
We’ve so far not seen an example where a left-aligned multiallelic variant can produce non-left-aligned split biallelics. Do you have an example of that?
We currently rely on input variants being left aligned, and the “minRep” function only trims unnecessary bases from the left+right.
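For readers following along, here is a rough Python sketch of the trim-only behaviour described above. It is not the Hail source (that is the Scala code linked earlier). The name `min_rep`, the right-then-left trimming order and the "always keep at least one base" rule are assumptions chosen to match the examples discussed in this thread.

```python
def min_rep(pos, ref, alt):
    """Trim shared bases from a biallelic variant without consulting a reference.

    Hypothetical sketch: trims the right side first, then the left side,
    and never lets either allele become empty.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")
    shortest = min(len(ref), len(alt))

    # Trim identical trailing bases, keeping at least one base per allele.
    n_end = 0
    while n_end < shortest - 1 and ref[-1 - n_end] == alt[-1 - n_end]:
        n_end += 1

    # Trim identical leading bases from what is left, again keeping one base.
    n_start = 0
    while n_start < shortest - n_end - 1 and ref[n_start] == alt[n_start]:
        n_start += 1

    # The position can only move forward, by the number of leading bases removed.
    return pos + n_start, ref[n_start:len(ref) - n_end], alt[n_start:len(alt) - n_end]


print(min_rep(100, "TGCA", "TGA"))  # (101, 'GC', 'G')
print(min_rep(1, "ACG", "CG"))      # (1, 'AC', 'C')
```

Because no reference sequence is available, the sketch can never move the position backwards, which is the limitation under discussion.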
That wasn’t my point, Tim. It is not left alignment that we can lose in the transition from multiallelic to biallelics. It’s parsimony. Once we lose it, we can lose the normalization property for the biallelics produced by minRep (e.g. variants where ref ends with alt or vice versa).
Let’s see an example: “ref=ACG alt=CG,TAA”. It’s a left-aligned multiallelic that produces left-aligned biallelics. But since parsimony is lost for ACG:CG, minRep will give us AC:C, while the normalized form would be rA:r, where r is whatever is to the left of A in the reference genome if it’s not A (or we need to shift further to the left if it’s A).
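To make the rA:r claim concrete, here is a minimal sketch of reference-based normalization in the usual "trim right, extend left when an allele becomes empty, then trim left" style. The function name `normalize` and the toy reference strings are made up for illustration. This is an assumption about what the normalization procedure looks like, not code from Hail or GATK.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim a biallelic variant against a reference string.

    ref_seq is a toy reference (hypothetical); pos is 1-based.
    """
    if ref_seq[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference")

    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both by one reference base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Finally drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# ACG sits at position 4 of both toy references.
print(normalize("GGTACGTT", 4, "ACG", "CG"))  # (3, 'TA', 'T'): r = T
print(normalize("GTAACGTT", 4, "ACG", "CG"))  # (2, 'TA', 'T'): base left of A is A, so shift further
```

In both cases the position moves backwards, which trimming alone cannot do without the reference.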
Is that a real GATK-produced variant? In my (limited) experience with sequence data, that variant should have been represented starting from the previous base. If so, I’ll go chat with someone on that team today. I think it may also be useful to loop in Konrad or Laurent, who have much more experience with this data!
We’ve discussed this internally and confirmed that Hail doesn’t apply any normalization when splitting multiallelics. We do apply left and right parsimony, which can move the genomic coordinate forward, but not backwards.
We’d still like to know if that’s a real GATK variant - it sounds like most people would represent that specific variant differently (perhaps as two separate polymorphisms), so this would be a good point of discussion with the GATK development team if so.
No, it wasn’t from GATK. It was to demonstrate the idea. I don’t see how starting from the previous base would help. There is always a possibility that the reference allele ends with the whole alt, regardless of where we started. (Starting from the previous base would also break multiallelic normalization.)
Here is another example: a real variant from our dataset (GATK 3.7). It is a fully normalized multiallelic with a SNP, a complex variant (SNP+DEL; in truth it is a single deletion without a SNP) and 2 deletions.
19 603124 . TTGAGGTGTGGGTGCCCCTCGTCCCACAGGGAAGAGGGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGAATGCACCAGCCTGAGCTGTGGGCACCCTCGCCCCCCAGGGAGGAATGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGGAAGGCACCAGCCTGAGGTGTGGGCACCCTCGCCCCCCAGGGAGGAATGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGAAGGCACCAGCC CTGAGGTGTGGGTGCCCCTCGTCCCACAGGGAAGAGGGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGAATGCACCAGCCTGAGCTGTGGGCACCCTCGCCCCCCAGGGAGGAATGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGGAAGGCACCAGCCTGAGGTGTGGGCACCCTCGCCCCCCAGGGAGGAATGCCCGGGGCTGGTCCCATAGGTGCCTGGGGGAAGGCACCAGCC,*,C,T
Splitting it leads to a loss of left alignment in the biallelics.
Hence, we have seen how, in the transition from multiallelic to biallelics, we can lose both properties in the biallelics: left alignment and parsimony.
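A quick way to spot the affected biallelics without a reference is to check, after splitting, whether one allele is a suffix of the other (the "ref ends with alt or vice versa" case mentioned earlier). The helper below is a hypothetical sketch. The name `split_and_flag` and the decision to skip `*` alleles are my own choices, not Hail behaviour.

```python
def split_and_flag(ref, alts):
    """Split a multiallelic into (ref, alt, needs_reference) triples.

    needs_reference is True when one allele is a suffix of the other, i.e.
    full right-trimming would empty an allele and a base from the reference
    genome would be needed to finish normalization.
    """
    result = []
    for alt in alts:
        if alt == "*" or alt == ref:
            # '*' marks a spanning deletion and carries no sequence to trim.
            continue
        shorter, longer = sorted((ref, alt), key=len)
        result.append((ref, alt, longer.endswith(shorter)))
    return result


for row in split_and_flag("ACG", ["CG", "TAA"]):
    print(row)
# ('ACG', 'CG', True)
# ('ACG', 'TAA', False)
```

Applied to the record above, the same check would flag the `C` allele, since the reference allele ends with `C`, but not the `T` allele.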