Question about split_multi_hts()

tpoterba · November 30, 2020, 2:52pm

These are good questions, thanks! I’ll answer each one individually.

This looks normal to me. Your original alleles were Ref=CCCT, Alts=[CCCTCCT, C]. These are split into two biallelic variants, CCCT/CCCTCCT and CCCT/C. However, Hail also computes the “minimal representation” of each variant after splitting, which shortens CCCT/CCCTCCT to the equivalent but more parsimonious C/CCCT.
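To make the trimming concrete, here is a minimal plain-Python sketch of one common way to compute a minimal representation: strip shared trailing bases, then shared leading bases, while keeping at least one base in each allele. The function name `min_rep_sketch` and the position 100 are made up for illustration, and this is not Hail's actual implementation, only the idea as I understand it.

```python
def min_rep_sketch(pos, ref, alt):
    """Trim bases shared by ref and alt; hypothetical helper, not Hail code."""
    # Drop shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases, moving the position forward each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The two biallelic variants from the example above (position is arbitrary).
print(min_rep_sketch(100, "CCCT", "CCCTCCT"))  # (100, 'C', 'CCCT')
print(min_rep_sketch(100, "CCCT", "C"))        # (100, 'CCCT', 'C')
```

The insertion loses its shared CCT suffix, while the deletion is already minimal because its alternate allele is a single base.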
Each split biallelic variant is a transformation of the original multiallelic variant in which one alternate allele is kept as-is and every other alternate allele is “downcoded” to the reference. For instance, for the split variant corresponding to allele ‘1’ in the original variant, both a 0/1 call and a 1/2 call become 0/1 (the 2 allele is treated as ref).
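Downcoding can be sketched in a few lines of plain Python. The helper `downcode_sketch` below is hypothetical and assumes unphased diploid calls written as tuples of allele indices; it only illustrates the rule described above and is not Hail's code.

```python
def downcode_sketch(call, keep):
    """Recode a call for the split variant that keeps alternate allele `keep`.

    The kept allele becomes 1 and every other allele (the reference and all
    other alternates) becomes 0. Hypothetical helper for illustration only.
    """
    # Sort so that unphased calls are always written ref-first, e.g. (0, 1).
    return tuple(sorted(1 if allele == keep else 0 for allele in call))


# Split variant for allele 1: both 0/1 and 1/2 become 0/1.
print(downcode_sketch((0, 1), keep=1))  # (0, 1)
print(downcode_sketch((1, 2), keep=1))  # (0, 1)
# Split variant for allele 2: a 0/1 call no longer carries the kept allele.
print(downcode_sketch((0, 1), keep=2))  # (0, 0)
```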
I won’t have as good an answer for you as one of the scientists might, but here’s my understanding: many analyses and algorithms make sense in a biallelic world but not in a multiallelic one. The gnomAD team and others split multiallelic variants into biallelic variants early in their analysis because it is much easier to write methods that, for instance, perform quality control checks independently per alternate allele than to do so in a fully multiallelic representation. However, there are certainly many analyses where you’d want to leave variants as multiallelic.
