Hi there!

I’m a student who’s new in bioinformatics, and I have a few questions about `split_multi_hts()`.

Above are the results I get when I run `split_multi_hts()`.
I thought that, normally, a multi-allelic variant (e.g. Ref, Alt1, Alt2) would be split into Ref, Alt1 and Ref, Alt2,
but my results look a little different: Ref, Alt1 seems to show up twice, and I couldn’t find Ref, Alt2.

So my questions are:

1. Is this normal?

2. Why isn’t the function showing all of the combinations? (If Ref/Alt1 and Ref/Alt2 show up, what about Alt1/Alt2?) I read the post here that explains what the function does, but I don’t get why Alt1/Alt2 isn’t chosen.

3. Also, what is the purpose of splitting a single row (a multi-allelic variant) into multiple rows (bi-allelic variants)? I haven’t done any downstream analyses yet (I’m currently working through the Hail tutorial), but is splitting going to be needed in other applications?

Thank you so much!

Min

These are good questions, thanks! I’ll answer each one individually.

1. This looks normal to me. Your original alleles were Ref=CCCT, Alts=[CCCTCCT, C]. This is being split into two biallelic variants, CCCT/CCCTCCT and CCCT/C. However, Hail also computes the “minimal representation” of each variant after splitting, which shortens CCCT/CCCTCCT into the equivalent but more parsimonious C/CCCT.
2. Any individual split biallelic variant is a transformation of the original multiallelic variant where one alternate allele is left as-is, and each other alternate allele is “downcoded” to the reference. For instance, in the split variant corresponding to allele 1 of the original variant, both a 0/1 call and a 1/2 call would become 0/1 (allele 2 is treated as ref). That is why there is no Alt1/Alt2 row: every split row compares one alternate allele against everything else, which is treated as the reference.
3. I won’t have as good an answer for you as one of the scientists might, but here’s my understanding: many analyses and algorithms make sense in a biallelic world, but not in a multiallelic world. The gnomAD team and others split multiallelic variants into biallelic variants early in their analysis because it is much easier to write methods that, for instance, perform quality control checks independently per alternate allele rather than in a fully multiallelic representation. However, there are certainly many analyses where you’d want to leave variants as multiallelic.
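
To make answers 1 and 2 concrete, here is a minimal pure-Python sketch of the two steps: trimming each split variant to its minimal representation, and downcoding calls. It is an illustration only, not Hail's actual implementation. The helper names (`min_rep`, `downcode`, `split_multi`), the `alt_index` key, the position `100`, and the two example calls are all assumptions made up for this example.

```python
def min_rep(pos, ref, alt):
    """Trim bases shared by ref and alt, keeping at least one base in each."""
    # Trim shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, shifting the position to match.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def downcode(call, i):
    """Keep allele i as 1 and treat every other allele as ref (0)."""
    return tuple(sorted(1 if allele == i else 0 for allele in call))


def split_multi(pos, alleles, calls):
    """Produce one biallelic row per alternate allele."""
    ref, alts = alleles[0], alleles[1:]
    rows = []
    for i, alt in enumerate(alts, start=1):
        new_pos, new_ref, new_alt = min_rep(pos, ref, alt)
        rows.append({
            "pos": new_pos,
            "alleles": [new_ref, new_alt],
            "alt_index": i,  # which original alternate allele this row tracks
            "calls": [downcode(c, i) for c in calls],
        })
    return rows


# Hypothetical position and calls; the alleles are the ones from the question.
rows = split_multi(100, ["CCCT", "CCCTCCT", "C"], [(0, 1), (1, 2)])
for row in rows:
    print(row)
```

The first row comes out as C/CCCT (the trimmed form of CCCT/CCCTCCT) and the second as CCCT/C. In the first row both the 0/1 and the 1/2 call become 0/1, while in the second row the 0/1 call becomes 0/0 and the 1/2 call becomes 0/1.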