Question about split_multi_hts()

Hi there!

I’m a student who’s new in bioinformatics, and I have a few questions about split_multi_hts().

Above are the results that I get when I run split_multi_hts().
I thought normally, multi-alleles (e.g. Ref, Alt1, Alt2) would be split into Ref, Alt1 and Ref, Alt2
but here it seems a little different in that Ref,Alt1 shows up twice for my results & I couldn’t find Ref, Alt2.

So my questions are:

  1. Is this normal?

  2. And why isn’t the function showing all of the combinations? (If Ref/Alt1 and Ref/Alt2 showed up, what about Alt1/Alt2?) I read the post here that explains what the function does, but I don’t get why Alt1/Alt2 isn’t chosen.

  3. And also, what is the purpose of splitting a single row (a multi-allelic variant) into multiple rows (bi-allelic variants)? I haven’t done further analyses yet (I’m currently working through the Hail tutorial), but is splitting going to be needed in other applications?

Thank you so much!

Min

These are good questions, thanks! I’ll answer each one individually.

  1. This looks normal to me. Your original alleles were Ref=CCCT, Alts=[ CCCTCCT, C ]. This is being split into two biallelic variants, CCCT/CCCTCCT and CCCT/C. However, after splitting, Hail also computes the “minimal representation” of each variant, which shortens CCCT/CCCTCCT into the equivalent but more parsimonious C/CCCT.
  2. Any individual split biallelic variant is a transformation of the original multiallelic variant where one alternate allele is left as-is, and each other alternate allele is “downcoded” to the reference. For instance, for the split variant corresponding to allele ‘1’ in the original variant, both a 0/1 call and a 1/2 call would become 0/1 (the 2 allele is treated as ref). Alt1/Alt2 never appears as its own row because every split row is expressed relative to the reference.
  3. I won’t have as good an answer for you as one of the scientists might, but here’s my understanding: many analyses and algorithms make sense in a biallelic world, but not in a multiallelic world. The gnomAD team and others split multiallelic variants into biallelic variants early in their analysis because it is much easier to write methods that, for instance, perform quality control checks independently per alternate allele rather than in a fully multiallelic representation. However, there are certainly many analyses where you’d want to leave variants as multiallelic.
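
To make points 1 and 2 concrete, here is a small plain-Python sketch of the two steps described above: trimming each split variant to a minimal representation, and downcoding genotype calls. This is not Hail's implementation. The function names, the position 100, and the suffix-then-prefix trimming order are assumptions made for illustration. Hail exposes related expressions (hl.min_rep and hl.downcode), but nothing below calls Hail.

```python
def min_rep(pos, ref, alt):
    """Trim bases shared by ref and alt, keeping at least one base in each.

    Illustrative only: trims the shared suffix first, then the shared
    prefix (which shifts the position). Not Hail's actual code.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_multi(pos, ref, alts):
    """Produce one (pos, ref, alt) biallelic variant per alternate allele."""
    return [min_rep(pos, ref, alt) for alt in alts]


def downcode(call, keep):
    """Recode a call so allele index `keep` becomes 1 and all others become 0."""
    return tuple(sorted(1 if allele == keep else 0 for allele in call))


# The variant from this thread, at a hypothetical position 100.
for variant in split_multi(100, "CCCT", ["CCCTCCT", "C"]):
    print(variant)

# For the row that keeps allele 1, a 0/1 call and a 1/2 call look the same.
print(downcode((0, 1), keep=1), downcode((1, 2), keep=1))
```

In this sketch, the first split row comes out as C/CCCT and the second stays CCCT/C, which is the pattern point 1 describes. The downcode step shows why a 1/2 sample reads as 0/1 in each split row: the allele that is not kept is folded into the reference slot.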

Hi Tim! Thank you so much for your answer!

But what I don’t get is that, because of the downcoding, the ref allele got the reads that belonged to the alt allele, making it seem that the sample actually has the ref allele.
It makes it seem that the sample now has 3 alleles (ref, alt and alt…). And also, the true genotype of the sample is lost in this way.

So is this ok? Why is this being done?

Will downstream applications know that the ref allele is sort of a “place holder” (for the above example) and gather the actual genotype for that locus?

Or does this not matter since the purpose of reporting the kinds of variants present in the sample is fulfilled?

Best,

Min