Question about split_multi_hts()

These are good questions, thanks! I’ll answer each one individually.

  1. This looks normal to me. Your original alleles were Ref=CCCT, Alts=[CCCTCCT, C]. Splitting produces two biallelic variants, CCCT/CCCTCCT and CCCT/C. Hail then computes the “minimal representation” of each split variant, which shortens CCCT/CCCTCCT to the equivalent but more parsimonious C/CCCT by trimming the trailing CCT that both alleles share.
  2. Each split biallelic variant is a transformation of the original multiallelic variant in which one alternate allele is left as-is and every other alternate allele is “downcoded” to the reference. For instance, in the split variant corresponding to allele ‘1’ of the original variant, both a 0/1 call and a 1/2 call become 0/1 (the 2 allele is treated as ref).
  3. I won’t have as good an answer for you as one of the scientists might, but here’s my understanding: many analyses and algorithms make sense in a biallelic world but not in a multiallelic one. The gnomAD team and others split multiallelic variants into biallelic variants early in their analysis because it is much easier to write methods that, for instance, perform quality control checks independently per alternate allele than to write them against a fully multiallelic representation. That said, there are certainly many analyses where you’d want to leave variants as multiallelic.
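
To make points 1 and 2 concrete, here is a small pure-Python sketch of the two steps. It is only an illustration of the idea, not Hail's actual implementation: the function names, the tuple representation of calls, and the position 100 are all assumptions made for the example. (Hail itself exposes similar operations as hl.min_rep and hl.downcode, if you want to experiment with the real thing.)

```python
def min_rep(pos, ref, alt):
    """Trim bases shared by ref and alt (illustrative, not Hail's code)."""
    # Trim the shared suffix, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim the shared prefix, shifting the position for each base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def downcode(call, kept):
    """Map a multiallelic call to biallelic: kept allele -> 1, all others -> 0."""
    return tuple(sorted(1 if allele == kept else 0 for allele in call))


def split_multi(pos, ref, alts, call):
    """Yield one (pos, ref, alt, downcoded_call) per alternate allele."""
    for i, alt in enumerate(alts, start=1):
        new_pos, new_ref, new_alt = min_rep(pos, ref, alt)
        yield new_pos, new_ref, new_alt, downcode(call, i)


# Hypothetical position 100; alleles taken from the question; sample call 1/2.
for row in split_multi(100, "CCCT", ["CCCTCCT", "C"], (1, 2)):
    print(row)
# (100, 'C', 'CCCT', (0, 1))
# (100, 'CCCT', 'C', (0, 1))
```

Note that the 1/2 call becomes 0/1 in both split rows: in each row the "other" alternate allele is treated as ref, which is exactly the downcoding described in point 2.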