`splitmulti` representation possibly incorrect?

Your concern is valid, and this question has come up several times among Hail users at Broad. Splitting multiallelic sites is a hard problem because there’s no way to do it without losing some information, so one must decide which information to discard. The behavior you noted is intentional: Hail’s splitmulti algorithm produces a new variant for each alternate allele by downcoding all other alternates to reference. In the A: C,T example, we split into (A || T) C and (A || C) T. That is, one variant treats T as reference alongside A and keeps C as the alternate, and the other treats C as reference and keeps T as the alternate. A heterozygous non-ref call (1/2) therefore splits to 0/1 and 0/1 by design.
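To make the downcoding rule concrete, here is a minimal sketch in plain Python. It is not Hail’s implementation or API: the names `downcode` and `split_call` are hypothetical, and the sketch assumes unphased diploid calls written as pairs of allele indices, with `None` standing for a missing call.

```python
from typing import List, Optional, Tuple

# A diploid call as a pair of allele indices (0 = reference, 1.. = alternates).
# None stands for a missing call (./.). This encoding is an assumption made
# for illustration, not Hail's internal representation.
Call = Optional[Tuple[int, int]]


def downcode(call: Call, kept_alt: int) -> Call:
    """Recode one call for the split variant that keeps `kept_alt`.

    The kept alternate becomes allele 1; the reference and every other
    alternate become allele 0.
    """
    if call is None:
        return None
    recoded = sorted(1 if allele == kept_alt else 0 for allele in call)
    return (recoded[0], recoded[1])


def split_call(call: Call, n_alts: int) -> List[Call]:
    """Return the downcoded call for each of the `n_alts` split variants."""
    return [downcode(call, kept_alt) for kept_alt in range(1, n_alts + 1)]


if __name__ == "__main__":
    # Site A: C,T has two alternates. A 1/2 call becomes 0/1 in both splits.
    print(split_call((1, 2), n_alts=2))
```

For a 2/2 call at the same site, the sketch gives 0/0 in the C variant and 1/1 in the T variant.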

It’s unclear whether ./1 is actually valid under the VCF spec, which partly motivated our design choice. We also haven’t seen downstream tools that handle ./1 appropriately. Downcoding also seems to behave well in downstream analyses (GWAS and RVAS), because it preserves:

  • total number of called alternate alleles
  • call rate
  • genotype quality at each call
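The first two properties can be checked on a toy example. The sketch below is plain Python, not Hail code. The sample calls are made up for illustration, and the helper names are hypothetical. It reads "total number of called alternate alleles" as the count summed over all split variants, which is an interpretation rather than something stated above. Genotype quality is not modelled, since downcoding only recodes the call.

```python
from typing import List, Optional, Tuple

# Diploid call as allele indices (0 = ref); None = missing. Illustrative only.
Call = Optional[Tuple[int, int]]


def downcode(call: Call, kept_alt: int) -> Call:
    """Keep `kept_alt` as allele 1 and recode every other allele to 0."""
    if call is None:
        return None
    recoded = sorted(1 if allele == kept_alt else 0 for allele in call)
    return (recoded[0], recoded[1])


def count_alt_alleles(calls: List[Call]) -> int:
    """Count non-reference alleles over all non-missing calls."""
    return sum(1 for c in calls if c is not None for a in c if a != 0)


def call_rate(calls: List[Call]) -> float:
    """Fraction of samples with a non-missing call."""
    return sum(c is not None for c in calls) / len(calls)


# Made-up calls for four samples at the site A: C,T.
original: List[Call] = [(0, 1), (1, 2), (2, 2), None]
split_c = [downcode(c, 1) for c in original]  # variant A -> C
split_t = [downcode(c, 2) for c in original]  # variant A -> T

# Alternate alleles summed over both split variants match the original site.
assert count_alt_alleles(split_c) + count_alt_alleles(split_t) == count_alt_alleles(original)

# Each split variant has the same call rate as the original site.
assert call_rate(split_c) == call_rate(split_t) == call_rate(original)
```

By contrast, a scheme that writes the 1/2 call as ./1 in each split variant would give those variants partially missing calls, which is the case most downstream tools do not handle.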

We are by no means asserting that this is the only way to split, however! Could you provide us with a little more information about what sorts of downstream analyses you’re planning to do, and how they can use information about ./1?

One more note: Hail flags each genotype that was downcoded in splitmulti. You can ask whether a genotype contained another allele using the g.fakeRef method, documented here: https://hail.is/reference.html. This information is not exported to VCF, though.