`splitmulti` representation possibly incorrect?

Super awesome to see this work. I do however have an issue with the splitmulti example.

In you example, you convert this:
A C,T 1/2:2,8,6:16:45:99,50,99,45,0,99
to this:
A C 0/1:8,8:16:45:45,0,99 A T 0/1:10,6:16:50:50,0,99

However, this representation incorrectly states that this sample has at least 1 allele that is reference. Instead of 0/1 in the modified representation, shouldn’t it be ./1 instead?

Your concern is certainly valid, and this is a question that has come up several times among the Hail users at Broad. Multiallelic splitting is a challenging problem because there’s no way to do it without losing some information, so one must decide which information can be discarded. The behavior you noted is intentional: the Hail splitmulti algorithm produces a new variant for each alternate allele by downcoding all other alternates to reference. In the A: C,T example, we split to (A || T) C and (A || C) T. A heterozygous non-ref (1/2 call) will split to 0/1 and 0/1 by design.

It’s unclear whether ./1 is actually valid within the VCF spec, which contributed partly to our design choices. We also haven’t seen other downstream tools that deal with ./1 appropriately. Downcoding also seems to behave nicely for downstream analyses (GWAS and RVAS) – it has the nice property of preserving:

  • total number of called alternate alleles
  • call rate
  • genotype quality at each call

We are by no means asserting that this is the only way to split, however! Could you provide us with a little more information about what sorts of downstream analyses you’re planning to do, and how they can use information about ./1?

One more note: Hail provides a little note on each genotype that was downcoded in splitmulti. You can ask whether a genotype contained another allele using the g.fakeRef method here: https://hail.is/reference.html. This information is not exported to VCF, though.

Here’s another thing that happens:

When you splitmulti, you take something like
1 13414 . C A,T AC=3,1;AF=7.253E-5,2.418E-5;AN=41362;

and end up with something like
1 13414 . C A AC=3,1;AF=7.253E-5,2.418E-5;AN=41362;
1 13414 . C T AC=3,1;AF=7.253E-5,2.418E-5;AN=41362;

which contains two values for AC and AF in each of the lines produced. This is expected behavior in a multiallelic VCF, but should use only one of those values for each biallelic line. When I see those last two lines, I assume that AC for A is 3 and AC for T is 1, but there’s no indication that this is actually the case. Also it’s a lot harder to parse.

That particular example would not be compliant with the spec. The parser would have to properly account for those cases.

In general, I prefer to be correct than easy given how often things change in this field. Who knows how work arounds will affect things downstream.