All,
Super awesome to see this work. I do however have an issue with the splitmulti example.
In you example, you convert this: A C,T 1/2:2,8,6:16:45:99,50,99,45,0,99
to this: A C 0/1:8,8:16:45:45,0,99 A T 0/1:10,6:16:50:50,0,99
However, this representation incorrectly states that this sample has at least 1 allele that is reference. Instead of 0/1 in the modified representation, shouldn’t it be ./1 instead?
Your concern is certainly valid, and this is a question that has come up several times among the Hail users at Broad. Multiallelic splitting is a challenging problem because there’s no way to do it without losing some information, so one must decide which information can be discarded. The behavior you noted is intentional: the Hail splitmulti algorithm produces a new variant for each alternate allele by downcoding all other alternates to reference. In the A: C,T example, we split to (A || T) C and (A || C) T. A heterozygous non-ref (1/2 call) will split to 0/1 and 0/1 by design.
It’s unclear whether ./1 is actually valid within the VCF spec, which contributed partly to our design choices. We also haven’t seen other downstream tools that deal with ./1 appropriately. Downcoding also seems to behave nicely for downstream analyses (GWAS and RVAS) – it has the nice property of preserving:
total number of called alternate alleles
call rate
genotype quality at each call
We are by no means asserting that this is the only way to split, however! Could you provide us with a little more information about what sorts of downstream analyses you’re planning to do, and how they can use information about ./1?
One more note: Hail provides a little note on each genotype that was downcoded in splitmulti. You can ask whether a genotype contained another allele using the g.fakeRef method here: https://hail.is/reference.html. This information is not exported to VCF, though.
When you splitmulti, you take something like 1 13414 . C A,T AC=3,1;AF=7.253E-5,2.418E-5;AN=41362;
and end up with something like 1 13414 . C A AC=3,1;AF=7.253E-5,2.418E-5;AN=41362; 1 13414 . C T AC=3,1;AF=7.253E-5,2.418E-5;AN=41362;
which contains two values for AC and AF in each of the lines produced. This is expected behavior in a multiallelic VCF, but should use only one of those values for each biallelic line. When I see those last two lines, I assume that AC for A is 3 and AC for T is 1, but there’s no indication that this is actually the case. Also it’s a lot harder to parse.