Super awesome to see this work. I do however have an issue with the
In you example, you convert this:
A C,T 1/2:2,8,6:16:45:99,50,99,45,0,99
A C 0/1:8,8:16:45:45,0,99 A T 0/1:10,6:16:50:50,0,99
However, this representation incorrectly states that this sample has at least 1 allele that is reference. Instead of
0/1 in the modified representation, shouldn’t it be
Your concern is certainly valid, and this is a question that has come up several times among the Hail users at Broad. Multiallelic splitting is a challenging problem because there’s no way to do it without losing some information, so one must decide which information can be discarded. The behavior you noted is intentional: the Hail
splitmulti algorithm produces a new variant for each alternate allele by downcoding all other alternates to reference. In the
A: C,T example, we split to
(A || T) C and
(A || C) T. A heterozygous non-ref (1/2 call) will split to 0/1 and 0/1 by design.
It’s unclear whether
./1 is actually valid within the VCF spec, which contributed partly to our design choices. We also haven’t seen other downstream tools that deal with
./1 appropriately. Downcoding also seems to behave nicely for downstream analyses (GWAS and RVAS) – it has the nice property of preserving:
- total number of called alternate alleles
- call rate
- genotype quality at each call
We are by no means asserting that this is the only way to split, however! Could you provide us with a little more information about what sorts of downstream analyses you’re planning to do, and how they can use information about
One more note: Hail provides a little note on each genotype that was downcoded in
splitmulti. You can ask whether a genotype contained another allele using the
g.fakeRef method here: https://hail.is/reference.html. This information is not exported to VCF, though.
Here’s another thing that happens:
splitmulti, you take something like
1 13414 . C A,T AC=3,1;AF=7.253E-5,2.418E-5;AN=41362;
and end up with something like
1 13414 . C A AC=3,1;AF=7.253E-5,2.418E-5;AN=41362;
1 13414 . C T AC=3,1;AF=7.253E-5,2.418E-5;AN=41362;
which contains two values for
AF in each of the lines produced. This is expected behavior in a multiallelic VCF, but should use only one of those values for each biallelic line. When I see those last two lines, I assume that AC for A is 3 and AC for T is 1, but there’s no indication that this is actually the case. Also it’s a lot harder to parse.
That particular example would not be compliant with the spec. The parser would have to properly account for those cases.
In general, I prefer to be correct than easy given how often things change in this field. Who knows how work arounds will affect things downstream.