As shown, the variant “CAAAAAA:A” in the VCF is mistakenly shown as “C:A” after split_multi_hts(), with 50 carriers. It looks like Hail can only take “C” as the ref and does not consider the other ALT alleles as the ref during splitting.
Could you let me know if there is a fix for that? If not, what do you suggest for identifying similar split variants like that?
I apologize if I am misunderstanding; I’m not a geneticist. But I believe this is working as intended. The documentation for split_multi_hts goes into a lot of detail.
In particular, any sample with genotype 1/2 (i.e. CAAAAAA:A) before splitting will have, after splitting, genotype 0/1 at variant ["C", "A"] and genotype 0/1 at variant ["C", "CAAAAAA"], and 0/0 at all other split variants at this locus.
For this multi-allelic locus before splitting, the ref is “C” (0) and the alts are “CAAAAAA” (1) and “A” (2). Looking at the VCF line, “C” is not the ref for the “1/2” genotype, which is “CAAAAAA:A” (if “C” were the ref, the genotype would be “0/1” or “0/2” before splitting). So, after splitting, the ref for “1/2” should be “CAAAAAA” and the alt should be “A”.
I agree with you that “0/1” becomes “C:CAAAAAA” after splitting and “0/2” becomes “C:A”, but not that “1/2” does.
Please let me know if my understanding is mistaken. Thanks!
Tim and Dan have sadly both left Hail for new adventures.
This isn’t how split_multi_hts works. The “notes” section of the documentation here has an example 3-allelic variant worked out in detail. But in brief, the reference allele is never changed; it splits into two bi-allelic variants with the original reference and one of the two alt alleles. A call for an allele not represented in the split row is mapped to 0, a call for the alt allele represented by this row is mapped to 1, and ref calls stay 0.
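To make the mapping rule concrete, here is a minimal pure-Python sketch of the downcoding described above, applied to the alleles from this thread. It is not Hail code: `downcode` is a hypothetical helper written for illustration, it only handles unphased diploid calls, and it ignores the other entry fields that split_multi_hts also transforms.

```python
def downcode(call, alt_index):
    """Map a pre-split diploid call onto the bi-allelic row for alt_index.

    Hypothetical helper, not part of Hail. An allele equal to alt_index
    becomes 1; every other allele (the ref or a different alt) becomes 0.
    """
    return tuple(sorted(1 if allele == alt_index else 0 for allele in call))


# Alleles at the locus discussed in this thread: index 0 is the reference.
alleles = ["C", "CAAAAAA", "A"]

# For each pre-split call, compute the call on each split row.
split = {}
for call in [(0, 1), (0, 2), (1, 2)]:
    for alt_index in range(1, len(alleles)):
        row = (alleles[0], alleles[alt_index])  # the reference is never changed
        split[(call, row)] = downcode(call, alt_index)

# A 1/2 sample is 0/1 on both split rows, so it counts as a carrier of C:A.
print(split[((1, 2), ("C", "CAAAAAA"))])
print(split[((1, 2), ("C", "A"))])
```

Under this rule a sample that was 1/2 before splitting shows up as a het carrier on the [“C”, “A”] row, which is consistent with the carrier count reported in the original question.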