REF/ALT mismatch after splitting multi-allelic lines


I am a researcher working with WGS data. I have a question about how the Hail function split_multi_hts() splits multi-allelic lines.

The position that needs to be split looks like this in the VCF:

As shown, it is a multi-allelic line with REF “C” and ALT alleles 1 to 5.
The genotype “1/2”, which indicates “CAAAAAA:A”, has 50 carriers.

The Hail matrix table before the split looks like this:

After the split:

As shown, the variant “CAAAAAA:A” from the VCF is mistakenly shown as “C:A” after split_multi_hts(), with 50 carriers. It looks like Hail only takes “C” as the ref and does not consider the other ALT alleles as the ref during splitting.

Could you let me know if there is a fix for that? If not, what do you suggest for identifying split variants like this one?

Thank you!


Hi @Wen_He,

I apologize if I am misunderstanding; I’m not a geneticist. But I believe this is working as intended. The documentation for split_multi_hts goes into a lot of detail.

In particular, any sample with genotype 1/2 (i.e. CAAAAAA:A) before splitting will have, after splitting, genotype 0/1 at variant ["C", "A"] and genotype 0/1 at variant ["C", "CAAAAAA"], and 0/0 at all other split variants at this locus.
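To make that mapping concrete, here is a minimal plain-Python sketch of the downcoding rule. It does not call Hail: the `downcode` helper below is a hypothetical stand-in written for this post, it assumes unphased diploid calls, and it only uses the REF and the two ALT alleles named in this thread (the other ALTs on the line are left out).

```python
def downcode(call, a_index):
    """Keep allele a_index as 1 and map every other allele to 0 (unphased, diploid)."""
    return tuple(sorted(1 if allele == a_index else 0 for allele in call))

# REF plus the two ALT alleles discussed in this thread.
alleles = ["C", "CAAAAAA", "A"]
call = (1, 2)  # the 1/2 genotype before splitting

# One split row per ALT allele; the REF allele is always alleles[0].
split_rows = {
    (alleles[0], alleles[i]): downcode(call, i)
    for i in range(1, len(alleles))
}

for (ref, alt), gt in split_rows.items():
    print(f"{ref}:{alt} -> {gt[0]}/{gt[1]}")
# C:CAAAAAA -> 0/1
# C:A -> 0/1
```

So a 1/2 sample is counted as a het carrier in both the C:CAAAAAA row and the C:A row, which would explain seeing those 50 carriers under “C:A”.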

Thanks @patrick-schultz !

For this multi-allelic locus before the split, the ref is “C” (0) and the alts are CAAAAAA (1) and A (2). Looking at the VCF line, “C” is not the ref for the “1/2” case, which is “CAAAAAA:A” (if C were the ref, the genotype would be “0/1” or “0/2” before the split). So, after the split, the ref for “1/2” would be “CAAAAAA” and the alt would be “A”.

I agree with you that “0/1” becomes “C:CAAAAAA” after the split and “0/2” becomes “C:A”, but I do not think that holds for “1/2”.

Please let me know if my understanding is mistaken. Thanks!

@danking @tpoterba could you please share some thoughts? Thanks!

Tim and Dan have sadly both left Hail for new adventures. :cry:

This isn’t how split_multi_hts works. The “notes” section of the documentation here has an example 3-allelic variant worked out in detail. But in brief, the reference allele is never changed; the 3-allelic variant splits into two bi-allelic variants, each with the original reference and one of the two alt alleles. A call for an allele not represented in the split row is mapped to 0, a call for the alt allele represented by this row is mapped to 1, and ref calls stay 0.
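Since a 1/2 call ends up as 0/1 in two different split rows, one way to find such samples is to flag them before splitting. The following is only a plain-Python sketch of that check; the sample IDs and genotypes are made up for illustration, and the helper is hypothetical. In Hail itself, I believe the equivalent test on a call expression is `is_het_non_ref()`, but please check the documentation for your version.

```python
def is_het_non_ref(call):
    """True when a diploid call carries two different alternate alleles, e.g. 1/2."""
    a, b = call
    return a != b and a > 0 and b > 0

# Hypothetical sample IDs and pre-split genotypes, for illustration only.
calls = {
    "s1": (0, 1),
    "s2": (1, 2),
    "s3": (0, 2),
    "s4": (2, 2),
    "s5": (0, 0),
}

# Samples whose genotype will be spread over two split rows.
flagged = sorted(sample for sample, call in calls.items() if is_het_non_ref(call))
print(flagged)
# ['s2']
```

Annotating the pre-split data with a flag like this lets you carry it through the split and tell a true “C:A” het apart from a sample that was originally 1/2.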