Hello!
I’ve developed a pipeline using hail that runs QC and calculates stratified AFs fields by sex and ancestry. While working with hl.split_multi_hts() and symbolic alleles (*), I noticed that genotypes are not updated correctly when splitting multiallelic variants that contain spanning deletions.
Context: Position 7:55181318 has a deletion (ACC>G) that spans position 7:55181320. At position 7:55181320, the * allele represents the spanning deletion from the upstream variant.
Before the splitting:
7:55181318 | [“ACC”,“G”] | “SAMPLE1” | 0|0
7:55181318 | [“ACC”,“G”] | “SAMPLE2” | 1|1
7:55181318 | [“ACC”,“G”] | “SAMPLE3” | 0|1
7:55181318 | [“ACC”,“G”] | “SAMPLE4” | 0|1
7:55181320 | [“C”,“T”,“*”] | “SAMPLE1” | 0|1
7:55181320 | [“C”,“T”,“*”] | “SAMPLE2” | 2|2
7:55181320 | [“C”,“T”,“*”] | “SAMPLE3” | 1|2
7:55181320 | [“C”,“T”,“*”] | “SAMPLE4” | 0|2
After the splitting keep_star=False:
7:55181320 | [“C”,“T”] | “SAMPLE1” | 0|1
7:55181320 | [“C”,“T”] | “SAMPLE2” | 0|0 ← incorrect: was 2|2 (*|*)
7:55181320 | [“C”,“T”] | “SAMPLE3” | 1|0 ← incorrect: was 1|2 (T|*)
7:55181320 | [“C”,“T”] | “SAMPLE4” | 0|0 ← technically incorrect: was 0|2 (C|*)
Genotype calls that reference the * allele should become missing (.) after splitting, since the reference base at this position doesn’t exist on haplotypes carrying the spanning deletion:
7:55181320 | [“C”,“T”] | “SAMPLE1” | 0|1
7:55181320 | [“C”,“T”] | “SAMPLE2” | .|.
7:55181320 | [“C”,“T”] | “SAMPLE3” | 1|.
7:55181320 | [“C”,“T”] | “SAMPLE4” | 0|.
With keep_star=True: The split variants retain the * allele, but genotypes are still incorrectly remapped:
7:55181320 | [“C”,“T”] | “SAMPLE1” | 0|1 → this one is correct
7:55181320 | [“C”,“T”] | “SAMPLE2” | 0|0 → was 2|2, should be .|.
7:55181320 | [“C”,“T”] | “SAMPLE3” | 1|0 → was 1|2, has T but not C
7:55181320 | [“C”,“T”] | “SAMPLE4” | 0|0 → this one is correct, but technically is hemizygous not HOM REF
7:55181320 | [“C”,“*”] | “SAMPLE1” | 0|0 → this one is correct
7:55181320 | [“C”,”*“] | “SAMPLE2” | 1|1 → this is correct
7:55181320 | [“C”,” *“] | “SAMPLE3” | 0|1 → was 1|2, doesn’t have C at all
7:55181320 | [“C”,”*"] | “SAMPLE4” | 0|1 → this one is correct
The issue is that split_multi_hts() remaps allele indices but doesn’t properly handle cases where the remapped allele no longer exists for that sample due to a spanning deletion.
Thanks before hand for your help!