Symbolic Alleles and Overlapping Variants

Hello!

I’ve developed a pipeline using hail that runs QC and calculates stratified AFs fields by sex and ancestry. While working with hl.split_multi_hts() and symbolic alleles (*), I noticed that genotypes are not updated correctly when splitting multiallelic variants that contain spanning deletions.

Context: Position 7:55181318 has a deletion (ACC>G) that spans position 7:55181320. At position 7:55181320, the * allele represents the spanning deletion from the upstream variant.

Before the splitting:

7:55181318 | [“ACC”,“G”] | “SAMPLE1” | 0|0
7:55181318 | [“ACC”,“G”] | “SAMPLE2” | 1|1
7:55181318 | [“ACC”,“G”] | “SAMPLE3” | 0|1
7:55181318 | [“ACC”,“G”] | “SAMPLE4” | 0|1

7:55181320 | [“C”,“T”,“*”] | “SAMPLE1” | 0|1
7:55181320 | [“C”,“T”,“*”] | “SAMPLE2” | 2|2
7:55181320 | [“C”,“T”,“*”] | “SAMPLE3” | 1|2
7:55181320 | [“C”,“T”,“*”] | “SAMPLE4” | 0|2

After the splitting keep_star=False:

7:55181320 | [“C”,“T”] | “SAMPLE1” | 0|1
7:55181320 | [“C”,“T”] | “SAMPLE2” | 0|0 ← incorrect: was 2|2 (*|*)
7:55181320 | [“C”,“T”] | “SAMPLE3” | 1|0 ← incorrect: was 1|2 (T|*)
7:55181320 | [“C”,“T”] | “SAMPLE4” | 0|0 ← technically incorrect: was 0|2 (C|*)

Genotype calls that reference the * allele should become missing (.) after splitting, since the reference base at this position doesn’t exist on haplotypes carrying the spanning deletion:

7:55181320 | [“C”,“T”] | “SAMPLE1” | 0|1
7:55181320 | [“C”,“T”] | “SAMPLE2” | .|.
7:55181320 | [“C”,“T”] | “SAMPLE3” | 1|.
7:55181320 | [“C”,“T”] | “SAMPLE4” | 0|.

With keep_star=True: The split variants retain the * allele, but genotypes are still incorrectly remapped:

7:55181320 | [“C”,“T”] | “SAMPLE1” | 0|1 → this one is correct
7:55181320 | [“C”,“T”] | “SAMPLE2” | 0|0 → was 2|2, should be .|.
7:55181320 | [“C”,“T”] | “SAMPLE3” | 1|0 → was 1|2, has T but not C
7:55181320 | [“C”,“T”] | “SAMPLE4” | 0|0 → this one is correct, but technically is hemizygous not HOM REF

7:55181320 | [“C”,“*”] | “SAMPLE1” | 0|0 → this one is correct
7:55181320 | [“C”,”*“] | “SAMPLE2” | 1|1 → this is correct
7:55181320 | [“C”,” *“] | “SAMPLE3” | 0|1 → was 1|2, doesn’t have C at all
7:55181320 | [“C”,”*"] | “SAMPLE4” | 0|1 → this one is correct

The issue is that split_multi_hts() remaps allele indices but doesn’t properly handle cases where the remapped allele no longer exists for that sample due to a spanning deletion.

Thanks before hand for your help!

Thank you for the feedback. This is behaving as designed.

Hail does not permit half missing calls (1|.); this aligns with the VCF specification. Given that, there is no solution when splitting a spanning deletion that preserves the knowledge that the call included a spanning deletion (for the non-star alternate alleles).

You could annotate the genotypes to indicate if a spanning deletion was present before splitting.

# for performance reasons, save the index of the `*` allele on the rows
mt = mt.annotate_rows(star_idx=mt.alleles.index('*'))
mt = mt.annotate_entries(has_star_call=hl.coalesce(mt.GT.contains_allele(mt.star_idx), False))

Or do so with each haplotype in the call

# add star_idx as above
mt = mt.annotate_entries(is_star_allele=hl.range(mt.GT.ploidy).map(lambda i: mt.GT[i] == mt.star_idx))

which will produce a boolean (or missing) for each haplotype indicating if it was * before splitting.

I see, also I’ve checked and other tools behave the same with symbolic alleles.

Thank you so much for your time!

Mireia