How is the "alleles" field in a key generated?

is.hail.utils.HailException: RVD error! Keys found out of order:
Current key: { locus: { contig: chr1, position: 38046636 }, alleles: [2; GGCC, G] }
Previous key: { locus: { contig: chr1, position: 38046636 }, alleles: [2; GGCCGCC, G] }
This error can occur after a split_multi if the dataset
contains both multiallelic variants and duplicated loci.

I read through two previous relevant posts, and there might be a solution. However, my question is how this issue happens in the first place if our variants look like the ones below. Specifically, I don't understand how keys are generated from a VCF, and I couldn't find where the Hail documentation describes how alleles are formatted, for example:

[2; GGCC, G]

Could anyone explain a little more? Thanks!

Variants in vcf:

chr1 38046636 . GGCCGCC G
chr1 38046636 . GGCCGCC GGCC
chr1 38046636 . GGCCGCC GGCCGCCGCC
chr1 38046640 . G A
chr1 38046640 . G *

split_multi involves computing the minimal representation of a variant, which transforms GGCCGCC/GGCC into GGCC/G. That’s what you’re seeing in the error.
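To make that concrete, here is a minimal pure-Python sketch of the usual minimal-representation rule (trim shared trailing bases, then shared leading bases) applied to the three rows at position 38046636. It is an illustration rather than Hail's actual implementation: the helper name `min_rep` is chosen here for the example, and the ordering check assumes keys compare by position first and then by allele strings, which is consistent with the error message above.

```python
def min_rep(pos, ref, alt):
    """Trim shared trailing, then shared leading, bases from a ref/alt pair."""
    # Trim a shared suffix while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim a shared prefix, advancing the position for each base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The three rows at chr1:38046636, in the order they appear in the VCF.
rows = [
    (38046636, "GGCCGCC", "G"),
    (38046636, "GGCCGCC", "GGCC"),
    (38046636, "GGCCGCC", "GGCCGCCGCC"),
]

keys = [min_rep(*row) for row in rows]
for key in keys:
    print(key)
# (38046636, 'GGCCGCC', 'G')
# (38046636, 'GGCC', 'G')
# (38046636, 'G', 'GGCC')

# Assumption: keys sort by position, then by allele strings.
in_order = keys == sorted(keys)
print(in_order)  # False: 'GGCC' sorts before 'GGCCGCC'
```

The rows are sorted as written in the VCF because they share the same REF, but after trimming, the second row's key sorts before the first row's key. That is the "Current key" / "Previous key" pair reported in the error.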

The core problem here is that split_multi doesn't really play nicely with data that has already been split (particularly by another tool). It looks like your input has already had its multiallelics split into biallelics, so do you need to split at all?


Can we check in Hail whether split_multi is needed before running it, so that we avoid the error? We would then just use an if-else and skip split_multi when it isn't necessary.

Yes, definitely:

# True if any row has more than two alleles (the reference plus two or more alternates)
contains_multiallelics = mt.aggregate_rows(hl.agg.max(hl.len(mt.alleles)) > 2)