Hi @ConnorBarnhill !
Sorry for the delay in a response, I was out on vacation last week.
Hail’s performance and correctness rely critically on the ordering of data. All data in a Hail Table or MatrixTable is ordered by the “key”. In most genetics datasets, the key is compound, comprising the locus and the alleles.
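For example, you can inspect the key of a dataset like this (a minimal sketch, assuming a VCF-derived MatrixTable; the path is hypothetical):

```python
import hail as hl

# Hypothetical input; any VCF-derived MatrixTable will do.
mt = hl.import_vcf('data/example.vcf.bgz')
print(mt.row_key.dtype)  # e.g. struct{locus: locus<GRCh37>, alleles: array<str>}
mt.describe()            # full schema, including the row key
```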
`split_multi_hts` assumes that the input dataset has one row per locus, because GATK VCFs follow that convention. Under that assumption, `split_multi_hts` transforms the dataset to have one row per variant. When the input has one row per locus, Hail can perform this transformation quickly and efficiently; when the input has more than one row per locus, it cannot. `split_multi_hts` therefore performs the efficient transformation first, then efficiently verifies the ordering of the result. The error message you see comes from that verification. There is no ordering of your data that would produce a correctly ordered output of `split_multi_hts`.
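If you want to confirm that this is what's happening, you can count the loci that appear on more than one row (a sketch, assuming your data is already a MatrixTable; the path is hypothetical):

```python
import hail as hl

mt = hl.read_matrix_table('data/my_dataset.mt')  # hypothetical path

# Group the row table by locus and count loci that occur on more than one row;
# any nonzero count means the one-row-per-locus assumption does not hold.
rows = mt.rows()
per_locus = rows.group_by(rows.locus).aggregate(n_rows=hl.agg.count())
n_duplicated = per_locus.filter(per_locus.n_rows > 1).count()
print(f'loci with more than one row: {n_duplicated}')
```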
Anyway, you have two options:
- use `split_multi_hts(..., permit_shuffle=True)` to fall back to a slow and inefficient transformation instead (see the sketch after this list).
- remove the already-split variants from your dataset and only perform splitting on the multi-allelic variants.
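For the first option, a minimal sketch (assuming a recent Hail version, where the keyword argument is `permit_shuffle`) looks like this:

```python
import hail as hl

mt = hl.read_matrix_table('data/my_dataset.mt')  # hypothetical path

# permit_shuffle lets split_multi_hts fall back to a full shuffle when the
# one-row-per-locus assumption does not hold: correct, but much slower.
split = hl.split_multi_hts(mt, permit_shuffle=True)
```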
The second option should be quite efficient; you can do it with something like this:
```python
import hail as hl

mt = hl.read_matrix_table(...)
# Separate the multi-allelic rows from the already-biallelic rows.
multi = mt.filter_rows(hl.len(mt.alleles) > 2)
bi = mt.filter_rows(hl.len(mt.alleles) == 2)
# Add the row fields that split_multi_hts creates, so the schemas match for union_rows.
bi = bi.annotate_rows(a_index=1, was_split=False)
split = hl.split_multi_hts(multi)
mt = split.union_rows(bi)
```