Hi @ConnorBarnhill !

Sorry for the delay in a response, I was out on vacation last week.

Hail’s performance and correctness rely critically on the ordering of data. All data in a Hail Table or MatrixTable is ordered by the “key”. In most genetics datasets, the key is compound, comprising the locus and the alleles.

`split_multi_hts` assumes that the input dataset has one row per locus. Under that assumption, `split_multi_hts` transforms the dataset to have one row per variant. When the input has one row per locus, Hail can perform this transformation efficiently; when the input has more than one row per locus, it cannot.
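To see why the number of rows per locus matters, here is a hypothetical plain-Python sketch (not the Hail API): rows are keyed by `(locus, alleles)`, and splitting a multiallelic row emits one biallelic row per alternate allele at the same locus. When each locus appears once, splitting each row locally preserves the global key order; when a locus appears more than once, it can break it.

```python
def split_row(locus, alleles):
    """Hypothetical local split: each alt allele becomes its own biallelic row."""
    ref, *alts = alleles
    # Emit one (locus, [ref, alt]) row per alt, sorted within this row.
    return [(locus, [ref, alt]) for alt in sorted(alts)]

# One row per locus: splitting row-by-row keeps the output globally ordered.
one_per_locus = [(100, ["A", "C", "T"]), (200, ["G", "T"])]
out = [r for row in one_per_locus for r in split_row(*row)]
assert out == sorted(out)  # still in key order

# Two rows at the same locus: row-by-row splitting can break global order.
two_at_locus = [(100, ["A", "T"]), (100, ["A", "C", "G"])]
out2 = [r for row in two_at_locus for r in split_row(*row)]
assert out2 != sorted(out2)  # (100, [A, T]) sorts after (100, [A, C])
```

This is only an illustration of the ordering invariant; Hail’s actual implementation works on distributed, sorted partitions.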

`split_multi_hts` assumes the dataset has one row per locus because GATK VCFs follow this convention. It first performs the efficient transformation, then efficiently verifies the ordering of the result. The error message you see comes from that verification.

There is no ordering of your data that would allow `split_multi_hts` to produce a correctly ordered output.

Anyway, you have two options:

- Use `split_multi_hts(..., shuffle=True)` to use a slower, shuffle-based transformation instead.
- Remove the already-split (biallelic) variants from your dataset and only perform splitting on the multiallelic variants.

The second option should be quite efficient; you can do that with something like this:

```
import hail as hl

mt = hl.read_matrix_table(...)
# Separate the multiallelic rows (more than two alleles) from the biallelic rows.
multi = mt.filter_rows(hl.len(mt.alleles) > 2)
bi = mt.filter_rows(hl.len(mt.alleles) == 2)
# Split only the multiallelic rows, then recombine with the biallelic rows.
split = hl.split_multi_hts(multi)
mt = split.union_rows(bi)
```