When running split_multi_hts, I received a key out of order error. Why do I receive that and what should I do?

I get a similar error; I believe it's due to indels present in the data along with multiallelic variants:

Error summary: HailException: RVD error! Keys found out of order:
Current key: { locus: { contig: chr1, position: 156591761 }, alleles: [2; C, CGCTGGG] }
Previous key: { locus: { contig: chr1, position: 156591761 }, alleles: [2; CGCCGGG, C] }

Could you please explain further what “Keys found out of order” means? What order do the keys need to be in? Why doesn’t split_multi_hts re-order keys to the expected order?

Hi @ConnorBarnhill !

Sorry for the delay in a response, I was out on vacation last week.

Hail's performance and correctness rely critically on the ordering of data. All data in a Hail Table or MatrixTable is ordered by the "key". In most genetics datasets, the key is compound, comprising the locus and the alleles.
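To make the ordering concrete, here is a rough plain-Python sketch (not Hail's actual representation) of how compound keys compare. The tuples below mirror the two keys from the error message above:

```python
# Plain-Python sketch of Hail's compound row key ordering: rows are
# sorted by (locus, alleles), with the alleles array compared
# element-wise (lexicographically).

# The two keys from the error message, as (contig, position, alleles) tuples:
previous_key = ("chr1", 156591761, ("CGCCGGG", "C"))
current_key = ("chr1", 156591761, ("C", "CGCTGGG"))

# The loci are equal, so the comparison falls through to the alleles.
# "C" sorts before "CGCCGGG", so current_key should precede previous_key:
assert current_key < previous_key

# Hail encountered them in the opposite order, hence "Keys found out of order".
```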

split_multi_hts assumes that the input dataset has one row per locus. Under that assumption, it transforms the dataset to have one row per variant. When the input has one row per locus, Hail can perform this transformation efficiently; when the input has more than one row per locus, it cannot.

split_multi_hts assumes the dataset has one row per locus because GATK VCFs follow this convention. It first performs the efficient transformation, then cheaply verifies the ordering of the result. The error message you see comes from that verification step.
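As a rough model of this split-then-verify strategy, here is a plain-Python sketch with hypothetical helper names (it ignores the minimal-representation step Hail applies to split alleles, which is what can shift keys in practice):

```python
# Model of "split each row locally, then verify ordering in one pass".

def split_row(key):
    """Split one (contig, pos, alleles) row into biallelic rows.
    Emitting alts in sorted order keeps each row's outputs locally sorted."""
    contig, pos, alleles = key
    ref, alts = alleles[0], alleles[1:]
    return [(contig, pos, (ref, alt)) for alt in sorted(alts)]

def split_and_verify(rows):
    out = []
    prev = None
    for key in rows:
        for new in split_row(key):
            if prev is not None and new < prev:
                raise ValueError(f"Keys found out of order: "
                                 f"current {new}, previous {prev}")
            out.append(new)
            prev = new
    return out

# One row per locus: the cheap pass succeeds.
ok = split_and_verify([("chr1", 100, ("A", "T", "C"))])

# Two rows at the same locus can defeat the cheap pass:
# split_and_verify([("chr1", 100, ("CGCCGGG", "C")),
#                   ("chr1", 100, ("C", "CGCTGGG"))])  # raises ValueError
```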

There is no ordering of your data that would produce a correctly ordered output of split_multi_hts.

Anyway, you have two options:

  1. use split_multi_hts(..., permit_shuffle=True) to fall back to a slower, shuffle-based transformation instead.
  2. remove already-split (biallelic) variants from your dataset and only perform splitting on the multi-allelic variants.

The second operation should be quite efficient; you can do it with something like this:

import hail as hl

mt = hl.read_matrix_table(...)

# Multi-allelic rows still need splitting; biallelic rows are already split.
multi = mt.filter_rows(hl.len(mt.alleles) > 2)
bi = mt.filter_rows(hl.len(mt.alleles) == 2)

split = hl.split_multi_hts(multi)

# split_multi_hts adds the `a_index` and `was_split` row fields, so add
# matching fields to the biallelic rows before taking the union.
bi = bi.annotate_rows(a_index=1, was_split=False)

mt = split.union_rows(bi)

I see, thanks for the detailed response!

Is there a way to make this assumption clearer in the docs? The data I'm working with (UK Biobank WES) is output from GLNexus, which doesn't follow this convention (see Fig. 1 in their paper). I imagine I won't be the only one who encounters this.

Sure! I'll add some verbiage to split_multi and split_multi_hts that clarifies how to use these methods with datasets that contain both split and non-split variants.

We’ve discussed some ideas about a more efficient splitting process for this kind of dataset, but we don’t have a concrete timeline for that.

This doesn't help with an already-produced dataset, but I should note that Hail has a VCF combiner that combines gVCFs into a "sparse" MatrixTable. The gnomAD team used it to generate and analyze their latest 150k WGS dataset.

@ConnorBarnhill, I’d appreciate any further comments you have on the verbiage in this docs change: [hail] note the assumptions made by split_multi by danking · Pull Request #10213 · hail-is/hail · GitHub