Hi!
I’m using Hail 0.2 to annotate a VCF with dbSNP information and I keep getting the following error:
Caused by: is.hail.utils.HailException: RVD error! Keys found out of order:
Current key: [1:17225770,[C,CT]]
Previous key: [1:17225770,[C,T]]
This error can occur after a split_multi if the dataset
contains both multiallelic variants and duplicated loci.
The keys are different every time. The code I’m trying to run is (more or less):
t = hl.split_multi_hts(hl.import_vcf(str(sourcePath), force_bgz=True, min_partitions=nPartitions)) \
    .rows() \
    .key_by("locus", "alleles") \
    .distinct()
dbsnp = hl.split_multi_hts(hl.import_vcf(str(sourcePathDbSNP), force_bgz=True, min_partitions=nPartitions)) \
    .rows() \
    .key_by("locus", "alleles") \
    .distinct()
annotated = t.annotate(rsid=dbsnp[t.locus, t.alleles].rsid[dbsnp[t.locus, t.alleles].a_index - 1])
The error is pretty self-explanatory, and I thought the distinct call would take care of it… but I’m clearly missing something!
Thanks in advance!
The problem is related to the input VCF having a bad property (duplicated loci). If you do the following:

mt = hl.import_vcf(str(sourcePath), force_bgz=True, min_partitions=nPartitions)
mt = mt.key_rows_by('locus').distinct_by_row().key_rows_by('locus', 'alleles')

before splitting mt, that should fix it.
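For completeness, a minimal sketch of that order of operations, reusing the `sourcePath` and `nPartitions` variables from the question (the same pattern would apply to the dbSNP import):

```python
import hail as hl

mt = hl.import_vcf(str(sourcePath), force_bgz=True, min_partitions=nPartitions)

# Keep one record per locus, then restore the full ('locus', 'alleles') row key.
mt = (mt.key_rows_by('locus')
        .distinct_by_row()
        .key_rows_by('locus', 'alleles'))

# With duplicated loci gone, splitting no longer produces out-of-order keys.
t = hl.split_multi_hts(mt).rows()
```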
The thing is, if I key by locus first I might lose some variants of interest. For example, keying by locus first is fine for these:
| CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7430975 | rs763145705 | C | CTTTTT,CTTTTTT,CTTTTTTTTTTTTTTTTTTTTTTT | . | . | RS=763145705;RSPOS=7430975;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000080005000002000204;GENEINFO=CAMTA1:23261;WGT=1;VC=DIV;INT;ASP;NOV |
| 1 | 7430975 | rs201185199 | C | T | . | . | RS=201185199;RSPOS=7430975;dbSNPBuildID=137;SSR=0;SAO=0;VP=0x050100080005000402000100;GENEINFO=CAMTA1:23261;WGT=1;VC=SNV;SLO;INT;ASP;HD |
| 1 | 7430975 | rs61387662 | C | CTTTTTT | . | . | RS=61387662;RSPOS=7430990;dbSNPBuildID=129;SSR=0;SAO=0;VP=0x050100080005000102000200;GENEINFO=CAMTA1:23261;WGT=1;VC=DIV;SLO;INT;ASP;GNO |
Because the distinct will only keep the first record, and when that one is split, a variant equivalent to the third one (C → CTTTTTT) appears again anyway.
However, other variants in the file would be lost by the distinct on locus (a quick way to count such loci is sketched after the table):
| CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 41376705 | rs541416298 | C | CA,CAA | . | . | RS=541416298;RSPOS=41376705;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000005170026000200;WGT=1;VC=DIV;ASP;VLD;G5A;G5;KGPhase3;CAF=0.7442,.,0.2558;COMMON=1 |
| 1 | 41376705 | rs201756891 | C | A | . | . | RS=201756891;RSPOS=41376705;dbSNPBuildID=137;SSR=0;SAO=0;VP=0x050000000005000402000100;WGT=1;VC=SNV;ASP;HD |
| 1 | 41376705 | rs35587769 | CA | C | . | . | RS=35587769;RSPOS=41376706;dbSNPBuildID=126;SSR=0;SAO=0;VP=0x050100000005000102000200;WGT=1;VC=DIV;SLO;ASP;GNO |
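As a sanity check on how much a locus-level distinct would throw away, a rough sketch (variable names like `dbsnp_mt` and `per_locus` are placeholders; `sourcePathDbSNP` and `nPartitions` are the same as above):

```python
# Count loci that carry more than one dbSNP record -- these are the loci where
# key_rows_by('locus').distinct_by_row() would drop rows.
dbsnp_mt = hl.import_vcf(str(sourcePathDbSNP), force_bgz=True, min_partitions=nPartitions)
rows = dbsnp_mt.rows()

per_locus = rows.group_by(rows.locus).aggregate(n_records=hl.agg.count())
print(per_locus.filter(per_locus.n_records > 1).count())
```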
What I also tried is something like this:
t = hl.split_multi_hts(hl.import_vcf(str(sourcePath), force_bgz=True, min_partitions=nPartitions)) \
    .rows() \
    .key_by("locus", "alleles") \
    .distinct()
dbsnp = hl.import_vcf(str(sourcePathDbSNP), force_bgz=True, min_partitions=nPartitions) \
    .rows() \
    .key_by("locus", "alleles") \
    .distinct()
dbsnp = hl.split_multi_hts(dbsnp).distinct()
But that kept giving me the same error. Is there a way to split and then deduplicate the result? I’m using the raw dbSNP build 150 data to annotate, and that’s just the way it comes, so I don’t really know what to do.
Thanks so much again!
Tim can give a better biological foundation for this, but it sounds like you need to filter to the multiallelic sites, run split multi on those, then union_rows with the biallelic sites.
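In code, that suggestion might look roughly like the sketch below, reusing `sourcePathDbSNP` and `nPartitions` from the question (the other variable names are placeholders):

```python
# Sketch of the suggested approach: split only the multiallelic records,
# then recombine them with the biallelic records via union_rows.
dbsnp_mt = hl.import_vcf(str(sourcePathDbSNP), force_bgz=True, min_partitions=nPartitions)

bi = dbsnp_mt.filter_rows(hl.len(dbsnp_mt.alleles) == 2)
multi = dbsnp_mt.filter_rows(hl.len(dbsnp_mt.alleles) > 2)

# Running split_multi_hts on the biallelic half as well keeps the row schemas
# identical (it only adds the a_index / was_split annotations there), so
# union_rows can put the two halves back together.
split_dbsnp = hl.split_multi_hts(multi).union_rows(hl.split_multi_hts(bi))
```

Splitting the biallelic half too is just a convenient way to keep the two schemas compatible for union_rows; it doesn't change those rows otherwise.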
I don’t think I can comment on the biological foundation here, but I do think we could add a mode to split_multi that will accept input like this (at the cost of a longer runtime, including a shuffle).