Hi!
I’m using Hail 0.2 to annotate a VCF with dbSNP information and I keep getting the following error:
Caused by: is.hail.utils.HailException: RVD error! Keys found out of order:
Current key: [1:17225770,[C,CT]]
Previous key: [1:17225770,[C,T]]
This error can occur after a split_multi if the dataset
contains both multiallelic variants and duplicated loci.
The keys are different every time. The code I’m trying to run is (more or less):
t = hl.split_multi_hts(hl.import_vcf(str(sourcePath), force_bgz=True, min_partitions=nPartitions)) \
    .rows() \
    .key_by("locus", "alleles") \
    .distinct()
dbsnp = hl.split_multi_hts(hl.import_vcf(str(sourcePathDbSNP), force_bgz=True, min_partitions=nPartitions)) \
    .rows() \
    .key_by("locus", "alleles") \
    .distinct()
annotated = t.annotate(rsid=dbsnp[t.locus, t.alleles].rsid[dbsnp[t.locus, t.alleles].a_index - 1])
The error is pretty self-explanatory, and I thought the distinct call would take care of it… but I’m clearly missing something!
Thanks in advance!
The problem is related to the input VCF having a bad property (duplicated loci). If you do the following:

mt = hl.import_vcf(str(sourcePath), force_bgz=True, min_partitions=nPartitions)
mt = mt.key_rows_by('locus').distinct_by_row().key_rows_by('locus', 'alleles')

before splitting mt, that should fix it.
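For completeness, a minimal sketch of that order of operations, reusing the `sourcePath` and `nPartitions` variables from the question (the same pattern would apply to the dbSNP import):

```python
import hail as hl

mt = hl.import_vcf(str(sourcePath), force_bgz=True, min_partitions=nPartitions)

# Keep one record per locus, then restore the full ('locus', 'alleles') row key.
mt = (mt.key_rows_by('locus')
        .distinct_by_row()
        .key_rows_by('locus', 'alleles'))

# With duplicated loci gone, splitting no longer produces out-of-order keys.
t = hl.split_multi_hts(mt).rows()
```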
The thing is, if I key by locus first I might lose some variants of interest. For example, keying by locus first is fine for these:
| CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7430975 | rs763145705 | C | CTTTTT,CTTTTTT,CTTTTTTTTTTTTTTTTTTTTTTT | . | . | RS=763145705;RSPOS=7430975;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000080005000002000204;GENEINFO=CAMTA1:23261;WGT=1;VC=DIV;INT;ASP;NOV |
| 1 | 7430975 | rs201185199 | C | T | . | . | RS=201185199;RSPOS=7430975;dbSNPBuildID=137;SSR=0;SAO=0;VP=0x050100080005000402000100;GENEINFO=CAMTA1:23261;WGT=1;VC=SNV;SLO;INT;ASP;HD |
| 1 | 7430975 | rs61387662 | C | CTTTTTT | . | . | RS=61387662;RSPOS=7430990;dbSNPBuildID=129;SSR=0;SAO=0;VP=0x050100080005000102000200;GENEINFO=CAMTA1:23261;WGT=1;VC=DIV;SLO;INT;ASP;GNO |
Because the distinct will only keep the first record, and when that one is split, a variant equivalent to the third one (C → CTTTTTT) appears again anyway.
However, other variants in the file would be lost by the distinct on locus (a quick way to count such loci is sketched after the table):
| CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 41376705 | rs541416298 | C | CA,CAA | . | . | RS=541416298;RSPOS=41376705;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000005170026000200;WGT=1;VC=DIV;ASP;VLD;G5A;G5;KGPhase3;CAF=0.7442,.,0.2558;COMMON=1 |
| 1 | 41376705 | rs201756891 | C | A | . | . | RS=201756891;RSPOS=41376705;dbSNPBuildID=137;SSR=0;SAO=0;VP=0x050000000005000402000100;WGT=1;VC=SNV;ASP;HD |
| 1 | 41376705 | rs35587769 | CA | C | . | . | RS=35587769;RSPOS=41376706;dbSNPBuildID=126;SSR=0;SAO=0;VP=0x050100000005000102000200;WGT=1;VC=DIV;SLO;ASP;GNO |
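As a sanity check on how much a locus-level distinct would throw away, a rough sketch (variable names like `dbsnp_mt` and `per_locus` are placeholders; `sourcePathDbSNP` and `nPartitions` are the same as above):

```python
# Count loci that carry more than one dbSNP record -- these are the loci where
# key_rows_by('locus').distinct_by_row() would drop rows.
dbsnp_mt = hl.import_vcf(str(sourcePathDbSNP), force_bgz=True, min_partitions=nPartitions)
rows = dbsnp_mt.rows()

per_locus = rows.group_by(rows.locus).aggregate(n_records=hl.agg.count())
print(per_locus.filter(per_locus.n_records > 1).count())
```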
What I also tried is something like this:
t = hl.split_multi_hts(hl.import_vcf(str(sourcePath), force_bgz=True, min_partitions=nPartitions)) \
    .rows() \
    .key_by("locus", "alleles") \
    .distinct()
dbsnp = hl.import_vcf(str(sourcePathDbSNP), force_bgz=True, min_partitions=nPartitions) \
    .rows() \
    .key_by("locus", "alleles") \
    .distinct()
dbsnp = hl.split_multi_hts(dbsnp).distinct()
But that kept giving me the same error. Is there a way to split and then deduplicate the result? I’m using the raw dbSNP build 150 data to annotate, and that’s just the way it comes, so I don’t really know what to do.
Thanks so much again!
Tim can give a better biological foundation for this, but it sounds like you need to filter to the multiallelic sites, run split multi on those, then union_rows with the biallelic sites.
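In code, that suggestion might look roughly like the sketch below, reusing `sourcePathDbSNP` and `nPartitions` from the question (the other variable names are placeholders):

```python
# Sketch of the suggested approach: split only the multiallelic records,
# then recombine them with the biallelic records via union_rows.
dbsnp_mt = hl.import_vcf(str(sourcePathDbSNP), force_bgz=True, min_partitions=nPartitions)

bi = dbsnp_mt.filter_rows(hl.len(dbsnp_mt.alleles) == 2)
multi = dbsnp_mt.filter_rows(hl.len(dbsnp_mt.alleles) > 2)

# Running split_multi_hts on the biallelic half as well keeps the row schemas
# identical (it only adds the a_index / was_split annotations there), so
# union_rows can put the two halves back together.
split_dbsnp = hl.split_multi_hts(multi).union_rows(hl.split_multi_hts(bi))
```

Splitting the biallelic half too is just a convenient way to keep the two schemas compatible for union_rows; it doesn't change those rows otherwise.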
I don’t think I can comment on the biological foundation here, but I do think we could add a mode to split_multi that will accept input like this (at the cost of a longer runtime, including a shuffle).