VEP output variant not found in original variants

Hi,
I successfully installed Hail and VEP on all the nodes of a cluster.
I use python 3.6.8, spark 2.4.2, hail 0.2.25, vep from docker image konradjk/vep95_loftee:0.2

I copied vep_data folder containing cache and fasta for homo_sapiens / 95_GRCh38

I created a vep executable as seen on hail docs and scripts I found on Google cloud hail-common

export IMAGE=$IMAGE

sudo touch /vep.c 
sudo chmod 666 /vep.c

sudo cat >/vep.c <<EOF
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char *const argv[]) {
  if (setuid(geteuid()))
    perror( "setuid" );

  execv("/vep.sh", argv);
  return 0;
}
EOF

sudo gcc -Wall -Werror -O2 /vep.c -o /vep
sudo chmod u+s /vep

sudo touch /vep.sh 
sudo chmod 666 /vep.sh
sudo cat >/vep.sh <<EOF
#!/bin/bash

docker run -i -v /opt/vep/vep_data/:/opt/vep/.vep/:ro ${IMAGE} \
  /opt/vep/src/ensembl-vep/vep "\$@"
EOF

sudo chmod +x /vep.sh

I also have a vep_configuration file from the same source as follow:

{"command": [
  "/vep",
  "--format", "vcf",
  "__OUTPUT_FORMAT_FLAG__",
  "--everything",
  "--allele_number",
  "--no_stats",
  "--cache", "--offline",
  "--minimal",
  "--assembly", "GRCh38",
  "--fasta", "/opt/vep/.vep/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
  "--plugin", "LoF,loftee_path:/opt/vep/Plugins/,gerp_bigwig:/opt/vep/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:/opt/vep/.vep/human_ancestor.fa.gz,conservation_file:/opt/vep/.vep/loftee.sql",
  "--dir_plugins", "/opt/vep/Plugins/",
  "-o", "STDOUT"
],
"env": {
   "PERL5LIB": "/vep_data/loftee"
},
"vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"
}

Then I execute the following steps inside hail

# Load variants matrix table
mt_vcf = hl.read_matrix_table("....vcf.mt")
# Add vep fields
mt_vep = hl.vep(mt_vcf, "...vep_config.json")

Here I have the vep fields added to my matric table as expected. When I try to display the vep annotations I get the error as follow

mt_vep.vep.show(5)
Error summary: HailException: VEP output variant chr1:10146:AC: not found in original variants.
VEP output: {"input":"chr1\t10146\t.\tAC\t\t.\t.\tGT","seq_region_name":"chr1","variant_class":"sequence_alteration","allele_string":"AC/","strand":1,"end":10147,"start":10146,"most_severe_consequence":"?","id":".","assembly_name":"GRCh38"}

Here the variant causing error

>>> mt_vep.show(2)
+---------------+----------------+
| locus         | alleles        |
+---------------+----------------+
| locus<GRCh38> | array<str>     |
+---------------+----------------+
| chr1:10146    | ["AC","*"]     |
| chr1:10146    | ["AC","A"]     |
+---------------+----------------+

I tried to filter out variants to keep only snps:

mt_nogap = mt_vep.filter_rows(mt_vep.alleles[1] != '*')
mt_snp = mt_nogap.filter_rows(hl.is_snp(mt_nogap.alleles[0], mt_nogap.alleles[1]))
mt_snp.show(2)
+---------------+------------+
| locus         | alleles    |
+---------------+------------+
| locus<GRCh38> | array<str> |
+---------------+------------+
| chr1:10177    | ["A","C"]  |
| chr1:10180    | ["T","C"]  |
+---------------+------------+

After that I still get the same error about the variant chr1:10146:AC: that should not be in my mt anymore… I am a bit confuse, how to use properly VEP in hail ?

Thanks

Try doing the filter here:

mt_nogap = mt_vep.filter_rows(mt_vep.alleles[1] != '*')

before running VEP at all. It looks like VEP is doing strange things w.r.t. star alleles.

That works !

I load my matrix table, filter out the star alleles, then only call vep.
And I can see my variants annotated.

# Load VCF MatrixTable
mt_vcf = hl.read_matrix_table("...vcf.mt")
# Filter out star alleles
mt_nogap = mt_vcf.filter_rows(mt_vcf.alleles[1] != '*')
# Call VEP
mt_vep = hl.vep(mt_nogap, "...vep_config.json")
# Test vep
mt_vep.vep.show(1)
# SUCCESS

Thanks !