UTRannotator in hail's VEP is not shown in transcript_consequences field

I have added UTRannonator:

to the vep config file and successfully ran vep just in terminal:

/vep/ensembl-tools-release-99/vep -i ncbr_batch2_subset.vcf --format vcf --json --everything --allele_number --no_stats --cache --offline --minimal --verbose --assembly GRCh38 --dir_cache /vep/vep_cache --fasta /vep/homo_sapiens/GRCh38/hg38.fa --plugin UTRannotator,LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0 --dir_plugins /vep/loftee_grch38 -o vep_output

I checked vep_output and it does have some annotations from UTRannotator that are not zero: existing_outofframe_oorfs, existing_inframe_oorfs, existing_uorfs, five_prime_utr_variant_consequence, five_prime_utr_variant_annotation

But then I opened ipython and tried to run vep in hail 0.2:

import hail as hl
mt_old = hl.import_vcf('/datasets-grch38/ncbr_batch2_subset.vcf', reference_genome='GRCh38')
mt = hl.vep(mt_old, config='/vep/GRCh38/vep99-loftee-grch38.json', name='vep', block_size=1000
mt.vep.transcript_consequences.describe()

The last line showed me the structure of the transcript_consequences without the UTRannotator fields:

In [44]: mt.vep.transcript_consequences.describe()                                                                                              
--------------------------------------------------------
Type:
        array<struct {
        allele_num: int32, 
        amino_acids: str, 
        appris: str, 
        biotype: str, 
        canonical: int32, 
        ccds: str, 
        cdna_start: int32, 
        cdna_end: int32, 
        cds_end: int32, 
        cds_start: int32, 
        codons: str, 
        consequence_terms: array<str>, 
        distance: int32, 
        domains: array<struct {
            db: str, 
            name: str
        }>, 
        exon: str, 
        gene_id: str, 
        gene_pheno: int32, 
        gene_symbol: str, 
        gene_symbol_source: str, 
        hgnc_id: str, 
        hgvsc: str, 
        hgvsp: str, 
        hgvs_offset: int32, 
        impact: str, 
        intron: str, 
        lof: str, 
        lof_flags: str, 
        lof_filter: str, 
        lof_info: str, 
        minimised: int32, 
        polyphen_prediction: str, 
        polyphen_score: float64, 
        protein_end: int32, 
        protein_start: int32, 
        protein_id: str, 
        sift_prediction: str, 
        sift_score: float64, 
        strand: int32, 
        swissprot: str, 
        transcript_id: str, 
        trembl: str, 
        tsl: int32, 
        uniparc: str, 
        variant_allele: str
    }>
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fed41b9ec50>
Index:
    ['row']
--------------------------------------------------------

So, now I am wondering why I can’t see it here since in the vep_output the fields show up under the transcript_consequences, e.g.:

..., "variant_class":"SNV","transcript_consequences":[{"existing_outofframe_oorfs":0,"existing_uorfs":0,"variant_allele":"C","cdna_end":126,"swissprot":["Q9NUD7"],"hgvsc":"ENST00000360321.7:c.-15C>G","allele_num":1,"hgnc_id":"HGNC:16227","strand":-1,"gene_symbol":"C20orf96","exon":"1/11","transcript_id":"ENST00000360321","cdna_start":126,"existing_inframe_oorfs":0,"gene_id":"ENSG00000196476","canonical":1,"appris":"P3","protein_id":"ENSP00000353470","mane":"NM_153269.3","uniparc":["UPI00001285F1"],"biotype":"protein_coding", ...

The config file - vep99-loftee-grch38.json is:

"command": [
    "/vep/ensembl-tools-release-99/vep",
    "--format", "vcf",
    "__OUTPUT_FORMAT_FLAG__",
    "--everything",
    "--allele_number",
    "--no_stats",
    "--cache", "--offline",
    "--minimal",
    "--verbose",
    "--assembly", "GRCh38",
    "--dir_cache", "/vep/vep_cache",
    "--plugin", "UTRannotator,LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0",
    "--dir_plugins", "/vep/loftee_grch38",
    "-o", "STDOUT"
],
 "env": {
     "PERL5LIB": "/vep/loftee_grch38:/opt/seqr/perl5/lib/perl5:/vep/bioperl:/vep/bioperl/Bio:/vep/UTRannotator"
 },
...

The permissions of the UTRannotator folder were set up recursively for 755: chmod -R 755 /vep/UTRannotator

Also, I tried to show the fields (needed to filter out first * alleles, see the thread: VEP output variant not found in original variants): mt.vep.transcript_consequences.show(1) and strangely it shows warning messages saying that these UTRannotator fields are not present, e.g.:

2020-10-16 11:30:33 Hail: WARN: struct{allele_num: int32, amino_acids: str, appris: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, tsl: int32, uniparc: str, variant_allele: str} has no field existing_uorfs at <root>.transcript_consequences[element] for value JInt(0)

I added to vep_json_schema parameter of the main vep config vep99-loftee-grch38.json the parameters of the UTR annotation in transcript_consequences and it resulted in correct schema printed out:


transcript_consequences:Array[Struct{existing_inframe_oorfs:Int32,existing_outofframe_oorfs:Int32,existing_uorfs:Int32,five_prime_utr_variant_consequence:String,five_prime_utr_variant_annotation:String

However, when I tried to print out vep annotations for transcript_consequences it showed me zeros/NA for the fields where they should be non-zeros according to the vep_output:

mt.vep.transcript_consequences.show(100)

shows rows which have non-zero values as:

0,0,0,NA,NA

and rows which are all zeros as:

NA,NA,NA,NA,NA

0,0,0,NA,NA should be 0, 2, “…”, “…”. So, the annotation recognizes where there are values but somehow converts everything into zeros and NAs.

So, the main issue now comes down to Hail 0.2 VEP that utilizes UTRannotator. I believe that Hail incorrectly does VEP annotation when the plugin is added.

I’m sorry you’re running into trouble. I’ll try and look into this today.

Have you found particular variants that show up with nonzero values in the vep output, but show up as zero values in hail? Is there a public VCF somewhere I can run this on where I know there are variants that will have annotations from your plugin?

I wanted to post an issue on UTRannotator page and tried to use the solution supplied in the issue that was already posted on their github:

It worked out, I started to see correct numbers and not just zeros. So, it was the problem of UTRannotator not working with vep –cache flag. I needed to unbgzip the reference file that I had:

bgzip -d < human_ancestor.fa.gz > human_ancestor.fa

Then supply it in the vep99-loftee-grch38.json config:

"--fasta", "/vep/loftee_data_grch38/human_ancestor.fa"

And its good now.

Amazing! Glad it wasn’t a hail problem, much easier for me. Glad you’ve got a fix.

Thank you!

Actually it worked in ipython but not as a spark job for some reason. It gives me error code 25 without any additional debug output:

Hail version: 0.2.26-3a68be23cb82

Error summary: HailException: VEP command ‘/vep/ensembl-tools-release-99/vep --format vcf --json --everything --allele_number --no_stats --cache --offline --fasta /vep/loftee_data_grch38/human_ancestor.fa --minimal --verbose --assembly GRCh38 --dir_cache /vep/vep_cache --plugin UTRannotator,LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0 --dir_plugins /vep/loftee_grch38 -o STDOUT’ failed with non-zero exit status 25

It is strictly related to the addition of the –fasta parameter (after removal of it, it runs fine). I changed permissions on the –fasta to 777, so its not a permissions issue.

Three things:

  1. What do you mean it worked in ipython but not as a spark job? ipython is going to kick off a spark job.
  2. If you go back and run ipython again, does it work now? Sometimes VEP throws an error if you try to run it twice since it tries not to overwrite certain output files.
  3. Update your version of hail, you’re 30+ versions behind.

This is almost definitely a VEP issue, not a hail issue.

Yeah, sorry, I mean when I just launch ipython it does work, but when I submit spark job for 3 worker nodes it fails. I just tested it on 1 worker node and it works. Very strange, hard to say why, but I will try updating Hail.