UTRannotator in hail's VEP is not shown in transcript_consequences field

NLSVTN · October 16, 2020, 4:03pm

I have added UTRannonator:

to the vep config file and successfully ran vep just in terminal:

/vep/ensembl-tools-release-99/vep -i ncbr_batch2_subset.vcf --format vcf --json --everything --allele_number --no_stats --cache --offline --minimal --verbose --assembly GRCh38 --dir_cache /vep/vep_cache --fasta /vep/homo_sapiens/GRCh38/hg38.fa --plugin UTRannotator,LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0 --dir_plugins /vep/loftee_grch38 -o vep_output

I checked vep_output and it does have some annotations from UTRannotator that are not zero: existing_outofframe_oorfs, existing_inframe_oorfs, existing_uorfs, five_prime_utr_variant_consequence, five_prime_utr_variant_annotation

But then I opened ipython and tried to run vep in hail 0.2:

import hail as hl
mt_old = hl.import_vcf('/datasets-grch38/ncbr_batch2_subset.vcf', reference_genome='GRCh38')
mt = hl.vep(mt_old, config='/vep/GRCh38/vep99-loftee-grch38.json', name='vep', block_size=1000
mt.vep.transcript_consequences.describe()

The last line showed me the structure of the transcript_consequences without the UTRannotator fields:

In [44]: mt.vep.transcript_consequences.describe()                                                                                              
--------------------------------------------------------
Type:
        array<struct {
        allele_num: int32, 
        amino_acids: str, 
        appris: str, 
        biotype: str, 
        canonical: int32, 
        ccds: str, 
        cdna_start: int32, 
        cdna_end: int32, 
        cds_end: int32, 
        cds_start: int32, 
        codons: str, 
        consequence_terms: array<str>, 
        distance: int32, 
        domains: array<struct {
            db: str, 
            name: str
        }>, 
        exon: str, 
        gene_id: str, 
        gene_pheno: int32, 
        gene_symbol: str, 
        gene_symbol_source: str, 
        hgnc_id: str, 
        hgvsc: str, 
        hgvsp: str, 
        hgvs_offset: int32, 
        impact: str, 
        intron: str, 
        lof: str, 
        lof_flags: str, 
        lof_filter: str, 
        lof_info: str, 
        minimised: int32, 
        polyphen_prediction: str, 
        polyphen_score: float64, 
        protein_end: int32, 
        protein_start: int32, 
        protein_id: str, 
        sift_prediction: str, 
        sift_score: float64, 
        strand: int32, 
        swissprot: str, 
        transcript_id: str, 
        trembl: str, 
        tsl: int32, 
        uniparc: str, 
        variant_allele: str
    }>
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fed41b9ec50>
Index:
    ['row']
--------------------------------------------------------

So, now I am wondering why I can’t see it here since in the vep_output the fields show up under the transcript_consequences, e.g.:

..., "variant_class":"SNV","transcript_consequences":[{"existing_outofframe_oorfs":0,"existing_uorfs":0,"variant_allele":"C","cdna_end":126,"swissprot":["Q9NUD7"],"hgvsc":"ENST00000360321.7:c.-15C>G","allele_num":1,"hgnc_id":"HGNC:16227","strand":-1,"gene_symbol":"C20orf96","exon":"1/11","transcript_id":"ENST00000360321","cdna_start":126,"existing_inframe_oorfs":0,"gene_id":"ENSG00000196476","canonical":1,"appris":"P3","protein_id":"ENSP00000353470","mane":"NM_153269.3","uniparc":["UPI00001285F1"],"biotype":"protein_coding", ...

The config file - vep99-loftee-grch38.json is:

"command": [
    "/vep/ensembl-tools-release-99/vep",
    "--format", "vcf",
    "__OUTPUT_FORMAT_FLAG__",
    "--everything",
    "--allele_number",
    "--no_stats",
    "--cache", "--offline",
    "--minimal",
    "--verbose",
    "--assembly", "GRCh38",
    "--dir_cache", "/vep/vep_cache",
    "--plugin", "UTRannotator,LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0",
    "--dir_plugins", "/vep/loftee_grch38",
    "-o", "STDOUT"
],
 "env": {
     "PERL5LIB": "/vep/loftee_grch38:/opt/seqr/perl5/lib/perl5:/vep/bioperl:/vep/bioperl/Bio:/vep/UTRannotator"
 },
...

The permissions of the UTRannotator folder were set up recursively for 755: chmod -R 755 /vep/UTRannotator

Also, I tried to show the fields (needed to filter out first * alleles, see the thread: VEP output variant not found in original variants): mt.vep.transcript_consequences.show(1) and strangely it shows warning messages saying that these UTRannotator fields are not present, e.g.:

2020-10-16 11:30:33 Hail: WARN: struct{allele_num: int32, amino_acids: str, appris: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, tsl: int32, uniparc: str, variant_allele: str} has no field existing_uorfs at <root>.transcript_consequences[element] for value JInt(0)

NLSVTN · October 19, 2020, 6:38pm

I added to vep_json_schema parameter of the main vep config vep99-loftee-grch38.json the parameters of the UTR annotation in transcript_consequences and it resulted in correct schema printed out:

…
transcript_consequences:Array[Struct{existing_inframe_oorfs:Int32,existing_outofframe_oorfs:Int32,existing_uorfs:Int32,five_prime_utr_variant_consequence:String,five_prime_utr_variant_annotation:String
…

However, when I tried to print out vep annotations for transcript_consequences it showed me zeros/NA for the fields where they should be non-zeros according to the vep_output:

mt.vep.transcript_consequences.show(100)

shows rows which have non-zero values as:

0,0,0,NA,NA

and rows which are all zeros as:

NA,NA,NA,NA,NA

0,0,0,NA,NA should be 0, 2, “…”, “…”. So, the annotation recognizes where there are values but somehow converts everything into zeros and NAs.

NLSVTN · October 19, 2020, 9:48pm

So, the main issue now comes down to Hail 0.2 VEP that utilizes UTRannotator. I believe that Hail incorrectly does VEP annotation when the plugin is added.

johnc1231 · October 21, 2020, 1:48pm

I’m sorry you’re running into trouble. I’ll try and look into this today.

Have you found particular variants that show up with nonzero values in the vep output, but show up as zero values in hail? Is there a public VCF somewhere I can run this on where I know there are variants that will have annotations from your plugin?

NLSVTN · October 21, 2020, 1:54pm

I wanted to post an issue on UTRannotator page and tried to use the solution supplied in the issue that was already posted on their github:

It worked out, I started to see correct numbers and not just zeros. So, it was the problem of UTRannotator not working with vep –cache flag. I needed to unbgzip the reference file that I had:

bgzip -d < human_ancestor.fa.gz > human_ancestor.fa

Then supply it in the vep99-loftee-grch38.json config:

"--fasta", "/vep/loftee_data_grch38/human_ancestor.fa"

And its good now.

johnc1231 · October 21, 2020, 1:56pm

Amazing! Glad it wasn’t a hail problem, much easier for me. Glad you’ve got a fix.

NLSVTN · October 21, 2020, 2:00pm

Thank you!

NLSVTN · October 21, 2020, 3:54pm

Actually it worked in ipython but not as a spark job for some reason. It gives me error code 25 without any additional debug output:

Hail version: 0.2.26-3a68be23cb82

Error summary: HailException: VEP command ‘/vep/ensembl-tools-release-99/vep --format vcf --json --everything --allele_number --no_stats --cache --offline --fasta /vep/loftee_data_grch38/human_ancestor.fa --minimal --verbose --assembly GRCh38 --dir_cache /vep/vep_cache --plugin UTRannotator,LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0 --dir_plugins /vep/loftee_grch38 -o STDOUT’ failed with non-zero exit status 25

It is strictly related to the addition of the –fasta parameter (after removal of it, it runs fine). I changed permissions on the –fasta to 777, so its not a permissions issue.

johnc1231 · October 21, 2020, 3:59pm

Three things:

What do you mean it worked in ipython but not as a spark job? ipython is going to kick off a spark job.
If you go back and run ipython again, does it work now? Sometimes VEP throws an error if you try to run it twice since it tries not to overwrite certain output files.
Update your version of hail, you’re 30+ versions behind.

This is almost definitely a VEP issue, not a hail issue.

NLSVTN · October 21, 2020, 4:42pm

Yeah, sorry, I mean when I just launch ipython it does work, but when I submit spark job for 3 worker nodes it fails. I just tested it on 1 worker node and it works. Very strange, hard to say why, but I will try updating Hail.

Topic		Replies	Views
VEP fails to generate annotations Help [0.1]	17	2152	June 19, 2018
Transcript_consequences value is missing in matrix table Hail Query & hailctl	21	693	June 30, 2021
VEP output variant not found in original variants Hail Query & hailctl	2	968	October 30, 2019
Cant write out vep annotated vcf file Hail Query & hailctl	26	1782	May 27, 2019
VEP for deletions(SV) is not working Hail Query & hailctl	2	582	August 20, 2021

UTRannotator in hail's VEP is not shown in transcript_consequences field

Related topics