I have added UTRannonator:
to the vep config file and successfully ran vep just in terminal:
/vep/ensembl-tools-release-99/vep -i ncbr_batch2_subset.vcf --format vcf --json --everything --allele_number --no_stats --cache --offline --minimal --verbose --assembly GRCh38 --dir_cache /vep/vep_cache --fasta /vep/homo_sapiens/GRCh38/hg38.fa --plugin UTRannotator,LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0 --dir_plugins /vep/loftee_grch38 -o vep_output
I checked vep_output and it does have some annotations from UTRannotator that are not zero: existing_outofframe_oorfs, existing_inframe_oorfs, existing_uorfs, five_prime_utr_variant_consequence, five_prime_utr_variant_annotation
But then I opened ipython and tried to run vep in hail 0.2:
import hail as hl
mt_old = hl.import_vcf('/datasets-grch38/ncbr_batch2_subset.vcf', reference_genome='GRCh38')
mt = hl.vep(mt_old, config='/vep/GRCh38/vep99-loftee-grch38.json', name='vep', block_size=1000
mt.vep.transcript_consequences.describe()
The last line showed me the structure of the transcript_consequences without the UTRannotator fields:
In [44]: mt.vep.transcript_consequences.describe()
--------------------------------------------------------
Type:
array<struct {
allele_num: int32,
amino_acids: str,
appris: str,
biotype: str,
canonical: int32,
ccds: str,
cdna_start: int32,
cdna_end: int32,
cds_end: int32,
cds_start: int32,
codons: str,
consequence_terms: array<str>,
distance: int32,
domains: array<struct {
db: str,
name: str
}>,
exon: str,
gene_id: str,
gene_pheno: int32,
gene_symbol: str,
gene_symbol_source: str,
hgnc_id: str,
hgvsc: str,
hgvsp: str,
hgvs_offset: int32,
impact: str,
intron: str,
lof: str,
lof_flags: str,
lof_filter: str,
lof_info: str,
minimised: int32,
polyphen_prediction: str,
polyphen_score: float64,
protein_end: int32,
protein_start: int32,
protein_id: str,
sift_prediction: str,
sift_score: float64,
strand: int32,
swissprot: str,
transcript_id: str,
trembl: str,
tsl: int32,
uniparc: str,
variant_allele: str
}>
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7fed41b9ec50>
Index:
['row']
--------------------------------------------------------
So, now I am wondering why I can’t see it here since in the vep_output the fields show up under the transcript_consequences, e.g.:
..., "variant_class":"SNV","transcript_consequences":[{"existing_outofframe_oorfs":0,"existing_uorfs":0,"variant_allele":"C","cdna_end":126,"swissprot":["Q9NUD7"],"hgvsc":"ENST00000360321.7:c.-15C>G","allele_num":1,"hgnc_id":"HGNC:16227","strand":-1,"gene_symbol":"C20orf96","exon":"1/11","transcript_id":"ENST00000360321","cdna_start":126,"existing_inframe_oorfs":0,"gene_id":"ENSG00000196476","canonical":1,"appris":"P3","protein_id":"ENSP00000353470","mane":"NM_153269.3","uniparc":["UPI00001285F1"],"biotype":"protein_coding", ...
The config file - vep99-loftee-grch38.json is:
"command": [
"/vep/ensembl-tools-release-99/vep",
"--format", "vcf",
"__OUTPUT_FORMAT_FLAG__",
"--everything",
"--allele_number",
"--no_stats",
"--cache", "--offline",
"--minimal",
"--verbose",
"--assembly", "GRCh38",
"--dir_cache", "/vep/vep_cache",
"--plugin", "UTRannotator,LoF,human_ancestor_fa:/vep/loftee_data_grch38/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/vep/loftee_data_grch38/loftee.sql,loftee_path:/vep/loftee_grch38,gerp_bigwig:/vep/loftee_data_grch38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,run_splice_predictions:0",
"--dir_plugins", "/vep/loftee_grch38",
"-o", "STDOUT"
],
"env": {
"PERL5LIB": "/vep/loftee_grch38:/opt/seqr/perl5/lib/perl5:/vep/bioperl:/vep/bioperl/Bio:/vep/UTRannotator"
},
...
The permissions of the UTRannotator folder were set up recursively for 755: chmod -R 755 /vep/UTRannotator
Also, I tried to show the fields (needed to filter out first * alleles, see the thread: VEP output variant not found in original variants): mt.vep.transcript_consequences.show(1)
and strangely it shows warning messages saying that these UTRannotator fields are not present, e.g.:
2020-10-16 11:30:33 Hail: WARN: struct{allele_num: int32, amino_acids: str, appris: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, tsl: int32, uniparc: str, variant_allele: str} has no field existing_uorfs at <root>.transcript_consequences[element] for value JInt(0)