Parsing VEP output

Andrea asked if I could post how I’ve been parsing the VEP output to get things like gene names and transcript consequences. This is what I’ve been doing (within an annotatevariants command):

if (isDefined(va.vep.transcript_consequences.find(c => c.canonical == 1 && c.biotype == "protein_coding" && isDefined(c.amino_acids) ))) 
va.vep.transcript_consequences.find(c => c.canonical == 1 && c.biotype == "protein_coding" && isDefined(c.amino_acids) ).gene_symbol 
else if (isDefined(va.vep.transcript_consequences.find(c => c.canonical == 1 && c.biotype == "protein_coding"))) 
va.vep.transcript_consequences.find(c => c.canonical == 1 && c.biotype == "protein_coding").gene_symbol 
else if (isDefined(va.vep.transcript_consequences.find(c => c.canonical == 1)))
va.vep.transcript_consequences.find(c => c.canonical == 1).gene_symbol 
else va.vep.transcript_consequences.find(c => c.consequence_terms.toSet.contains(va.vep.most_severe_consequence)).gene_symbol```

It's ugly, but what it says is: I want the information that goes with the canonical transcript that's protein coding and has a defined amino acids field.  If that doesn't exist, then I drop the amino acids field condition.  It it still doesn't exist, then I drop the protein coding condition.  If that still doesn't exist, then I want any canonical transcript.  If there's no canonical, then give me the transcript that goes with the most severe consequence.
1 Like

I do this similarly for things like the transcript consequences, loftee output, polyphen and sift output, and other things.

For reference, here is the schema of the annotations that VEP adds:

Edit: it’s not formatting correctly. Let me work on that.

Here’s the script I use:

annotatevariants expr -c 'va.transcript_canonicals = va.vep.transcript_consequences.filter(tc => tc.canonical == 1)' \
annotatevariants expr -c 'va.most_severe_csq = 
let x = va.transcript_canonicals.map(y => y.consequence_terms.toSet) in 
if (isMissing(x)) va.vep.most_severe_consequence
else if (x.exists(tc => tc.contains("transcript_ablation"))) "transcript_ablation"
else if (x.exists(tc => tc.contains("splice_acceptor_variant"))) "splice_acceptor_variant"
else if (x.exists(tc => tc.contains("splice_donor_variant"))) "splice_donor_variant"
else if (x.exists(tc => tc.contains("stop_gained"))) "stop_gained"
else if (x.exists(tc => tc.contains("frameshift_variant"))) "frameshift_variant"
else if (x.exists(tc => tc.contains("stop_lost"))) "stop_lost"
else if (x.exists(tc => tc.contains("start_lost"))) "start_lost"
else if (x.exists(tc => tc.contains("transcript_amplification"))) "transcript_amplification"
else if (x.exists(tc => tc.contains("inframe_insertion"))) "inframe_insertion"
else if (x.exists(tc => tc.contains("inframe_deletion"))) "inframe_deletion"
else if (x.exists(tc => tc.contains("missense_variant"))) "missense_variant"
else if (x.exists(tc => tc.contains("protein_altering_variant"))) "protein_altering_variant"
else if (x.exists(tc => tc.contains("splice_region_variant"))) "splice_region_variant"
else if (x.exists(tc => tc.contains("incomplete_terminal_codon_variant"))) "incomplete_terminal_codon_variant"
else if (x.exists(tc => tc.contains("stop_retained_variant"))) "stop_retained_variant"
else if (x.exists(tc => tc.contains("synonymous_variant"))) "synonymous_variant"
else if (x.exists(tc => tc.contains("coding_sequence_variant"))) "coding_sequence_variant"
else if (x.exists(tc => tc.contains("mature_miRNA_variant"))) "mature_miRNA_variant"
else if (x.exists(tc => tc.contains("5_prime_UTR_variant"))) "5_prime_UTR_variant" 
else if (x.exists(tc => tc.contains("3_prime_UTR_variant"))) "3_prime_UTR_variant"
else if (x.exists(tc => tc.contains("non_coding_transcript_exon_variant"))) "non_coding_transcript_exon_variant"
else if (x.exists(tc => tc.contains("intron_variant"))) "intron_variant"
else if (x.exists(tc => tc.contains("NMD_transcript_variant"))) "NMD_transcript_variant"
else if (x.exists(tc => tc.contains("non_coding_transcript_variant"))) "non_coding_transcript_variant"
else if (x.exists(tc => tc.contains("upstream_gene_variant"))) "upstream_gene_variant"
else if (x.exists(tc => tc.contains("downstream_gene_variant"))) "downstream_gene_variant"
else if (x.exists(tc => tc.contains("TFBS_ablation"))) "TFBS_ablation"
else if (x.exists(tc => tc.contains("TFBS_amplification"))) "TFBS_amplification"
else if (x.exists(tc => tc.contains("TF_binding_site_variant"))) "TF_binding_site_variant"
else if (x.exists(tc => tc.contains("regulatory_region_ablation"))) "regulatory_region_ablation"
else if (x.exists(tc => tc.contains("regulatory_region_amplification"))) "regulatory_region_amplification"
else if (x.exists(tc => tc.contains("feature_elongation"))) "feature_elongation"
else if (x.exists(tc => tc.contains("regulatory_region_variant"))) "regulatory_region_variant"
else if (x.exists(tc => tc.contains("feature_truncation"))) "feature_truncation"
else if (x.exists(tc => tc.contains("intergenic_variant"))) "intergenic_variant"
else va.vep.most_severe_consequence' \
annotatevariants expr -c 'va.gene = 
if (va.transcript_canonicals.exists(tc => tc.consequence_terms.toSet.contains(va.most_severe_csq))) (va.transcript_canonicals.find(tc => tc.canonical == 1 && tc.consequence_terms.toSet.contains(va.most_severe_csq)).gene_symbol)
else (va.vep.transcript_consequences.find(tc => tc.consequence_terms.toSet.contains(va.most_severe_csq)).gene_symbol)' \
annotatevariants expr -c 'va.LOFTEElof = 
if (va.transcript_canonicals.exists(tc => tc.consequence_terms.toSet.contains(va.most_severe_csq))) (va.transcript_canonicals.find(tc => tc.canonical == 1 && tc.consequence_terms.toSet.contains(va.most_severe_csq)).lof)
else (va.vep.transcript_consequences.find(tc => tc.consequence_terms.toSet.contains(va.most_severe_csq)).lof)' \
annotatevariants expr -c 'va.LOFTEElof_flags = 
if (va.transcript_canonicals.exists(tc => tc.consequence_terms.toSet.contains(va.most_severe_csq))) (va.transcript_canonicals.find(tc => tc.canonical == 1 && tc.consequence_terms.toSet.contains(va.most_severe_csq)).lof)
else (va.vep.transcript_consequences.find(tc => tc.consequence_terms.toSet.contains(va.most_severe_csq)).lof_flags)' \
annotatevariants vds -i file:///mnt/lustre/mzekavat/dbNSFP/dbNSFP_3.2a_variant.filtered.allhg19_nodup.vds \
-r va.dbNSFP \
annotatevariants expr -c 'va.of9 = (if ("D" ~ va.dbNSFP.SIFT_pred) 1 else 0) + (if ("D" ~ va.dbNSFP.PROVEAN_pred) 1 else 0) + (if ("D" ~ va.dbNSFP.Polyphen2_HDIV_pred) 1 else 0) + (if ("D" ~ va.dbNSFP.Polyphen2_HVAR_pred) 1 else 0) + (if ("D" ~ va.dbNSFP.LRT_pred) 1 else 0) + (if ("H" ~ va.dbNSFP.MutationAssessor_pred || "M" ~ va.dbNSFP.MutationAssessor_pred) 1 else 0) + (if ("D" ~ va.dbNSFP.MutationTaster_pred) 1 else 0) + (if ("D" ~ va.dbNSFP.FATHMM_pred) 1 else 0) + (if (va.dbNSFP.VEST3_score.split(";").map(v => v.toDouble).max >=0.75) 1 else 0)' \
annotatevariants expr -c 'va.LOF = (if (va.most_severe_csq == "frameshift_variant" || va.most_severe_csq == "transcript_ablation" || va.most_severe_csq == "splice_acceptor_variant" || va.most_severe_csq == "splice_donor_variant" || va.most_severe_csq == "stop_gained" || va.most_severe_csq == "start_lost") 1 else 0)' \
annotatevariants expr -c 'va.LOF_9of9 = (if (va.LOF == 1 || va.of9 == 9) 1 else 0), 
va.LOF_8of9 = (if (va.LOF == 1 || va.of9 >= 8) 1 else 0), 
va.LOF_7of9 = (if (va.LOF == 1 || va.of9 >= 7) 1 else 0),
va.LOF_MetaSVM = (if (va.LOF == 1 || ("D" ~ va.dbNSFP.MetaSVM_pred)) 1 else 0), 
va.Synonymous = (if (va.most_severe_csq == "synonymous_variant") 1 else 0), 
va.NonSynonymous = (if (va.most_severe_csq == "missense_variant" || va.most_severe_csq == "stop_gained" || va.most_severe_csq == "stop_lost" || va.most_severe_csq == "start_lost") 1 else 0)' \

I love it! This is the way to go.
You can simplify as follow:

("H" ~ va.dbNSFP.MutationAssessor_pred || "M" ~ va.dbNSFP.MutationAssessor_pred)
as
"[HM]" ~ va.dbNSFP.MutationAssessor_pred

It’s essentially pulling out the most severe, canonical consequence/gene

I like how detailed yours is. Mine would probably annotate consequence similarly, since it looks for coding consequences, and those are considered more severe than non-coding. I have the potential to pull out multiple terms (like missense_variant and splice_region_variant) together if the transcript I choose is annotated that way.