Running VEP in Hail with a specific version of GENCODE

Hi all,

I’m looking to annotate variant effects using VEP with a specific version of GENCODE. I’m successfully ran VEP with hailctl dataproc submit <dataproc name> <hail vep script> …, but I’m unsure how to go about including GTF and GFF files for a specific version of GENCODE. I would need to specify the GTF and GFF file paths in the VEP config file, but I’m not sure how to localize these files to the dataproc cluster VMs.

Is there a simple way to accomplish this? Any help would be appreciated. Thanks!

Hey @yakiyama !

It sounds like you want to annotate your matrix table with a GTF or GFF file. Do I understand that correctly?

You don’t need to use VEP to do that, Hail can natively annotate with any text file. If you have a GTF or GFF file from GENCODE, you can upload them to GCS like this:

gsutil cp /path/to/your/file.gtf gs://your_bucket/file.gtf

Then you can write a Hail script (which you can hailctl dataproc submit) to annotate with this file:

reference_name =
ht = hl.experimental.import_gtf(
mt = mt.annotate_rows(gencode = ht[])

Hi @danking ,

Thank you so much for your detailed response! I’ll try this out and let you know how it goes. Thanks again!

Hi @danking Thanks again for the help. It looks like export_vcf is having a hard time writing this information to a VCF file. I tried
vcf = vcf.annotate_rows(gencode=gtf_file[])
which gives the warning
Hail: WARN: export_vcf: ignored the following fields: 'gencode' (row)

vcf = vcf.annotate_rows(info = = gtf_file[])),

which shows HailException: VCF does not support SSubsetStruct(SSubsetStruct(SBaseStructPointer(PCStruct{source:PCString,feature:PCString,score:PFloat64,strand:PCString,frame:PInt32,tag:PCString,level:PCString,gene_id:PCString,gene_type:PCString,ccdsid:PCString,exon_id:PCString,exon_number:PCString,havana_gene:PCString,transcript_type:PCString,protein_id:PCString,gene_name:PCString,transcript_name:PCString,transcript_id:PCString,transcript_support_level:PCString,hgnc_id:PCString,ont:PCString,havana_transcript:PCString,interval:PCInterval[PCLocus(GRCh38)],__partNum:+PInt32}),Vector(source, feature, score, strand, frame, tag, level, gene_id, gene_type, ccdsid, exon_id, exon_number, havana_gene, transcript_type, protein_id, gene_name, transcript_name, transcript_id, transcript_support_level, hgnc_id, ont, havana_transcript, interval)),Vector(source, feature, score, strand, frame, tag, level, gene_id, gene_type, ccdsid, exon_id, exon_number, havana_gene, transcript_type, protein_id, gene_name, transcript_name, transcript_id, transcript_support_level, hgnc_id, ont, havana_transcript)).

Do you have any suggestions for how to circumvent this issue? Thanks again for all of your help.

The VCF format has limited support for complex data types. That said, this is an awful error message. Hail can and should verify the types of the info field and provide you with a meaningful error message and suggested fix. See the information at hl.export_vcf about the INFO field. You probably need to use .flatten.

Hi @danking ,

Thanks again for all of your help and suggestions. .flatten() seemed to work! Unfortunately, I wasn’t super clear in my initial question. I’m running VEP directly out of the box in Hail by running hail.vep(vcf, args.vep_config, csq=True) (I believe this is running VEP 95). It seems like this uses an older version of Gencode than Gencode v34, which is what the rest of my data corresponds to.

In order to get the appropriate annotations for non-synonymous variants (missense, indels, etc.), I’d like for these annotations (amino acid change/position) to correspond to Gencode v34 annotations. For example, in my current results, I see that the variant chr1_2512975_G_A_b38 is annotated with PANK4|Q9NVE7|p.Ala555Val|555|missense_variant|ENSP00000367727 (reformatted), but with Gencode v34, it would show p.Ala547Val. In order to do this, it looks like I need to use Ensembl v100 (and thus VEP v100 rather than v95).

What would be the best way to accomplish this using Hail?


Adding a new version of VEP to Hail is rather complicated. It involves creating a new docker image that has VEP correctly installed (a feat in itself) and it has to deal with the fact that VEP silently deletes star alleles.

Can you use an old version of gencode instead?

Thanks for the additional information. Since I’m trying to map to molecular phenotype data from another group, it’s not possible to use an older version of gencode. I see that the docker image is specified in along with necessary reference (vep cache) files. Would replacing these with an appropriate docker image and reference files be sufficient? I see that ensembl dockerizes all of their releases, but I presume maybe there are issues with them that I am not familiar with (Docker).

If this is sufficient, what is the best way to change the cluster initialization to use this alternative script?