Sample-wise VEP annotation of rare variant disease (exome) VCF files through Hail with my custom databases

My name is Krishna, and I am currently working as a project student in the field of Clinical Genetics, specifically the rare variant disease study of patient exome and whole genome samples.

I have successfully generated VCF files for each patient sample following the GATK best practices workflow, which includes adapter marking, alignment, duplicate marking, base quality score recalibration (BQSR), HaplotypeCaller, and hard filtering.

I am able to annotate these VCF files individually using the VEP Docker container with the custom databases I have available, such as gnomAD exomes, gnomAD genomes, COSMIC mutations, dbSNP, CADD, and others. I have these databases for both the GRCh37 (cache v106) and GRCh38 (cache v106) reference genomes.

Now I am interested in annotating these VCF files using the VEP docker container (Variant Effect Predictor) through Hail for scalability purposes.

Although I have reviewed the documentation on using VEP with Hail, I found it difficult to understand and it left me confused. I kindly request your assistance in clarifying the process and guiding me through it.

Thank you in advance for your help.

Using VEP through Hail is not easy (or, really, scalable) unless you’re using a cloud Spark cluster like Google Cloud Dataproc. Do you have access to that?

Thank you for your response, @danking.

Currently, I do not have access to a cloud-based Spark cluster. However, I can obtain access to a basic cluster to set up Hail VEP for initial testing purposes. Later on, I plan to scale up to a larger cluster.

I would appreciate your guidance on setting up Hail VEP and on the most suitable cloud service, in terms of pricing and ease of access, for my use case.

Thanks in advance.

I recommend Google Cloud Dataproc. See Hail | Google Cloud Platform

Thank you for your response @danking

I have carefully reviewed the documentation, but as I am not proficient in programming, I find myself facing several difficulties regarding Hail VEP. It would be immensely helpful if you could provide me with a tutorial that specifically addresses the annotation of monogenic rare disease study sample VCF files using Hail VEP.

Alternatively, I would greatly appreciate it if you could arrange a brief paid live session to guide me through the process of using Hail VEP.

Thank you in advance for your kind assistance.

I’m sorry, we do not have the capacity to provide this kind of support. We’re just a small team at a non-profit. Hail is fundamentally a programming language library. You might be better served by a company like DNAnexus, BC Platforms, or Seven Bridges.

@danking Thank you for the response.

I will try reaching out to them.

Thanks again!

Hello @danking,

I hope this message finds you well. I wanted to reach out regarding an issue I encountered while using Hail VEP on my local server.

Issue: I have successfully run Hail VEP with the specified parameters and custom database files. However, the output I am receiving is in JSON format, and the annotations from my custom databases do not appear in it.

Here are the commands I used:

import hail as hl
hl.init()

# Map "chr"-prefixed contig names in the VCF to the GRCh37 names (1-22, X, Y, MT).
contig_recoding = {**{f'chr{x}': str(x) for x in range(1, 23)},
                   'chrX': 'X',
                   'chrY': 'Y',
                   'chrM': 'MT'}
mt = hl.import_vcf("/path-to-vcf/sample.vcf",
                   reference_genome="GRCh37",
                   contig_recoding=contig_recoding)
mt = hl.vep(mt, "/path-to-json/vep_config_GRCh37_JSON.json")

For your reference, I have attached the JSON file I used. (Please note that I had to rename the file extension from .json to .txt due to an error during uploading).

vep_config_GRCh37_JSON.txt (5.0 KB)

The current output is a JSON file with three entities: “locus,” “alleles,” and “vep.”

However, I would like the output to be in the form of a tab-delimited file. When I specify the --tab parameter in the VEP command, I receive an error stating that it cannot be used for both JSON and tabular output simultaneously.

I would greatly appreciate your assistance in resolving this issue.

Thank you in advance for your help.

Hail can only read the JSON output of VEP. It doesn’t support the TSV output.

If you want to produce a TSV from your VEP annotations, you can try this:

ht = mt.rows()
ht = ht.select('vep')
ht = ht.flatten()
ht.export("foo.tsv")

However, note that this will still have JSON output in some columns. VEP’s output is necessarily structured data: there are a variable number of transcripts at each site and each transcript has several pieces of metadata.
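To make the "structured data" point concrete, here is a minimal plain-Python sketch (no Hail involved) of one way to get a fully flat TSV: write one row per transcript rather than one row per variant. The record below is invented for illustration. The field names (transcript_consequences, gene_symbol, transcript_id, consequence_terms, impact) follow VEP's JSON output, but the helper names explode_transcripts and to_tsv, the column choice, and the comma used to join consequence terms are all assumptions of this sketch, not anything Hail or VEP prescribes.

```python
import csv
import io

# Hypothetical VEP-style record: field names mirror VEP's JSON output,
# but every value here is made up for illustration.
record = {
    "locus": "1:12345",
    "alleles": ["A", "G"],
    "vep": {
        "transcript_consequences": [
            {"gene_symbol": "GENE1", "transcript_id": "ENST0001",
             "consequence_terms": ["missense_variant"],
             "impact": "MODERATE"},
            {"gene_symbol": "GENE1", "transcript_id": "ENST0002",
             "consequence_terms": ["intron_variant",
                                   "non_coding_transcript_variant"],
             "impact": "MODIFIER"},
        ]
    },
}

# Columns chosen for this sketch only.
COLUMNS = ["locus", "ref", "alt", "gene_symbol",
           "transcript_id", "consequence_terms", "impact"]


def explode_transcripts(rec):
    """Yield one flat dict per transcript consequence of a variant."""
    ref, alt = rec["alleles"][0], rec["alleles"][1]
    # The transcript list may be missing or None for some variants.
    for tc in rec["vep"].get("transcript_consequences") or []:
        yield {
            "locus": rec["locus"],
            "ref": ref,
            "alt": alt,
            "gene_symbol": tc.get("gene_symbol", ""),
            "transcript_id": tc.get("transcript_id", ""),
            # A transcript can carry several terms; join them into one cell.
            "consequence_terms": ",".join(tc.get("consequence_terms", [])),
            "impact": tc.get("impact", ""),
        }


def to_tsv(records):
    """Render variant records as tab-delimited text, one row per transcript."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for rec in records:
        for row in explode_transcripts(rec):
            writer.writerow(row)
    return buf.getvalue()


tsv_text = to_tsv([record])
print(tsv_text)
```

The trade-off is that a variant with N transcripts becomes N rows, and a variant with no transcript consequences produces no rows at all in this sketch, so the output is a per-transcript table rather than a per-variant one.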

If you just want to execute VEP to produce a TSV from your VCF, then I don’t think Hail is a good tool for you. You should install VEP and execute it directly.

@danking Thank you so much for your guidance and support.