Sample-wise VEP annotation for Rare Variant Disease (Exome) VCF files through Hail with my custom databases

krishnajandhyala · June 8, 2023, 5:37am

My name is Krishna, and I am currently working as a project student in the field of Clinical Genetics, specifically the rare variant disease study of patient exome and whole genome samples.

I have successfully generated VCF files for each patient sample following the GATK best practices workflow, which includes adapter marking, alignment, duplicate marking, base quality score recalibration (BQSR), HaplotypeCaller, and hard filtering.

I am able to annotate these VCF files individually using the VEP docker container with my custom databases, such as gnomAD exomes, gnomAD genomes, COSMIC mutations, dbSNP, CADD, and others. These databases are available for both the GRCh37 (cache v106) and GRCh38 (cache v106) reference genomes.

Now I am interested in annotating these VCF files using the VEP docker container (Variant Effect Predictor) through Hail for scalability purposes.

Although I have reviewed the documentation on using VEP with Hail, I found it difficult to understand and it left me confused. I kindly request your assistance in clarifying the process and guiding me through it.

Thank you in advance for your help.

danking · June 8, 2023, 12:51pm

Using VEP through Hail is not easy (or, really, scalable) unless you’re using a cloud Spark cluster like Google Cloud Dataproc. Do you have access to that?

krishnajandhyala · June 9, 2023, 3:09am

Thank you for your response, @danking.

Currently, I do not have access to a cloud-based Spark cluster. However, I can obtain access to a basic cluster to set up Hail VEP for initial testing purposes. Later on, I plan to scale up to a larger cluster.

I would appreciate your guidance regarding the HAIL VEP setup and the most suitable cloud service in terms of pricing and ease of access for my specific use case scenario.

Thanks in advance.

danking · June 9, 2023, 12:29pm

I recommend Google Cloud Dataproc. See Hail | Google Cloud Platform

krishnajandhyala · June 10, 2023, 6:11am

Thank you for your response @danking

I have carefully reviewed the documentation, but as I am not proficient in programming, I find myself facing several difficulties regarding Hail VEP. It would be immensely helpful if you could provide me with a tutorial that specifically addresses the annotation of monogenic rare disease study sample VCF files using Hail VEP.

Alternatively, I would greatly appreciate it if you could arrange a brief paid live session to guide me through the process of using Hail VEP.

Thank you in advance for your kind assistance.

danking · June 10, 2023, 5:47pm

I’m sorry, we do not have the capacity to provide this kind of support. We’re just a small team at a non-profit. Hail is fundamentally a programming language library. You might be better served by a company like DNAnexus, BC Platforms, or Seven Bridges.

krishnajandhyala · June 11, 2023, 12:51am

@danking Thank you for the response.

I will try reaching out to them.

Thanks again!

krishnajandhyala · June 12, 2023, 9:48am

Hello @danking,

I hope this message finds you well. I wanted to reach out regarding an issue I encountered while using HAIL VEP on my local server.

Issue: I have successfully run HAIL VEP with the specified parameters and custom database files. However, the output I am receiving is in JSON format, and as a result, the custom databases are not being displayed.

Here are the commands I used:

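import hail as hl
hl.init()

# Import the VCF against GRCh37, renaming chr-prefixed contigs (chr1 -> 1, chrM -> MT).
mt = hl.import_vcf(
    "/path-to-vcf/sample.vcf",
    reference_genome="GRCh37",
    contig_recoding={
        **{f"chr{x}": str(x) for x in range(1, 23)},
        "chrX": "X",
        "chrY": "Y",
        "chrM": "MT",
    },
)
# Annotate with VEP using the JSON configuration file.
mt = hl.vep(mt, "/path-to-json/vep_config_GRCh37_JSON.json")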
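
For readers following along: the contig_recoding argument is an ordinary Python dict that maps chr-prefixed contig names to the unprefixed names GRCh37 uses. A minimal sketch in plain Python (no Hail required) that builds the same mapping so it can be inspected before importing:

```python
# Build the mapping passed as contig_recoding: "chr1" -> "1", ..., "chrM" -> "MT".
contig_recoding = {
    **{f"chr{x}": str(x) for x in range(1, 23)},  # autosomes 1-22
    "chrX": "X",
    "chrY": "Y",
    "chrM": "MT",  # GRCh37 names the mitochondrial contig "MT"
}

# Print each mapping for a quick visual check.
for old_name, new_name in contig_recoding.items():
    print(old_name, "->", new_name)
```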

For your reference, I have attached the JSON file I used. (Please note that I had to rename the file extension from .json to .txt due to an error during uploading).

vep_config_GRCh37_JSON.txt (5.0 KB)

The current output is a JSON file with three entities: “locus,” “alleles,” and “vep.”

However, I would like the output to be in the form of a tab-delimited file. When I specify the --tab parameter in the VEP command, I receive an error stating that it cannot be used for both JSON and tabular output simultaneously.

I would greatly appreciate your assistance in resolving this issue.

Thank you in advance for your help.

danking · June 12, 2023, 1:59pm

Hail can only read the JSON output of VEP. It doesn’t support the TSV output.

If you want to produce a TSV from your vep annotations you can try this:

ht = mt.rows()
ht = ht.select('vep')
ht = ht.flatten()
ht.export("foo.tsv")

However, note that this will still have JSON output in some columns. VEP’s output is necessarily structured data: there are a variable number of transcripts at each site and each transcript has several pieces of metadata.
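
To illustrate why some columns stay nested, and one way to unpack them, here is a minimal sketch in plain Python that writes one tab-delimited row per transcript. The record below is hypothetical: its values are made up, and its shape is only assumed to resemble one exported row. The field names transcript_consequences, transcript_id, gene_symbol, and consequence_terms follow VEP's JSON output, but which fields are actually present depends on your VEP configuration.

```python
import csv
import io

# Hypothetical record (illustrative values only), shaped like one annotated variant.
record = {
    "locus": "1:12345",
    "alleles": ["A", "G"],
    "vep": {
        "transcript_consequences": [
            {"transcript_id": "ENST_EXAMPLE_1", "gene_symbol": "GENE1",
             "consequence_terms": ["missense_variant"]},
            {"transcript_id": "ENST_EXAMPLE_2", "gene_symbol": "GENE1",
             "consequence_terms": ["intron_variant",
                                   "non_coding_transcript_variant"]},
        ]
    },
}


def transcript_rows(rec):
    """Yield one flat row per transcript consequence of a variant."""
    for tc in rec["vep"].get("transcript_consequences") or []:
        yield [
            rec["locus"],
            "/".join(rec["alleles"]),
            tc["transcript_id"],
            tc["gene_symbol"],
            ",".join(tc["consequence_terms"]),  # join the list into one cell
        ]


# Write the rows as TSV into an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["locus", "alleles", "transcript_id", "gene_symbol", "consequences"])
writer.writerows(transcript_rows(record))
tsv_text = buf.getvalue()
print(tsv_text)
```

The variant here expands to two rows, one per transcript. Within Hail itself, Table.explode on the transcript list is the analogous one-row-per-transcript step before exporting.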

If you just want to execute VEP to produce a TSV from your VCF, then I don’t think Hail is a good tool for you. You should install VEP and execute it directly.

krishnajandhyala · June 12, 2023, 4:01pm

@danking Thank you so much for your guidance and support.
