Import existing VEP annotations from vcf or CSQ

johnnyr · October 9, 2019, 12:31pm

Hello, I am pretty new to hail, so please excuse if I might ask some basic questions.
I would like to use existing CSQ annotations from a vcf which have been created before importing it to hail. Is there some script to smoothly extract the CSQ information in a row field, as hl.vep would do?

tpoterba · October 9, 2019, 12:32pm

I don’t know of one. Dealing with the VCF CSQ field is a nightmare.

It’s probably worth it to re-vep in Hail, if you’re running on the cloud.

johnnyr · October 9, 2019, 12:44pm

Thanks for the instant reply. Okay, will keep trying. So far I had some issues running it. It starts to create the respective row fields, but then it stops before adding any annotation. The last line of the log is:

2019-10-09 14:23:26 root: INFO: is/hail/codegen/generated/C_etypeEncode_78.ENCODE_o_tuple_of_END_TO_o_struct_of_END_1 instruction count: 7

any take on this ?

tpoterba · October 9, 2019, 12:46pm

what is your pipeline?

johnnyr · October 9, 2019, 3:35pm

Just figured out that customized VEP runs (local) are probably an issue. Probably due to configuration of the vep_json_schema as I suppose. Maybe a solution is to import annotation (–tab output from VEP) as extra table and use the variant identifier like chrom_pos_ref_alt for merging.

danking · October 10, 2019, 5:30pm

Are you using Hail on GCP Dataproc? The hailctl utility handles all VEP configuration for you.

johnnyr · October 18, 2019, 2:43pm

No, I want to (have to) use it locally.
I’ve now imported output from local vep as table along with an import of the vcf.
The tab output from vep only provides variant annotation but no genotypes, so this is why I have to also import the vcf.

hl.import_vcf(‘VEP.vcf’,reference_genome=‘hg38’).write(‘data/test.mt’, overwrite=True)
mt = hl.read_matrix_table(‘data/test.mt’)

mtanno=hl.import_table(‘VEP.txt’)
mtanno = mtanno.key_by(‘Uploaded_variation’)

Now I need to create a shared identifier from “mt” to merge with “mtanno” key, which looks like this:

“1_100_C/G”
“1_1000_AA/-”

Any suggestions on how to do this ? Thanx !

danking · October 18, 2019, 2:48pm

That doesn’t look like a min-repped, left aligned variant, which will create problems for joining against your dataset, if your dataset is min-repped and left aligned.

What does mt.filter_rows(mt.locus == hl.locus("1", 1000)).alleles.show() print?

Are you sure the VEP output doesn’t have a standard variant format like 1:100:A:T? We have parsers for that. If not, you’ll need to hack something yourself with split.

toandd · November 19, 2019, 2:53am

Same my problem. Did you find out some solutions @johnnyr?

johnnyr · November 20, 2019, 2:14pm

Unfortunately I had no time to work on it recently, but will look at it again. Will update you once I have found a solution. If you find a good one meanwhile, let me know.

johnnyr · November 27, 2019, 8:35am

Okay, here is what I can come up with so far.
When performing GATK “LeftAlignAndTrimVariants” I include the option “–split-multi-allelics” to get one line each per alternate allele of a variant.
Next, I have annotated a vcf using ensemble-VEP including “–pick” flag to restrict annotations to only the major consequence. Of course this is a critical point and one might consider alternative prefiltering.
I run VEP twice to get a vcf as output “–vcf” and with “–tab” (.txt) option again, to also get VEP annotations in a tabular format, inluding a column named “Uploaded_variation”. This column serves as key for annotation in hail.
To get the very same key in hail for the vcf, I used a workaround in R with VariantAnnotation tool, to add this key as a seperate entry in the INFO field of the vcf.

#!/usr/bin/env Rscript
library(VariantAnnotation)
samples<-c(“ID1”,“ID2”)
param<-ScanVcfParam(geno=c(“GT”), info=c(“CSQ”), samples = samples)
vcf ← readVcf(“path to VEP annotated vcf”, “GRCh38”, param)
VARID ← DataFrame(Number=1, Type=“String”,
Description=“VARID for annotation in hail”,
row.names=“VARID”)
info(header(vcf)) ← rbind(info(header(vcf)), VARID)
info(vcf)$VARID<-gsub(“\:”,“_”,as.matrix(rownames(vcf)))
writeVcf(vcf,“path to VEP annotated vcf with VARID annotation in info field”,index=F)

Next, I import both into hail:

hl.import_vcf(‘path to VEP annotated vcf with VARID annotation in info field’,reference_genome=‘hg38’).write(‘data/test.mt’, overwrite=True)
mt = hl.read_matrix_table(‘data/test.mt’)
mt = mt.key_rows_by(mt.info.VARID)
mtanno=hl.import_table(‘VEP.txt’)
mtanno = mtanno.key_by(‘Uploaded_variation’)

and finally, I am able to annotate the matrix table in hail with my annotation table:

mt = mt.annotate_rows(VEP = mtanno[mt.info.VARID])

When performing this for the full “tab” output from VEP, I had some problems, probably due to incorrect field types. So I have now tested only with one annotation column (String), which works nicely.
I haven’t performed in depth quality controls if all annotations got right, so treat with care.

Probably there are more elegant solutions in hail, which I would love to test, also.
So any comments are welcome.

Topic		Replies	Views
Vep runs successfully, but doesn't output CSQ even with csq=True Help [0.1]	2	799	February 15, 2017
Export_vcf with VEP Hail Query & hailctl	8	1075	May 12, 2021
How to parse CSQ (VEP) field inside Hail 0.2? Hail Query & hailctl	17	2005	February 14, 2023
Best strategy for annotating and filtering VCF files using HAIL-VEP on UKB RAP? Hail Query & hailctl	3	1483	June 17, 2022
Running VEP in Hail with a specific version of GENCODE Hail Query & hailctl	7	470	March 31, 2023

Import existing VEP annotations from vcf or CSQ

Related topics