Import existing VEP annotations from vcf or CSQ

Hello, I am pretty new to hail, so please excuse if I might ask some basic questions.
I would like to use existing CSQ annotations from a vcf which have been created before importing it to hail. Is there some script to smoothly extract the CSQ information in a row field, as hl.vep would do?

I don’t know of one. Dealing with the VCF CSQ field is a nightmare.

It’s probably worth it to re-vep in Hail, if you’re running on the cloud.

Thanks for the instant reply. Okay, will keep trying. So far I had some issues running it. It starts to create the respective row fields, but then it stops before adding any annotation. The last line of the log is:

2019-10-09 14:23:26 root: INFO: is/hail/codegen/generated/C_etypeEncode_78.ENCODE_o_tuple_of_END_TO_o_struct_of_END_1 instruction count: 7

any take on this ?

what is your pipeline?

Just figured out that customized VEP runs (local) are probably an issue. Probably due to configuration of the vep_json_schema as I suppose. Maybe a solution is to import annotation (–tab output from VEP) as extra table and use the variant identifier like chrom_pos_ref_alt for merging.

Are you using Hail on GCP Dataproc? The hailctl utility handles all VEP configuration for you.

No, I want to (have to) use it locally.
I’ve now imported output from local vep as table along with an import of the vcf.
The tab output from vep only provides variant annotation but no genotypes, so this is why I have to also import the vcf.

hl.import_vcf(‘VEP.vcf’,reference_genome=‘hg38’).write(‘data/test.mt’, overwrite=True)
mt = hl.read_matrix_table(‘data/test.mt’)

mtanno=hl.import_table(‘VEP.txt’)
mtanno = mtanno.key_by(‘Uploaded_variation’)

Now I need to create a shared identifier from “mt” to merge with “mtanno” key, which looks like this:

“1_100_C/G”
“1_1000_AA/-”

Any suggestions on how to do this ? Thanx !

That doesn’t look like a min-repped, left aligned variant, which will create problems for joining against your dataset, if your dataset is min-repped and left aligned.

What does mt.filter_rows(mt.locus == hl.locus("1", 1000)).alleles.show() print?

Are you sure the VEP output doesn’t have a standard variant format like 1:100:A:T? We have parsers for that. If not, you’ll need to hack something yourself with split.

Same my problem. Did you find out some solutions @johnnyr?

Unfortunately I had no time to work on it recently, but will look at it again. Will update you once I have found a solution. If you find a good one meanwhile, let me know.

Okay, here is what I can come up with so far.
When performing GATK “LeftAlignAndTrimVariants” I include the option “–split-multi-allelics” to get one line each per alternate allele of a variant.
Next, I have annotated a vcf using ensemble-VEP including “–pick” flag to restrict annotations to only the major consequence. Of course this is a critical point and one might consider alternative prefiltering.
I run VEP twice to get a vcf as output “–vcf” and with “–tab” (.txt) option again, to also get VEP annotations in a tabular format, inluding a column named “Uploaded_variation”. This column serves as key for annotation in hail.
To get the very same key in hail for the vcf, I used a workaround in R with VariantAnnotation tool, to add this key as a seperate entry in the INFO field of the vcf.

#!/usr/bin/env Rscript
library(VariantAnnotation)
samples<-c(“ID1”,“ID2”)
param<-ScanVcfParam(geno=c(“GT”), info=c(“CSQ”), samples = samples)
vcf ← readVcf(“path to VEP annotated vcf”, “GRCh38”, param)
VARID ← DataFrame(Number=1, Type=“String”,
Description=“VARID for annotation in hail”,
row.names=“VARID”)
info(header(vcf)) ← rbind(info(header(vcf)), VARID)
info(vcf)$VARID<-gsub(“\:”,“_”,as.matrix(rownames(vcf)))
writeVcf(vcf,“path to VEP annotated vcf with VARID annotation in info field”,index=F)

Next, I import both into hail:

hl.import_vcf(‘path to VEP annotated vcf with VARID annotation in info field’,reference_genome=‘hg38’).write(‘data/test.mt’, overwrite=True)
mt = hl.read_matrix_table(‘data/test.mt’)
mt = mt.key_rows_by(mt.info.VARID)
mtanno=hl.import_table(‘VEP.txt’)
mtanno = mtanno.key_by(‘Uploaded_variation’)

and finally, I am able to annotate the matrix table in hail with my annotation table:

mt = mt.annotate_rows(VEP = mtanno[mt.info.VARID])

When performing this for the full “tab” output from VEP, I had some problems, probably due to incorrect field types. So I have now tested only with one annotation column (String), which works nicely.
I haven’t performed in depth quality controls if all annotations got right, so treat with care.

Probably there are more elegant solutions in hail, which I would love to test, also.
So any comments are welcome.

1 Like