Import existing VEP annotations from vcf or CSQ

Hello, I am pretty new to Hail, so please excuse me if I ask some basic questions.
I would like to use existing CSQ annotations from a VCF that were created before importing it into Hail. Is there a script to smoothly extract the CSQ information into a row field, as hl.vep would do?

I don’t know of one. Dealing with the VCF CSQ field is a nightmare.

It’s probably worth it to re-vep in Hail, if you’re running on the cloud.
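For anyone who still wants to unpack CSQ by hand, here is a minimal sketch in plain Python. It assumes the usual VEP layout: the field names are listed after "Format: " in the ##INFO=<ID=CSQ,...> header line, transcript entries are comma-separated, and fields within an entry are pipe-separated. The header and the sample value are made up for illustration, and the helper names are hypothetical (they are not part of Hail).

```python
import re

def csq_field_names(info_header_line):
    """Extract the pipe-separated field names from a VEP CSQ header line."""
    match = re.search(r'Format: ([^">]+)', info_header_line)
    if match is None:
        raise ValueError("no 'Format: ...' section found in CSQ header")
    return match.group(1).strip().split("|")

def parse_csq(csq_value, field_names):
    """Turn one CSQ INFO value into a list of dicts, one per transcript entry."""
    records = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        if len(values) != len(field_names):
            raise ValueError(
                f"expected {len(field_names)} fields, got {len(values)}"
            )
        # Empty strings mean "no value"; map them to None.
        records.append(
            {name: (value if value != "" else None)
             for name, value in zip(field_names, values)}
        )
    return records

# Illustrative (made-up) header line and CSQ value.
header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
    'annotations from Ensembl VEP. Format: '
    'Allele|Consequence|IMPACT|SYMBOL|Gene">'
)
fields = csq_field_names(header)
records = parse_csq(
    "G|missense_variant|MODERATE|GENE1|ENSG0001,G|intron_variant|MODIFIER|GENE2|",
    fields,
)
print(records[0]["Consequence"])
```

The field list differs between VEP runs, so it is safer to read it from the header of the actual file than to hard-code it.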

Thanks for the instant reply. Okay, will keep trying. So far I had some issues running it. It starts to create the respective row fields, but then it stops before adding any annotation. The last line of the log is:

2019-10-09 14:23:26 root: INFO: is/hail/codegen/generated/C_etypeEncode_78.ENCODE_o_tuple_of_END_TO_o_struct_of_END_1 instruction count: 7

Any take on this?

What is your pipeline?

I just figured out that customized (local) VEP runs are probably the issue, likely because of how the vep_json_schema is configured. Maybe a solution is to import the annotations (--tab output from VEP) as a separate table and use a variant identifier like chrom_pos_ref_alt for merging.

Are you using Hail on GCP Dataproc? The hailctl utility handles all VEP configuration for you.

No, I want to (have to) use it locally.
I’ve now imported the output from the local VEP run as a table, along with an import of the VCF.
The tab output from VEP only provides variant annotations but no genotypes, which is why I also have to import the VCF.

hl.import_vcf('VEP.vcf', reference_genome='hg38').write('data/test.mt', overwrite=True)
mt = hl.read_matrix_table('data/test.mt')

mtanno = hl.import_table('VEP.txt')
mtanno = mtanno.key_by('Uploaded_variation')

Now I need to create a shared identifier from “mt” to merge with “mtanno” key, which looks like this:

“1_100_C/G”
“1_1000_AA/-”

Any suggestions on how to do this? Thanks!

That doesn’t look like a min-repped, left-aligned variant, which will create problems when joining against your dataset if your dataset is min-repped and left-aligned.
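To make the mismatch concrete: as far as I understand VEP's default identifiers, indels drop the shared leading base that VCF keeps, shift the position by one, and write an empty allele as "-". The sketch below (plain Python, hypothetical helper name, biallelic variants only) builds such an identifier from VCF-style fields under that assumption. It is worth checking against a few real rows before relying on it.

```python
def vep_style_id(contig, pos, ref, alt):
    """Build a VEP-like 'chrom_pos_ref/alt' identifier from VCF-style fields.

    Assumption (not taken from this thread): for indels, VEP strips the
    shared first base, shifts the position by one, and writes an empty
    allele as '-'.
    """
    if len(ref) != len(alt) and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return f"{contig}_{pos}_{ref or '-'}/{alt or '-'}"

print(vep_style_id("1", 100, "C", "G"))    # 1_100_C/G
print(vep_style_id("1", 999, "TAA", "T"))  # 1_1000_AA/-
```

The second call uses a made-up reference base "T". It shows how a VCF deletion at position 999 could end up labelled with position 1000 in the VEP output, so a naive string key built from locus and alleles would not match.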

What does mt.filter_rows(mt.locus == hl.locus("1", 1000)).alleles.show() print?

Are you sure the VEP output doesn’t have a standard variant format like 1:100:A:T? We have parsers for that. If not, you’ll need to hack something yourself with split.
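A minimal sketch of the split approach in plain Python, assuming every identifier has the form chrom_pos_ref/alt as in the two examples above. The function name is hypothetical. It splits from the right so that contig names containing underscores survive. It only separates the pieces and does not convert the "-" notation back to VCF-style alleles, which would need the reference base.

```python
def parse_uploaded_variation(identifier):
    """Split 'chrom_pos_ref/alt' into (contig, position, ref, alt)."""
    # rsplit keeps underscores inside the contig name intact.
    contig, pos, alleles = identifier.rsplit("_", 2)
    # Split on the first slash only; any further alleles stay in alt.
    ref, alt = alleles.split("/", 1)
    return contig, int(pos), ref, alt

print(parse_uploaded_variation("1_100_C/G"))    # ('1', 100, 'C', 'G')
print(parse_uploaded_variation("1_1000_AA/-"))  # ('1', 1000, 'AA', '-')
```

The same idea should carry over to Hail string expressions on the imported table, after which the table could be keyed by locus and alleles instead of the raw string.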