Hello, I am pretty new to hail, so please excuse if I might ask some basic questions.
I would like to use existing CSQ annotations from a vcf which have been created before importing it to hail. Is there some script to smoothly extract the CSQ information in a row field, as hl.vep would do?
I donât know of one. Dealing with the VCF CSQ field is a nightmare.
Itâs probably worth it to re-vep in Hail, if youâre running on the cloud.
Thanks for the instant reply. Okay, will keep trying. So far I had some issues running it. It starts to create the respective row fields, but then it stops before adding any annotation. The last line of the log is:
2019-10-09 14:23:26 root: INFO: is/hail/codegen/generated/C_etypeEncode_78.ENCODE_o_tuple_of_END_TO_o_struct_of_END_1 instruction count: 7
any take on this ?
what is your pipeline?
Just figured out that customized VEP runs (local) are probably an issue. Probably due to configuration of the vep_json_schema as I suppose. Maybe a solution is to import annotation (âtab output from VEP) as extra table and use the variant identifier like chrom_pos_ref_alt for merging.
Are you using Hail on GCP Dataproc? The hailctl
utility handles all VEP configuration for you.
No, I want to (have to) use it locally.
Iâve now imported output from local vep as table along with an import of the vcf.
The tab output from vep only provides variant annotation but no genotypes, so this is why I have to also import the vcf.
hl.import_vcf(âVEP.vcfâ,reference_genome=âhg38â).write(âdata/test.mtâ, overwrite=True)
mt = hl.read_matrix_table(âdata/test.mtâ)
mtanno=hl.import_table(âVEP.txtâ)
mtanno = mtanno.key_by(âUploaded_variationâ)
Now I need to create a shared identifier from âmtâ to merge with âmtannoâ key, which looks like this:
â1_100_C/Gâ
â1_1000_AA/-â
Any suggestions on how to do this ? Thanx !
That doesnât look like a min-repped, left aligned variant, which will create problems for joining against your dataset, if your dataset is min-repped and left aligned.
What does mt.filter_rows(mt.locus == hl.locus("1", 1000)).alleles.show()
print?
Are you sure the VEP output doesnât have a standard variant format like 1:100:A:T
? We have parsers for that. If not, youâll need to hack something yourself with split
.
Same my problem. Did you find out some solutions @johnnyr?
Unfortunately I had no time to work on it recently, but will look at it again. Will update you once I have found a solution. If you find a good one meanwhile, let me know.
Okay, here is what I can come up with so far.
When performing GATK âLeftAlignAndTrimVariantsâ I include the option ââsplit-multi-allelicsâ to get one line each per alternate allele of a variant.
Next, I have annotated a vcf using ensemble-VEP including ââpickâ flag to restrict annotations to only the major consequence. Of course this is a critical point and one might consider alternative prefiltering.
I run VEP twice to get a vcf as output ââvcfâ and with ââtabâ (.txt) option again, to also get VEP annotations in a tabular format, inluding a column named âUploaded_variationâ. This column serves as key for annotation in hail.
To get the very same key in hail for the vcf, I used a workaround in R with VariantAnnotation tool, to add this key as a seperate entry in the INFO field of the vcf.
#!/usr/bin/env Rscript
library(VariantAnnotation)
samples<-c(âID1â,âID2â)
param<-ScanVcfParam(geno=c(âGTâ), info=c(âCSQâ), samples = samples)
vcf â readVcf(âpath to VEP annotated vcfâ, âGRCh38â, param)
VARID â DataFrame(Number=1, Type=âStringâ,
Description=âVARID for annotation in hailâ,
row.names=âVARIDâ)
info(header(vcf)) â rbind(info(header(vcf)), VARID)
info(vcf)$VARID<-gsub(â\:â,â_â,as.matrix(rownames(vcf)))
writeVcf(vcf,âpath to VEP annotated vcf with VARID annotation in info fieldâ,index=F)
Next, I import both into hail:
hl.import_vcf(âpath to VEP annotated vcf with VARID annotation in info fieldâ,reference_genome=âhg38â).write(âdata/test.mtâ, overwrite=True)
mt = hl.read_matrix_table(âdata/test.mtâ)
mt = mt.key_rows_by(mt.info.VARID)
mtanno=hl.import_table(âVEP.txtâ)
mtanno = mtanno.key_by(âUploaded_variationâ)
and finally, I am able to annotate the matrix table in hail with my annotation table:
mt = mt.annotate_rows(VEP = mtanno[mt.info.VARID])
When performing this for the full âtabâ output from VEP, I had some problems, probably due to incorrect field types. So I have now tested only with one annotation column (String), which works nicely.
I havenât performed in depth quality controls if all annotations got right, so treat with care.
Probably there are more elegant solutions in hail, which I would love to test, also.
So any comments are welcome.