I’m using Spark 2.2.1 with Hail 0.2, and trying to use VEP to annotate some sample records from ClinVar. These are for GRCh38, and I am extending the VEP 92 Docker image (ensemblorg/ensembl-vep:release_92.1). Because the chromosomes in ClinVar are named 1, 2, … rather than chr1, chr2, … as in Hail’s out-of-the-box GRCh38, I’ve had to use ReferenceGenome.from_fasta_file to create a matching reference genome.
However, when I import the ClinVar VCF records and try to annotate them, I get a lot of warnings like:
Hail: WARN: Can't convert JSON value JArray(List(JString(UPI000022DAF4))) to type str at <root>.transcript_consequences.<array>.uniparc
and
Hail: WARN: struct{allele_num: int32, amino_acids: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, uniparc: str, variant_allele: str} has no field appris at <root>.transcript_consequences.<array>
The annotations are not added as expected.
When I run VEP directly in the container with the same invocation (based on the information in the vep.properties file and the VEP Invocation section of the VEP method docs), I get output like:
{"minimised":1,"transcript_consequences":[{"impact":"MODIFIER","swissprot":["P01024"],"consequence_terms":["downstream_gene_variant"],"gene_symbol_source":"HGNC","protein_id":"ENSP00000245907","biotype":"protein_coding","trembl":["V9HWA9"],"gene_id":"ENSG00000125730","gene_symbol":"C3","variant_allele":"C","allele_num":1,"tsl":1,"transcript_id":"ENST00000245907","hgnc_id":"HGNC:1318","distance":1995,"strand":-1,"canonical":1,"ccds":"CCDS32883.1","uniparc":["UPI000013EC9B"],"appris":"P1","gene_pheno":1},{"variant_allele":"C","gene_symbol":"C3","allele_num":1,"tsl":3,"flags":["cds_start_NF","cds_end_NF"],"gene_id":"ENSG00000125730","gene_symbol_source":"HGNC","protein_id":"ENSP00000469744","biotype":"protein_coding","trembl":["M0QYC8"],"consequence_terms":["downstream_gene_variant"],"impact":"MODIFIER","gene_pheno":1,"uniparc":["UPI0002A471D3"],"strand":-1,"transcript_id":"ENST00000596548","distance":4450,"hgnc_id":"HGNC:1318"},{"strand":-1,"impact":"MODIFIER","transcript_id":"ENST00000599668","hgnc_id":"HGNC:1318","distance":2158,"gene_id":"ENSG00000125730","gene_symbol":"C3","gene_pheno":1,"variant_allele":"C","allele_num":1,"tsl":3,"consequence_terms":["downstream_gene_variant"],"biotype":"processed_transcript","gene_symbol_source":"HGNC"},{"biotype":"retained_intron","gene_symbol_source":"HGNC","consequence_terms":["downstream_gene_variant"],"gene_symbol":"C3","variant_allele":"C","gene_pheno":1,"allele_num":1,"tsl":2,"gene_id":"ENSG00000125730","transcript_id":"ENST00000599899","distance":2132,"hgnc_id":"HGNC:1318","impact":"MODIFIER","strand":-1}
(output truncated)
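It looks like Hail expects a single string where VEP 92 returns a list of strings (e.g. uniparc above, and likewise swissprot and trembl, which the struct in the second warning also declares as str). The second warning suggests that Hail's expected struct also has no slot for fields such as appris. Is this mismatch why the annotation fails? How should VEP be used with Hail?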
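To check that guess outside of Hail, here is a minimal sketch in plain Python. It compares the first transcript consequence from the VEP output above against the field types listed in the second warning. The names EXPECTED_TYPES, SAMPLE and check_record are my own helpers for illustration, not part of Hail or VEP, and the type table only covers the fields that appear in this one record.

```python
import json

# Field types copied from the struct in the second Hail warning, mapped to
# Python types (str -> str, int32 -> int, array<str> -> list). Only the
# fields relevant to the sample record are listed; this is not the full struct.
EXPECTED_TYPES = {
    "allele_num": int, "biotype": str, "canonical": int, "ccds": str,
    "consequence_terms": list, "distance": int, "gene_id": str,
    "gene_pheno": int, "gene_symbol": str, "gene_symbol_source": str,
    "hgnc_id": str, "impact": str, "protein_id": str, "strand": int,
    "swissprot": str, "transcript_id": str, "trembl": str,
    "uniparc": str, "variant_allele": str,
}

# First transcript consequence from the VEP 92 output quoted above.
SAMPLE = json.loads(
    '{"impact":"MODIFIER","swissprot":["P01024"],'
    '"consequence_terms":["downstream_gene_variant"],'
    '"gene_symbol_source":"HGNC","protein_id":"ENSP00000245907",'
    '"biotype":"protein_coding","trembl":["V9HWA9"],'
    '"gene_id":"ENSG00000125730","gene_symbol":"C3","variant_allele":"C",'
    '"allele_num":1,"tsl":1,"transcript_id":"ENST00000245907",'
    '"hgnc_id":"HGNC:1318","distance":1995,"strand":-1,"canonical":1,'
    '"ccds":"CCDS32883.1","uniparc":["UPI000013EC9B"],"appris":"P1",'
    '"gene_pheno":1}'
)


def check_record(record, expected):
    """Return (fields missing from the expected struct, fields whose type differs)."""
    unknown = sorted(k for k in record if k not in expected)
    mismatched = sorted(
        k for k, v in record.items()
        if k in expected and not isinstance(v, expected[k])
    )
    return unknown, mismatched


unknown, mismatched = check_record(SAMPLE, EXPECTED_TYPES)
print("fields not in the expected struct:", unknown)
print("fields with a different type:", mismatched)
```

For this record the check reports appris and tsl as fields the struct does not have, and swissprot, trembl and uniparc as lists where a str is expected, which lines up with both warnings.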