VEP fails to generate annotations

I’m using Spark 2.2.1 with Hail 0.2, and trying to use VEP to annotate some sample records from ClinVar. These are for GRCh38, and I am extending the VEP 92 Docker image (ensemblorg/ensembl-vep:release_92.1). This means I’ve had to use ReferenceGenome.from_fasta_file to create a reference genome, as the chromosomes in ClinVar are named 1, 2, … instead of chr1, chr2, … as in the out-of-the-box Hail GRCh38.
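For reference, my setup looks roughly like this (the paths, reference name, and contig lists are illustrative, not my exact configuration):

```python
import hail as hl

# Register a reference genome whose contig names (1, 2, ..., X, Y, MT)
# match the ClinVar VCF, rather than Hail's built-in GRCh38 (chr1, ...).
rg = hl.ReferenceGenome.from_fasta_file(
    'GRCh38_ensembl',
    '/data/GRCh38.fa.gz',
    '/data/GRCh38.fa.gz.fai',
    x_contigs=['X'], y_contigs=['Y'], mt_contigs=['MT'])

# Import ClinVar against that reference, then annotate with VEP.
mt = hl.import_vcf('/data/clinvar.vcf.gz', force_bgz=True,
                   reference_genome=rg)
annotated = hl.vep(mt, '/vep/vep.properties')
```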

However, when I import the ClinVar VCF records and try to annotate them, I get a lot of warnings like:
Hail: WARN: Can't convert JSON value JArray(List(JString(UPI000022DAF4))) to type str at <root>.transcript_consequences.<array>.uniparc
and
Hail: WARN: struct{allele_num: int32, amino_acids: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, uniparc: str, variant_allele: str} has no field appris at <root>.transcript_consequences.<array>
The annotations are not added as expected.

When I run VEP in the container in the same way (based on the information in the vep.properties file and the VEP Invocation section of the VEP method docs), I get something like:
{"minimised":1,"transcript_consequences":[{"impact":"MODIFIER","swissprot":["P01024"],"consequence_terms":["downstream_gene_variant"],"gene_symbol_source":"HGNC","protein_id":"ENSP00000245907","biotype":"protein_coding","trembl":["V9HWA9"],"gene_id":"ENSG00000125730","gene_symbol":"C3","variant_allele":"C","allele_num":1,"tsl":1,"transcript_id":"ENST00000245907","hgnc_id":"HGNC:1318","distance":1995,"strand":-1,"canonical":1,"ccds":"CCDS32883.1","uniparc":["UPI000013EC9B"],"appris":"P1","gene_pheno":1},{"variant_allele":"C","gene_symbol":"C3","allele_num":1,"tsl":3,"flags":["cds_start_NF","cds_end_NF"],"gene_id":"ENSG00000125730","gene_symbol_source":"HGNC","protein_id":"ENSP00000469744","biotype":"protein_coding","trembl":["M0QYC8"],"consequence_terms":["downstream_gene_variant"],"impact":"MODIFIER","gene_pheno":1,"uniparc":["UPI0002A471D3"],"strand":-1,"transcript_id":"ENST00000596548","distance":4450,"hgnc_id":"HGNC:1318"},{"strand":-1,"impact":"MODIFIER","transcript_id":"ENST00000599668","hgnc_id":"HGNC:1318","distance":2158,"gene_id":"ENSG00000125730","gene_symbol":"C3","gene_pheno":1,"variant_allele":"C","allele_num":1,"tsl":3,"consequence_terms":["downstream_gene_variant"],"biotype":"processed_transcript","gene_symbol_source":"HGNC"},{"biotype":"retained_intron","gene_symbol_source":"HGNC","consequence_terms":["downstream_gene_variant"],"gene_symbol":"C3","variant_allele":"C","gene_pheno":1,"allele_num":1,"tsl":2,"gene_id":"ENSG00000125730","transcript_id":"ENST00000599899","distance":2132,"hgnc_id":"HGNC:1318","impact":"MODIFIER","strand":-1}
(output truncated)

It looks like Hail might be expecting a single string rather than a list of them (e.g. for uniparc above). Is this why it fails? How should VEP be used with Hail?

Hail’s VEP schema is hardcoded, and won’t work for versions significantly different from the one it’s built for (which I believe is 85).

We should document the version that the schema is written for, for sure.

@konradjk you know more – is this accurate?

Thanks for the speedy response! It’d definitely be very helpful to have the compatible versions of VEP documented on that reference page.

Is there any intention to support newer versions?

We definitely want to support as much as possible, but supporting versions with different schemas might involve somehow parameterizing the schema (maybe in the VEP config?). That could lead to very hostile interfaces, given the size of the VEP schema!

If multiple schema versions were to be supported, I think you’re right that it would have to be an option in the VEP properties file. I appreciate that this could be a lot of work, though, and that it would significantly increase the amount of testing required.

For the moment, having the allowed VEP versions documented would be great.

Can do that much more easily – is it a big pain for you to move to 85?

I think it should be possible, although it would be quite nice to be able to use ensembl-vep rather than ensembl-tools. Checking https://github.com/Ensembl/ensembl-vep/releases, it looks like the earliest release is 87.

87 could work. I’m not sure how to evaluate that, though, other than by trying it :frowning:

(I think the schema creation was a manual process in the first place, involving looking at the output!)

I’ve had a chance to look, and 87 seems to work, which is good!

I’m having some issues with LOFTEE though:

Failed to instantiate plugin LoF: Can't open /humgen/atgu1/fs03/birnbaum/loftee-dev/splice_data/donor_motifs/ese.txt: No such file or directory at /root/.vep/loftee/loftee_splice_utils.pl line 212

The issue is that LOFTEE has some data files that aren’t resolved properly if the loftee_path parameter is not passed to the plugin (see LOFTEE usage). This seems to be a requirement for LOFTEE now – https://github.com/konradjk/loftee/issues/24 – though perhaps it wasn’t in an earlier version. Would you be open to making this an additional property that could be set in the vep.properties file? Do you know which version of LOFTEE the VEP annotation was originally implemented to work with?
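A sketch of what I believe the plugin argument needs to look like, based on the LOFTEE README (paths and the other plugin options here are illustrative):

```shell
# Hypothetical VEP invocation fragment. Without loftee_path:, LOFTEE
# cannot locate its bundled data files (e.g. splice_data/donor_motifs/ese.txt)
# and fails to instantiate the LoF plugin.
./vep --format vcf --json --everything --allele_number \
  --dir_plugins /root/.vep/loftee \
  --plugin LoF,loftee_path:/root/.vep/loftee,human_ancestor_fa:/data/human_ancestor.fa.gz
```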

summoning @konradjk

Best to use the most recent versioned LOFTEE: https://github.com/konradjk/loftee/tree/v0.3-beta

Sorry, I’ve been away on vacation then busy with other things!

Thanks both for your help with this. For my workflow it would be good to run VEP with the csq parameter set to true. That unfortunately causes an exception after VEP has run. Am I using this wrong, or is this something I should raise an issue for on GitHub?
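The call is roughly as follows (the config path is illustrative):

```python
import hail as hl

# With csq=True, hl.vep requests CSQ-style (VCF consequence string)
# output rather than the JSON schema. The annotation itself completes
# ("annotated 101 variants"), but displaying the result then fails.
annotated = hl.vep(mt, '/vep/vep.properties', csq=True)
annotated.rows().select('vep').show()
```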

2018-06-19 14:35:22 Hail: INFO: vep: annotated 101 variants
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/hail/python/hail/typecheck/check.py", line 547, in wrapper
    return f(*args_, **kwargs_)
  File "/opt/hail/python/hail/table.py", line 1195, in show
    print(self._show(n,width, truncate, types))
  File "/opt/hail/python/hail/table.py", line 1198, in _show
    return self._jt.showString(n, joption(truncate), types, width)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/hail/python/hail/utils/java.py", line 196, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: MatchError: [Ljava.lang.String;@11e39d67 (of class [Ljava.lang.String;)

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 194.0 failed 1 times, most recent failure: Lost task 0.0 in stage 194.0 (TID 207, localhost, executor driver): scala.MatchError: [Ljava.lang.String;@11e39d67 (of class [Ljava.lang.String;)
	at is.hail.annotations.RegionValueBuilder.addAnnotation(RegionValueBuilder.scala:479)
	at is.hail.methods.VEP$$anonfun$9$$anonfun$apply$4.apply(VEP.scala:350)
	at is.hail.methods.VEP$$anonfun$9$$anonfun$apply$4.apply(VEP.scala:345)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
	at is.hail.rvd.OrderedRVD$$anonfun$apply$16$$anon$3.next(OrderedRVD.scala:926)
	at is.hail.rvd.OrderedRVD$$anonfun$apply$16$$anon$3.next(OrderedRVD.scala:920)
	at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
	at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at is.hail.rvd.RVD$$anonfun$3$$anon$1.hasNext(RVD.scala:226)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at is.hail.rvd.OrderedRVD$$anonfun$apply$16$$anon$3.hasNext(OrderedRVD.scala:923)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:1004)
	at is.hail.utils.richUtils.RichIterator$$anon$5.isValid(RichIterator.scala:21)
	at is.hail.utils.StagingIterator.isValid(FlipbookIterator.scala:46)
	at is.hail.utils.FlipbookIterator$$anon$1.calculateValidity(FlipbookIterator.scala:178)
	at is.hail.utils.FlipbookIterator$ValidityCachingStateMachine$class.refreshValidity(FlipbookIterator.scala:167)
	at is.hail.utils.FlipbookIterator$$anon$1.refreshValidity(FlipbookIterator.scala:176)
	at is.hail.utils.FlipbookIterator$ValidityCachingStateMachine$class.$init$(FlipbookIterator.scala:171)
	at is.hail.utils.FlipbookIterator$$anon$1.<init>(FlipbookIterator.scala:176)
	at is.hail.utils.FlipbookIterator.staircased(FlipbookIterator.scala:176)
	at is.hail.utils.FlipbookIterator.cogroup(FlipbookIterator.scala:212)
	at is.hail.utils.FlipbookIterator.leftJoinDistinct(FlipbookIterator.scala:281)
	at is.hail.annotations.OrderedRVIterator.leftJoinDistinct(OrderedRVIterator.scala:59)
	at is.hail.rvd.KeyedOrderedRVD$$anonfun$6.apply(KeyedOrderedRVD.scala:84)
	at is.hail.rvd.KeyedOrderedRVD$$anonfun$6.apply(KeyedOrderedRVD.scala:84)
	at is.hail.rvd.KeyedOrderedRVD$$anonfun$orderedJoinDistinct$1.apply(KeyedOrderedRVD.scala:95)
	at is.hail.rvd.KeyedOrderedRVD$$anonfun$orderedJoinDistinct$1.apply(KeyedOrderedRVD.scala:92)
	at is.hail.sparkextras.ContextRDD$$anonfun$czipPartitions$1$$anonfun$apply$26.apply(ContextRDD.scala:355)
	at is.hail.sparkextras.ContextRDD$$anonfun$czipPartitions$1$$anonfun$apply$26.apply(ContextRDD.scala:355)
	at is.hail.sparkextras.ContextRDD$$anonfun$cmapPartitionsWithIndex$1$$anonfun$apply$22$$anonfun$apply$23.apply(ContextRDD.scala:308)
	at is.hail.sparkextras.ContextRDD$$anonfun$cmapPartitionsWithIndex$1$$anonfun$apply$22$$anonfun$apply$23.apply(ContextRDD.scala:308)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at is.hail.rvd.OrderedRVD$$anonfun$apply$16$$anon$3.hasNext(OrderedRVD.scala:923)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at is.hail.rvd.OrderedRVD$$anonfun$apply$16$$anon$3.hasNext(OrderedRVD.scala:923)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at is.hail.rvd.OrderedRVD$$anonfun$apply$16$$anon$3.hasNext(OrderedRVD.scala:923)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at is.hail.utils.package$.getIteratorSizeWithMaxN(package.scala:347)
	at is.hail.sparkextras.ContextRDD$$anonfun$12.apply(ContextRDD.scala:442)
	at is.hail.sparkextras.ContextRDD$$anonfun$12.apply(ContextRDD.scala:442)
	at is.hail.sparkextras.ContextRDD$$anonfun$runJob$1.apply(ContextRDD.scala:469)
	at is.hail.sparkextras.ContextRDD$$anonfun$runJob$1.apply(ContextRDD.scala:467)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This is a bug, and even if it were a user error, you should never see an error message like this. Can you post the issue to github?

No problem – raised as issue 3790 (for some reason links to GitHub are forbidden).

links to github are forbidden??

oh, must be a discourse user permissions thing. let me look into it.

made the spam filter more lenient.
