VEP annotation errors with ClinVar

Hi all,

I’ve gotten VEP annotations working well with Hail, and now we want to use it to consistently annotate all of our variant datasets. However, I’m running into errors trying to annotate the public ClinVar VCF file (version 20171029). I’m able to import to VDS and run split_multi, variant_qc, and deduplicate with no problems, and some chromosomes are annotated just fine:

vds_annot = (vds
    .filter_variants_expr('v.contig == "21"')
    .vep('vep.properties')
)

2017-12-01 18:40:42 Hail: INFO: vep: annotated 22284 variants

But when running other chromosomes, somewhere there is a variant causing problems:

FatalError: NumberFormatException: For input string: "0.09848,-:0.01515,-:0"

I’m not sure how to debug this, since no part of this string can be found by grep in the original VCF, and the original variant is nowhere in the error message.

Any advice on what could be going wrong, or even how to begin to debug this?

Thanks!
Jake


FatalError Traceback (most recent call last)
in ()
----> 1 vds_annot = vds.filter_variants_expr('v.contig == "19"').vep('…/vdstools/vep.properties')

in vep(self, config, block_size, root, csq)

/home/hadoop/hail/python/hail/java.pyc in handle_py4j(func, *args, **kwargs)
119 raise FatalError('%s\n\nJava stack trace:\n%s\n'
120 'Hail version: %s\n'
--> 121 'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
122 except py4j.protocol.Py4JError as e:
123 if e.args[0].startswith('An error occurred while calling'):

FatalError: NumberFormatException: For input string: "0.09848,-:0.01515,-:0"

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 268 in stage 18.0 failed 4 times, most recent failure: Lost task 268.3 in stage 18.0 (TID 2138, ip-10-10-112-170.ec2.internal, executor 17): java.lang.NumberFormatException: For input string: "0.09848,-:0.01515,-:0"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:284)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:324)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$15.apply(AnnotationImpex.scala:363)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$15.apply(AnnotationImpex.scala:360)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:360)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$16.apply(AnnotationImpex.scala:385)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$16.apply(AnnotationImpex.scala:385)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:385)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$15.apply(AnnotationImpex.scala:363)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$15.apply(AnnotationImpex.scala:360)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:360)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:302)
at is.hail.methods.VEP$$anonfun$6$$anonfun$apply$2$$anonfun$14.apply(VEP.scala:353)
at is.hail.methods.VEP$$anonfun$6$$anonfun$apply$2$$anonfun$14.apply(VEP.scala:322)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at is.hail.methods.VEP$$anonfun$6$$anonfun$apply$2.apply(VEP.scala:377)
at is.hail.methods.VEP$$anonfun$6$$anonfun$apply$2.apply(VEP.scala:310)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at is.hail.methods.VEP$.annotate(VEP.scala:389)
at is.hail.variant.VariantSampleMatrix.vep(VariantSampleMatrix.scala:2057)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

java.lang.NumberFormatException: For input string: "0.09848,-:0.01515,-:0"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:284)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:324)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$15.apply(AnnotationImpex.scala:363)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$15.apply(AnnotationImpex.scala:360)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:360)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$16.apply(AnnotationImpex.scala:385)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$16.apply(AnnotationImpex.scala:385)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:385)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$15.apply(AnnotationImpex.scala:363)
at is.hail.expr.JSONAnnotationImpex$$anonfun$importAnnotation$15.apply(AnnotationImpex.scala:360)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:360)
at is.hail.expr.JSONAnnotationImpex$.importAnnotation(AnnotationImpex.scala:302)
at is.hail.methods.VEP$$anonfun$6$$anonfun$apply$2$$anonfun$14.apply(VEP.scala:353)
at is.hail.methods.VEP$$anonfun$6$$anonfun$apply$2$$anonfun$14.apply(VEP.scala:322)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at is.hail.methods.VEP$$anonfun$6$$anonfun$apply$2.apply(VEP.scala:377)
at is.hail.methods.VEP$$anonfun$6$$anonfun$apply$2.apply(VEP.scala:310)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Hail version: 0.1-e8d0f38
Error summary: NumberFormatException: For input string: "0.09848,-:0.01515,-:0"
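When an error like this does not name the offending variant, one generic way to start debugging is to bisect the input until a single failing record is left. The sketch below is plain Python and not part of Hail; `find_failing_item` and the `fails` callback are hypothetical names, and it assumes a batch fails exactly when it contains at least one bad record.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_failing_item(items: Sequence[T],
                      fails: Callable[[Sequence[T]], bool]) -> T:
    """Binary-search for one item that makes a batch fail.

    Assumption: fails(batch) is True if and only if the batch
    contains at least one bad item.
    """
    if not fails(items):
        raise ValueError("the full batch does not fail")
    lo, hi = 0, len(items)  # invariant: items[lo:hi] contains a bad item
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(items[lo:mid]):
            hi = mid  # a bad item is in the left half
        else:
            lo = mid  # otherwise it must be in the right half
    return items[lo]


# Toy usage: positions 0..99, with position 57 standing in for the bad variant.
positions = list(range(100))
culprit = find_failing_item(positions, lambda batch: 57 in batch)
```

In this thread's setting, `fails` would be a user-written wrapper that restricts the dataset to the given positions, runs the annotation, and returns True if a FatalError is raised. Each probe reruns VEP, so this is only practical on a single chromosome or a small interval.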

I think we’ve seen this before. @bw2?

From what I can gather from the source code, VEP is running successfully but its JSON output is formatted wrong?

I added a bit of logging and found that VEP is producing weird strings for ExAC maf values. The input is just a normal SNP:

FatalError: HailException: vep output {"input":"2\t178769661\t.\tA\tT\t.\t.\tGT","colocated_variants":[{"exac_oth_maf":"-:0,-:0","exac_sas_maf":"-:0.0007621,-:0.0001693","exac_adj_maf":"-:0.0003146,-:7.491e-05","exac_afr_maf":"-:0,-:0.0003885","strand":1,"id":"rs201200643","exac_eas_maf":"-:0.003343,-:0","allele_string":"A/-","exac_nfe_maf":"-:5.561e-05,-:2.781e-05","exac_amr_maf":"-:0.0001449,-:0","end":178769661,"exac_fin_maf":"-:0,-:0","exac_maf":"-:3.899e-04,-:4.977e-05","start":178769661}],"assembly_name":"GRCh38","end":178769661,"seq_region_name":"2","variant_class":"SN…
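To check raw VEP JSON for this problem outside of Hail, something like the following sketch could flag frequency fields that are not plain numbers. It assumes one JSON record per line with the `colocated_variants` layout shown above; the function name is made up for illustration, and the sample record is an abbreviated copy of the output above.

```python
import json


def non_numeric_maf_fields(vep_json_line: str) -> list:
    """Return (field, value) pairs for *_maf fields that do not parse as one float."""
    record = json.loads(vep_json_line)
    bad = []
    for coloc in record.get("colocated_variants", []):
        for key, value in coloc.items():
            if not key.endswith("_maf"):
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                bad.append((key, value))
    return bad


# Abbreviated version of the record from the logging output above.
sample = json.dumps({
    "input": "2\t178769661\t.\tA\tT\t.\t.\tGT",
    "colocated_variants": [{
        "exac_oth_maf": "-:0,-:0",
        "exac_sas_maf": "-:0.0007621,-:0.0001693",
        "strand": 1,
        "id": "rs201200643",
        "allele_string": "A/-",
    }],
})
flagged = non_numeric_maf_fields(sample)
```

Here `flagged` holds both `exac_*_maf` entries, since neither string is a single number.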

Will dig into VEP to find out why.

Yep, it’s producing a field with -: in it when a number is expected.

I believe that Ben Weisburd has seen this error before, but I don’t remember how he fixed it. I’ve asked him to post here when he gets a chance.

Thanks, looking forward to it!

We handle “-:” on a single number, but it seems VEP is producing a comma-separated sequence of such numbers as a JSON string. Is that allowed?
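For illustration, a tolerant parser for such values might look like the sketch below. This is not how Hail handles the field; it assumes each comma-separated entry is either a bare number or `allele:frequency`, which is inferred only from the strings seen in this thread, and `parse_allele_frequencies` is a hypothetical name.

```python
def parse_allele_frequencies(raw: str) -> list:
    """Split a string like '-:0.0007621,-:0.0001693' into (allele, frequency) pairs.

    Entries without an 'allele:' prefix get None as the allele.
    """
    pairs = []
    for entry in raw.split(","):
        # rpartition leaves the whole entry in `freq` when there is no colon.
        allele, sep, freq = entry.rpartition(":")
        pairs.append((allele if sep else None, float(freq)))
    return pairs


# The two strings reported in this thread.
from_vep_output = parse_allele_frequencies("-:0.0007621,-:0.0001693")
from_error_message = parse_allele_frequencies("0.09848,-:0.01515,-:0")
```

Whether a list of pairs is the right target type depends on how the annotation schema declares these fields, which this sketch does not address.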

Hi guys, I don’t think I’ve seen that error. Maybe it’s a VEP version issue?
I’m able to run vds.vep(…) on the latest ClinVar release, though my VDS is generated through a custom pre-processing pipeline that’s based on the ClinVar .xml output. The code is in:
https://github.com/macarthur-lab/clinvar/tree/master # xml => .tsv
https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/utils/create_reference_datasets/create_clinvar_vds.py # tsv => vds

I’m using hail-0.1 with VEP 85_GRCh38, initialized using gs://hail-common/vep/vep/GRCh38/vep85-GRCh38-init.sh
(https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/create_cluster_GRCh38.py),
and it annotates that 2:178769661:A:T variant without error.