Looking for a joint genotype vcf file to test Hail with. If I understand correctly, it needs to be called with GATK using GRCh37.
I keep running into '.' string literal errors
with 1000G and other examples.
Thanks for the help =)
If you run
import hail as hl
hl.utils.get_1kg('/path/to/dir')
you’ll download a downsampled chunk of 1KG that we use for tutorials.
Could you paste a specific error + stack trace though? Would like to see. Also note that Hail does NOT require GRCh37!
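For reference, a minimal sketch of loading that tutorial data afterwards, assuming the file layout used in the Hail GWAS tutorial (the exact matrix table name may differ between versions):

import hail as hl
hl.utils.get_1kg('data/')                   # downloads the downsampled 1KG data into data/
mt = hl.read_matrix_table('data/1kg.mt')    # load the downloaded MatrixTable
mt.describe()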
Thanks Tim!
mt = hl.import_vcf('NA12878.compound_heterozygous.vcf').write('NA12878.compound_heterozygous.mt', overwrite=True)
2019-01-02 19:38:26 Hail: INFO: Ordering unsorted dataset with network shuffle
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
<ipython-input-28-376ae1400f2a> in <module>
----> 1 mt = hl.import_vcf('NA12878.compound_heterozygous.vcf').write('NA12878.compound_heterozygous.mt', overwrite=True)
<decorator-gen-920> in write(self, output, overwrite, stage_locally, _codec_spec)
~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
558 def wrapper(__original_func, *args, **kwargs):
559 args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 560 return __original_func(*args_, **kwargs_)
561
562 return wrapper
~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec)
2154
2155 writer = MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec)
-> 2156 Env.hc()._backend.interpret(MatrixWrite(self._mir, writer))
2157
2158 def globals_table(self) -> Table:
~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/backend/backend.py in interpret(self, ir)
23
24 typ = dtype(jir.typ().toString())
---> 25 result = Env.hail().expr.ir.Interpret.interpretPyIR(code, {}, ir_map)
26
27 return typ._from_json(result)
~/anaconda3/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/utils/java.py in deco(*args, **kwargs)
208 raise FatalError('%s\n\nJava stack trace:\n%s\n'
209 'Hail version: %s\n'
--> 210 'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
211 except pyspark.sql.utils.CapturedException as e:
212 raise FatalError('%s\n\nJava stack trace:\n%s\n'
FatalError: VCFParseError: invalid character '.' in integer literal
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 40.0 failed 1 times, most recent failure: Lost task 0.0 in stage 40.0 (TID 120, localhost, executor driver): is.hail.utils.HailException: NA12878.compound_heterozygous.vcf:column 187: invalid character '.' in integer literal
... _ALIGN=FP GT:AD:DP:GQ:PL 1/1:0,5:3:9.03:117,9,0
^
offending line: 1 877831 rs6672356 T C 83.49 PASS DB;AC=2;AF=1.00;AN=2;DP=5;...
see the Hail log for the full offending line
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:20)
at is.hail.utils.package$.fatal(package.scala:26)
at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:834)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: is.hail.io.vcf.VCFParseError: invalid character '.' in integer literal
at is.hail.io.vcf.VCFLine.parseError(LoadVCF.scala:44)
at is.hail.io.vcf.VCFLine.numericValue(LoadVCF.scala:48)
at is.hail.io.vcf.VCFLine.parseFormatInt(LoadVCF.scala:339)
at is.hail.io.vcf.VCFLine.parseAddFormatInt(LoadVCF.scala:350)
at is.hail.io.vcf.FormatParser.parseAddField(LoadVCF.scala:555)
at is.hail.io.vcf.FormatParser.parse(LoadVCF.scala:598)
at is.hail.io.vcf.LoadVCF$.parseLine(LoadVCF.scala:897)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$15.apply(LoadVCF.scala:1109)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$15.apply(LoadVCF.scala:1109)
at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:812)
... 13 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1533)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1521)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1520)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1520)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1748)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1703)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1692)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:153)
at is.hail.io.RichContextRDDRegionValue$.writeRowsSplit$extension(RowStore.scala:1424)
at is.hail.rvd.RVD.writeRowsSplit(RVD.scala:692)
at is.hail.expr.ir.MatrixValue.write(MatrixValue.scala:112)
at is.hail.expr.ir.MatrixNativeWriter.apply(MatrixWriter.scala:28)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:730)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:107)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:77)
at is.hail.expr.ir.Interpret$.interpretPyIR(Interpret.scala:31)
at is.hail.expr.ir.Interpret$.interpretPyIR(Interpret.scala:25)
at is.hail.expr.ir.Interpret.interpretPyIR(Interpret.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
is.hail.utils.HailException: NA12878.compound_heterozygous.vcf:column 187: invalid character '.' in integer literal
... _ALIGN=FP GT:AD:DP:GQ:PL 1/1:0,5:3:9.03:117,9,0
^
offending line: 1 877831 rs6672356 T C 83.49 PASS DB;AC=2;AF=1.00;AN=2;DP=5;...
see the Hail log for the full offending line
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:20)
at is.hail.utils.package$.fatal(package.scala:26)
at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:834)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
is.hail.io.vcf.VCFParseError: invalid character '.' in integer literal
at is.hail.io.vcf.VCFLine.parseError(LoadVCF.scala:44)
at is.hail.io.vcf.VCFLine.numericValue(LoadVCF.scala:48)
at is.hail.io.vcf.VCFLine.parseFormatInt(LoadVCF.scala:339)
at is.hail.io.vcf.VCFLine.parseAddFormatInt(LoadVCF.scala:350)
at is.hail.io.vcf.FormatParser.parseAddField(LoadVCF.scala:555)
at is.hail.io.vcf.FormatParser.parse(LoadVCF.scala:598)
at is.hail.io.vcf.LoadVCF$.parseLine(LoadVCF.scala:897)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$15.apply(LoadVCF.scala:1109)
at is.hail.io.vcf.MatrixVCFReader$$anonfun$15.apply(LoadVCF.scala:1109)
at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:812)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.6-13183a31af2c
Error summary: VCFParseError: invalid character '.' in integer literal
ah, the problem is that the GQ field is a float instead of an integer – this is a violation of the VCF 4.2 spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf
At some point we’ll relax our parser to use the type declared in the header, but we use htsjdk to parse VCF headers for now, and it uses the known types for fields in the spec.
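If you need to load this particular file before that happens, one workaround (a rough preprocessing sketch, not a Hail feature; the output file name is made up) is to truncate the float GQ values to integers so the file conforms to the spec before importing:

in_path = 'NA12878.compound_heterozygous.vcf'
out_path = 'NA12878.compound_heterozygous.fixed.vcf'   # hypothetical output name

with open(in_path) as fin, open(out_path, 'w') as fout:
    for line in fin:
        if line.startswith('#'):
            fout.write(line)               # keep header lines unchanged
            continue
        fields = line.rstrip('\n').split('\t')
        fmt_keys = fields[8].split(':')    # FORMAT column
        if 'GQ' in fmt_keys:
            gq_idx = fmt_keys.index('GQ')
            for i in range(9, len(fields)):            # per-sample columns
                entries = fields[i].split(':')
                if gq_idx < len(entries) and entries[gq_idx] not in ('.', ''):
                    entries[gq_idx] = str(int(float(entries[gq_idx])))   # e.g. 9.03 -> 9
                fields[i] = ':'.join(entries)
        fout.write('\t'.join(fields) + '\n')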
And here is what triggered the thought about GRCh37. I was trying different dummy VCFs to get it to work.
mt = hl.import_vcf('1000G_1MB_exons.genotypes.vcf').write('1000G_1MB_exons.genotypes.vcf', overwrite=True)
FatalError: HailException: Invalid locus `3:198723451' found. Position `198723451' is not within the range [1-198022430] for reference genome `GRCh37'.
Taking something away from this, I would suggest making the markdown here a bit more explicit: “We have built-in dummy data for you.”
The reference genome is a parameter of import_vcf; see the docs here.
These errors on import are much better than silently importing incorrect data! You can use reference_genome=None to import as string:int chrom:pos instead of a reference-genome-parameterized Locus type.
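A minimal sketch of both options, reusing the file name from the post above (GRCh37 is the default if you don't pass anything):

import hail as hl

# validate loci against a named reference genome
mt = hl.import_vcf('1000G_1MB_exons.genotypes.vcf', reference_genome='GRCh38')

# or skip reference validation and key rows by (str contig, int pos)
mt = hl.import_vcf('1000G_1MB_exons.genotypes.vcf', reference_genome=None)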
Will update the tutorial to be clear that this is downloading public data which we host.
I also added an issue to add an example of using a reference genome with import_vcf: https://github.com/hail-is/hail/issues/5086
So, a year has passed and, apparently, I’m facing a similar problem, which I’m failing to understand how to address.
I am trying to process my data with:
vds = hl.import_vcf('s3://.../merged2.vcf.gz',force_bgz=True)
v=hl.vep(vds,"s3://.../vep-configuration.json")
v.write('s3://.../merged2.ht', stage_locally=True, overwrite=True)
But it fails with:
FatalError: VCFParseError: empty integer field
Java stack trace:
java.lang.RuntimeException: error while applying lowering 'InterpretNonCompilable'
at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:26)
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 4.0 failed 4 times, most recent failure: Lost task 9.3 in stage 4.0 (TID 2494, ip-172-31-44-70.eu-west-2.compute.internal, executor 3): is.hail.utils.HailException: merged2.vcf.gz:column 1423: empty integer field
... GT:AD:DP:GQ:PGT:PID:PL ./.:0,0:0:0::.:. ./.:0,0:0:0::.:. ./.:0,0:0:0::. ...
^
offending line: 1 9009451 rs2274329 G C 1.20482e+06 PASS BaseQRankSum=-0.113...
see the Hail log for the full offending line
...
is.hail.utils.HailException: merged2.vcf.gz:column 1423: empty integer field
... GT:AD:DP:GQ:PGT:PID:PL ./.:0,0:0:0::.:. ./.:0,0:0:0::.:. ./.:0,0:0:0::. ...
^
offending line: 1 9009451 rs2274329 G C 1.20482e+06 PASS BaseQRankSum=-0.113...
see the Hail log for the full offending line
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:20)
...
Hail version: 0.2.39-c1106386d669
Error summary: VCFParseError: empty integer field
Would you have any advice here please? Many thanks in advance, Alan.
It looks like your VCF has a genotype FORMAT entry that’s empty: the '::' above the caret. This is the PGT field, based on the FORMAT descriptor.
This is a malformed (invalid) VCF, but you might be able to import it by using the find_replace option on import to rewrite '::' to ':.:':
mt = hl.import_vcf(...,
                   find_replace=('::', ':.:'))
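For context, a sketch of how that might slot into the pipeline from the post above (the S3 path is elided there, so the placeholder is kept):

vds = hl.import_vcf('s3://.../merged2.vcf.gz',
                    force_bgz=True,
                    find_replace=('::', ':.:'))   # patch the empty PGT entries while parsing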
Thanks Tim. In the end, I didn’t get VEP running in AWS. I will try with my data in GCP to see whether this problem persists.