Joint genotype example VCF file

Looking for a joint-genotyped VCF file to test Hail with. If I understand correctly, it needs to have been called with GATK against GRCh37.

I keep running into "invalid character '.' in integer literal" errors with 1000G and other example files.

Thanks for the help =)

If you run

import hail as hl
hl.utils.get_1kg('/path/to/dir')

you’ll download a downsampled chunk of 1KG that we use for tutorials.
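
For reference, a minimal sketch of the full workflow (the data/ path and the 1kg.mt name follow the GWAS tutorial's conventions):

import hail as hl

hl.init()
# downloads a downsampled 1000 Genomes MatrixTable plus annotation files into data/
hl.utils.get_1kg('data/')

# the tutorial then reads the downloaded MatrixTable back in
mt = hl.read_matrix_table('data/1kg.mt')
mt.describe()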

Could you paste a specific error and stack trace, though? I'd like to see it. Also note that Hail does NOT require GRCh37!

Thanks Tim!

mt = hl.import_vcf('NA12878.compound_heterozygous.vcf').write('NA12878.compound_heterozygous.mt', overwrite=True)

2019-01-02 19:38:26 Hail: INFO: Ordering unsorted dataset with network shuffle
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-28-376ae1400f2a> in <module>
----> 1 mt = hl.import_vcf('NA12878.compound_heterozygous.vcf').write('NA12878.compound_heterozygous.mt', overwrite=True)

<decorator-gen-920> in write(self, output, overwrite, stage_locally, _codec_spec)

~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    558     def wrapper(__original_func, *args, **kwargs):
    559         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 560         return __original_func(*args_, **kwargs_)
    561 
    562     return wrapper

~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec)
   2154 
   2155         writer = MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec)
-> 2156         Env.hc()._backend.interpret(MatrixWrite(self._mir, writer))
   2157 
   2158     def globals_table(self) -> Table:

~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/backend/backend.py in interpret(self, ir)
     23 
     24         typ = dtype(jir.typ().toString())
---> 25         result = Env.hail().expr.ir.Interpret.interpretPyIR(code, {}, ir_map)
     26 
     27         return typ._from_json(result)

~/anaconda3/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

~/anaconda3/envs/hail/lib/python3.7/site-packages/hail/utils/java.py in deco(*args, **kwargs)
    208             raise FatalError('%s\n\nJava stack trace:\n%s\n'
    209                              'Hail version: %s\n'
--> 210                              'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
    211         except pyspark.sql.utils.CapturedException as e:
    212             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: VCFParseError: invalid character '.' in integer literal

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 40.0 failed 1 times, most recent failure: Lost task 0.0 in stage 40.0 (TID 120, localhost, executor driver): is.hail.utils.HailException: NA12878.compound_heterozygous.vcf:column 187: invalid character '.' in integer literal
... _ALIGN=FP GT:AD:DP:GQ:PL 1/1:0,5:3:9.03:117,9,0
                                        ^
offending line: 1	877831	rs6672356	T	C	83.49	PASS	DB;AC=2;AF=1.00;AN=2;DP=5;...
see the Hail log for the full offending line
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:20)
	at is.hail.utils.package$.fatal(package.scala:26)
	at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:834)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: is.hail.io.vcf.VCFParseError: invalid character '.' in integer literal
	at is.hail.io.vcf.VCFLine.parseError(LoadVCF.scala:44)
	at is.hail.io.vcf.VCFLine.numericValue(LoadVCF.scala:48)
	at is.hail.io.vcf.VCFLine.parseFormatInt(LoadVCF.scala:339)
	at is.hail.io.vcf.VCFLine.parseAddFormatInt(LoadVCF.scala:350)
	at is.hail.io.vcf.FormatParser.parseAddField(LoadVCF.scala:555)
	at is.hail.io.vcf.FormatParser.parse(LoadVCF.scala:598)
	at is.hail.io.vcf.LoadVCF$.parseLine(LoadVCF.scala:897)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$15.apply(LoadVCF.scala:1109)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$15.apply(LoadVCF.scala:1109)
	at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:812)
	... 13 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1533)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1521)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1520)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1520)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1748)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1703)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1692)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:153)
	at is.hail.io.RichContextRDDRegionValue$.writeRowsSplit$extension(RowStore.scala:1424)
	at is.hail.rvd.RVD.writeRowsSplit(RVD.scala:692)
	at is.hail.expr.ir.MatrixValue.write(MatrixValue.scala:112)
	at is.hail.expr.ir.MatrixNativeWriter.apply(MatrixWriter.scala:28)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:730)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:107)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:77)
	at is.hail.expr.ir.Interpret$.interpretPyIR(Interpret.scala:31)
	at is.hail.expr.ir.Interpret$.interpretPyIR(Interpret.scala:25)
	at is.hail.expr.ir.Interpret.interpretPyIR(Interpret.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

is.hail.utils.HailException: NA12878.compound_heterozygous.vcf:column 187: invalid character '.' in integer literal
... _ALIGN=FP GT:AD:DP:GQ:PL 1/1:0,5:3:9.03:117,9,0
                                        ^
offending line: 1	877831	rs6672356	T	C	83.49	PASS	DB;AC=2;AF=1.00;AN=2;DP=5;...
see the Hail log for the full offending line
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:20)
	at is.hail.utils.package$.fatal(package.scala:26)
	at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:834)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

is.hail.io.vcf.VCFParseError: invalid character '.' in integer literal
	at is.hail.io.vcf.VCFLine.parseError(LoadVCF.scala:44)
	at is.hail.io.vcf.VCFLine.numericValue(LoadVCF.scala:48)
	at is.hail.io.vcf.VCFLine.parseFormatInt(LoadVCF.scala:339)
	at is.hail.io.vcf.VCFLine.parseAddFormatInt(LoadVCF.scala:350)
	at is.hail.io.vcf.FormatParser.parseAddField(LoadVCF.scala:555)
	at is.hail.io.vcf.FormatParser.parse(LoadVCF.scala:598)
	at is.hail.io.vcf.LoadVCF$.parseLine(LoadVCF.scala:897)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$15.apply(LoadVCF.scala:1109)
	at is.hail.io.vcf.MatrixVCFReader$$anonfun$15.apply(LoadVCF.scala:1109)
	at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:812)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)





Hail version: 0.2.6-13183a31af2c
Error summary: VCFParseError: invalid character '.' in integer literal

Ah, the problem is that the GQ field is a float instead of an integer – this is a violation of the VCF 4.2 spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf

At some point we’ll relax our parser to use the type declared in the header, but we use htsjdk to parse VCF headers for now, and it uses the known types for fields in the spec.
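
One rough workaround sketch (not a Hail feature, and the file names below are placeholders): truncate the float GQ values to integers before importing, assuming GQ appears in the FORMAT column as in the offending GT:AD:DP:GQ:PL line above.

def fix_gq(line):
    # pass header lines through untouched
    if line.startswith('#'):
        return line
    fields = line.rstrip('\n').split('\t')
    fmt = fields[8].split(':')
    if 'GQ' not in fmt:
        return line
    gq = fmt.index('GQ')
    # truncate float GQ values (e.g. 9.03 -> 9) in every sample column
    for i in range(9, len(fields)):
        parts = fields[i].split(':')
        if gq < len(parts) and parts[gq] not in ('', '.') and '.' in parts[gq]:
            parts[gq] = str(int(float(parts[gq])))
            fields[i] = ':'.join(parts)
    return '\t'.join(fields) + '\n'

with open('NA12878.compound_heterozygous.vcf') as src, \
     open('NA12878.fixed.vcf', 'w') as dst:
    for line in src:
        dst.write(fix_gq(line))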

And here is what triggered the thought about GRCh37. I was trying different dummy VCFs to get it to work.

mt = hl.import_vcf('1000G_1MB_exons.genotypes.vcf').write('1000G_1MB_exons.genotypes.vcf', overwrite=True)

FatalError: HailException: Invalid locus `3:198723451' found. Position `198723451' is not within the range [1-198022430] for reference genome `GRCh37'.

As a takeaway from this, I would suggest making the markdown here a bit more explicit: "We have built-in dummy data for you."

https://hail.is/docs/0.2/tutorials/01-genome-wide-association-study.html#Check-for-tutorial-data-or-download-if-necessary

The reference genome is a parameter of import_vcf; see the docs here.

These errors on import are much better than silently importing incorrect data! You can pass reference_genome=None to import the contig and position as plain str and int (chrom:pos) instead of a reference-genome-parameterized Locus type.
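
For example (the GRCh38 file name below is just a placeholder):

import hail as hl

# validate loci against a specific reference genome at import time
mt38 = hl.import_vcf('my_grch38_calls.vcf', reference_genome='GRCh38')

# or skip reference validation and key rows by plain (str, int) contig and position
mt = hl.import_vcf('1000G_1MB_exons.genotypes.vcf', reference_genome=None)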

I'll update the tutorial to make clear that this downloads public data which we host.

I also added an issue to add an example of using a reference genome with import_vcf. https://github.com/hail-is/hail/issues/5086

So, a year has passed and, apparently, I'm facing a similar problem that I'm failing to understand how to address.

I'm trying to process my data with:

vds = hl.import_vcf('s3://.../merged2.vcf.gz',force_bgz=True)

v=hl.vep(vds,"s3://.../vep-configuration.json")

v.write('s3://.../merged2.ht', stage_locally=True, overwrite=True)

But it fails with:

FatalError: VCFParseError: empty integer field

Java stack trace:
java.lang.RuntimeException: error while applying lowering 'InterpretNonCompilable'
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:26)
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 4.0 failed 4 times, most recent failure: Lost task 9.3 in stage 4.0 (TID 2494, ip-172-31-44-70.eu-west-2.compute.internal, executor 3): is.hail.utils.HailException: merged2.vcf.gz:column 1423: empty integer field
...  GT:AD:DP:GQ:PGT:PID:PL ./.:0,0:0:0::.:. ./.:0,0:0:0::.:. ./.:0,0:0:0::. ...
                                        ^
offending line: 1	9009451	rs2274329	G	C	1.20482e+06	PASS	BaseQRankSum=-0.113...
see the Hail log for the full offending line
...
is.hail.utils.HailException: merged2.vcf.gz:column 1423: empty integer field
...  GT:AD:DP:GQ:PGT:PID:PL ./.:0,0:0:0::.:. ./.:0,0:0:0::.:. ./.:0,0:0:0::. ...
                                        ^
offending line: 1	9009451	rs2274329	G	C	1.20482e+06	PASS	BaseQRankSum=-0.113...
see the Hail log for the full offending line
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:20)
...
Hail version: 0.2.39-c1106386d669
Error summary: VCFParseError: empty integer field

Would you have any advice here please? Many thanks in advance, Alan.

It looks like your VCF has a genotype FORMAT entry that's empty: the :: above the caret. Based on the FORMAT descriptor, this is the PGT field.

This is a malformed (invalid) VCF, but you might be able to import it by using the find_replace option on import to rewrite '::' to ':.:':

mt = hl.import_vcf(...,
    # find_replace applies a regex substitution to each line before parsing
    find_replace=('::', ':.:'))

Thanks Tim. In the end, I didn't get VEP running in AWS. I will try with my data in GCP to see whether this problem persists there.