Issues writing matrix table from filtered pVCF (UK Biobank data)

Hi there! I’m attempting to use Hail on DNAnexus to work with UK Biobank WES data. Like other users, I’ve been having issues with speed and, more generally, with getting basic commands to execute properly.

In particular, I read your suggestion to write imported VCFs to a matrix table and then read them back in to improve performance. I followed the DNAnexus tutorial for this, first filtering down to a single locus so I could start with the smallest possible dataset. However, I ended up with an assertion error (code and full traceback below). I haven’t been able to use a plain mt.write() without the DNAX steps either; I either get a similar error or the program simply hangs. I’m not sure what a good solution would be, and I’d welcome any advice on working with the UKB pVCFs in Hail! (I also tried working with the BGEN files instead, but hit the “Hail only supports zlib compression” error that nbaya also encountered.)
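
For reference, my understanding of the suggested pattern is roughly the following (a minimal sketch with hypothetical local paths, not my actual code, which is further down):

import hail as hl

# Hypothetical input/output paths, just to illustrate the pattern:
mt = hl.import_vcf('file:///path/to/data.vcf.gz',
                   force_bgz=True,
                   reference_genome='GRCh38')
mt.write('output.mt')                    # materialize the import once
mt = hl.read_matrix_table('output.mt')   # subsequent reads are fast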

While at the moment I primarily want to filter the WES data to identify individuals with variants in certain loci (a task that I know I could likely do elsewhere), my team hopes to end up doing more of our modeling work on Hail, and ideally we’d love to run our full data processing pipeline with Hail for simplicity as well. Thanks so much for any advice you can provide!

Code used to import the pVCF, filter it, and write the result to a matrix table:

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

# Define variables used in import
chrom19_17_file = "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c19_b17_v1.vcf.gz"

mt = hl.import_vcf(chrom19_17_file, 
                   force_bgz=True, 
                   reference_genome="GRCh38", 
                   array_elements_required=False)

# Filter to a single locus
ldlr_11116926 = mt.filter_rows(mt.locus == hl.Locus.parse('19:11116926', reference_genome='GRCh38'))

# Set up the database and write to it, following the DNAnexus instructions
db_name = "ldlr_db"
mt_name = "chr19_11116926.mt"

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()

import dxpy

db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
url = f"dnax://{db_uri}/{mt_name}" 

ldlr_11116926.write(url) 

Write error (occurs on the last line of code above):

2022-09-21 14:24:02 Hail: INFO: Coerced sorted dataset
2022-09-21 14:47:33 Hail: INFO: Coerced sorted dataset
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-20-b995cd4ec429> in <module>
     10 # it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
     11 # See https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.write for additional documentation.
---> 12 ldlr_11116926.write(url) # Note: output should describe size of MT (i.e. number of rows, columns, partitions)

<decorator-gen-1275> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
   2542 
   2543         writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type, _checkpoint_file)
-> 2544         Env.backend().execute(ir.MatrixWrite(self._mir, writer))
   2545 
   2546     class _Show:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
    108                 raise HailUserError(message_and_trace) from None
    109 
--> 110             raise e

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     84         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     85         try:
---> 86             result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
     87             (result, timings) = (result_tuple._1(), result_tuple._2())
     88             value = ir.typ._from_encoding(result)

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                                  'Hail version: %s\n'
---> 31                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: AssertionError: assertion failed

Java stack trace:
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)
	at is.hail.variant.ReferenceGenome$.compare(ReferenceGenome.scala:687)
	at is.hail.variant.ReferenceGenome$.compare(ReferenceGenome.scala:692)
	at is.hail.variant.ReferenceGenome$$anon$1.compare(ReferenceGenome.scala:112)
	at is.hail.variant.ReferenceGenome$$anon$1.compare(ReferenceGenome.scala:111)
	at is.hail.annotations.ExtendedOrdering$$anon$3.compareNonnull(ExtendedOrdering.scala:11)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.rvd.PartitionBoundOrdering$$anon$2.compareNonnull(PartitionBoundOrdering.scala:57)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.rvd.PartitionBoundOrdering$$anon$2$$anon$1.compareIntervalEndpoints(PartitionBoundOrdering.scala:136)
	at is.hail.annotations.IntervalEndpointOrdering.compareNonnull(ExtendedOrdering.scala:431)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.utils.Interval.isBelow(Interval.scala:109)
	at is.hail.rvd.RVDPartitioner$$anonfun$4.apply(RVDPartitioner.scala:57)
	at is.hail.rvd.RVDPartitioner$$anonfun$4.apply(RVDPartitioner.scala:57)
	at is.hail.utils.richUtils.RichIndexedSeq$.is$hail$utils$richUtils$RichIndexedSeq$$runSearch$extension(RichIndexedSeq.scala:147)
	at is.hail.utils.richUtils.RichIndexedSeq$.containsOrdered$extension0(RichIndexedSeq.scala:88)
	at is.hail.utils.richUtils.RichIndexedSeq$.containsOrdered$extension2(RichIndexedSeq.scala:94)
	at is.hail.rvd.RVDPartitioner.overlaps(RVDPartitioner.scala:283)
	at is.hail.rvd.RVD$$anonfun$3.apply$mcZI$sp(RVD.scala:627)
	at is.hail.rvd.RVD$$anonfun$3.apply(RVD.scala:627)
	at is.hail.rvd.RVD$$anonfun$3.apply(RVD.scala:627)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
	at is.hail.rvd.RVD.filterToIntervals(RVD.scala:628)
	at is.hail.rvd.RVD.filterIntervals(RVD.scala:587)
	at is.hail.expr.ir.TableFilterIntervals.execute(TableIR.scala:2898)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:57)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:852)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:57)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.is$hail$expr$ir$LowerOrInterpretNonCompilable$$rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:16)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:16)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:14)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:381)
	at is.hail.backend.spark.SparkBackend$$anonfun$8$$anonfun$apply$4.apply(SparkBackend.scala:417)
	at is.hail.backend.spark.SparkBackend$$anonfun$8$$anonfun$apply$4.apply(SparkBackend.scala:414)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend$$anonfun$8.apply(SparkBackend.scala:414)
	at is.hail.backend.spark.SparkBackend$$anonfun$8.apply(SparkBackend.scala:413)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:413)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.78-b17627756568
Error summary: AssertionError: assertion failed

We need an audit to see how this propagated to an assertion error…

The core problem is that ‘19’ isn’t a valid contig in GRCh38; you need ‘chr19’.

I think if you use hl.parse_locus instead of hl.Locus.parse, you’ll get a nicer user error with the same semantics.
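
For reference, the corrected filter might look something like this (a sketch assuming the mt from the code above; I haven’t run it against this dataset):

# GRCh38 contig names carry the 'chr' prefix, so parse 'chr19:11116926'.
# hl.parse_locus should raise a clear user error on an invalid contig.
ldlr_11116926 = mt.filter_rows(
    mt.locus == hl.parse_locus('chr19:11116926', reference_genome='GRCh38')
)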