Issues writing matrix table from filtered pVCF (UK Biobank data)

Hi there! I’m attempting to use Hail on DNAnexus to work with UK Biobank WES data. Like other users, I’ve been having issues with speed and, more generally, with getting basic commands to execute properly.

In particular, I read your suggestion to write imported VCFs to a matrix table and then read them back in to improve performance. I followed the DNAnexus tutorial for this, first filtering down to a single locus so I could start with the smallest possible dataset. However, I ended up with an assertion error (code and full traceback below). I haven’t been able to use a plain mt.write() without the DNAX steps either; I either get a similar error or the program simply hangs. I’m not sure what a good solution would be, and I’d welcome any advice on working with the UKB pVCFs in Hail! (I also tried working with the BGEN files instead, but hit the “Hail only supports zlib compression” error that nbaya also encountered.)
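
For reference, my understanding of the suggested pattern is roughly the following (a minimal sketch with hypothetical local paths, not my actual code, which is further down):

import hail as hl

# Hypothetical input/output paths, just to illustrate the pattern:
mt = hl.import_vcf('file:///path/to/data.vcf.gz',
                   force_bgz=True,
                   reference_genome='GRCh38')
mt.write('output.mt')                    # materialize the import once
mt = hl.read_matrix_table('output.mt')   # subsequent reads are fast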

While at the moment I primarily want to filter the WES data to identify individuals with variants in certain loci (a task that I know I could likely do elsewhere), my team hopes to end up doing more of our modeling work on Hail, and ideally we’d love to run our full data processing pipeline with Hail for simplicity as well. Thanks so much for any advice you can provide!

Code used to import the pVCF, filter it, and write the result to a matrix table:

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

# Define variables used in import
chrom19_17_file = "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c19_b17_v1.vcf.gz"

mt = hl.import_vcf(chrom19_17_file, 
                   force_bgz=True, 
                   reference_genome="GRCh38", 
                   array_elements_required=False)

# Filter to a single locus
ldlr_11116926 = mt.filter_rows(mt.locus == hl.Locus.parse('19:11116926', reference_genome='GRCh38'))

# Set up the database and write to it, following the DNAnexus instructions
db_name = "ldlr_db"
mt_name = "chr19_11116926.mt"

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()

import dxpy

db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
url = f"dnax://{db_uri}/{mt_name}" 

ldlr_11116926.write(url) 

Write error (occurs on the last line of code above):

2022-09-21 14:24:02 Hail: INFO: Coerced sorted dataset
2022-09-21 14:47:33 Hail: INFO: Coerced sorted dataset
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-20-b995cd4ec429> in <module>
     10 # it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
     11 # See https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.write for additional documentation.
---> 12 ldlr_11116926.write(url) # Note: output should describe size of MT (i.e. number of rows, columns, partitions)

<decorator-gen-1275> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
   2542 
   2543         writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type, _checkpoint_file)
-> 2544         Env.backend().execute(ir.MatrixWrite(self._mir, writer))
   2545 
   2546     class _Show:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
    108                 raise HailUserError(message_and_trace) from None
    109 
--> 110             raise e

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     84         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     85         try:
---> 86             result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
     87             (result, timings) = (result_tuple._1(), result_tuple._2())
     88             value = ir.typ._from_encoding(result)

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                                  'Hail version: %s\n'
---> 31                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: AssertionError: assertion failed

Java stack trace:
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)
	at is.hail.variant.ReferenceGenome$.compare(ReferenceGenome.scala:687)
	at is.hail.variant.ReferenceGenome$.compare(ReferenceGenome.scala:692)
	at is.hail.variant.ReferenceGenome$$anon$1.compare(ReferenceGenome.scala:112)
	at is.hail.variant.ReferenceGenome$$anon$1.compare(ReferenceGenome.scala:111)
	at is.hail.annotations.ExtendedOrdering$$anon$3.compareNonnull(ExtendedOrdering.scala:11)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.rvd.PartitionBoundOrdering$$anon$2.compareNonnull(PartitionBoundOrdering.scala:57)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.rvd.PartitionBoundOrdering$$anon$2$$anon$1.compareIntervalEndpoints(PartitionBoundOrdering.scala:136)
	at is.hail.annotations.IntervalEndpointOrdering.compareNonnull(ExtendedOrdering.scala:431)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.utils.Interval.isBelow(Interval.scala:109)
	at is.hail.rvd.RVDPartitioner$$anonfun$4.apply(RVDPartitioner.scala:57)
	at is.hail.rvd.RVDPartitioner$$anonfun$4.apply(RVDPartitioner.scala:57)
	at is.hail.utils.richUtils.RichIndexedSeq$.is$hail$utils$richUtils$RichIndexedSeq$$runSearch$extension(RichIndexedSeq.scala:147)
	at is.hail.utils.richUtils.RichIndexedSeq$.containsOrdered$extension0(RichIndexedSeq.scala:88)
	at is.hail.utils.richUtils.RichIndexedSeq$.containsOrdered$extension2(RichIndexedSeq.scala:94)
	at is.hail.rvd.RVDPartitioner.overlaps(RVDPartitioner.scala:283)
	at is.hail.rvd.RVD$$anonfun$3.apply$mcZI$sp(RVD.scala:627)
	at is.hail.rvd.RVD$$anonfun$3.apply(RVD.scala:627)
	at is.hail.rvd.RVD$$anonfun$3.apply(RVD.scala:627)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
	at is.hail.rvd.RVD.filterToIntervals(RVD.scala:628)
	at is.hail.rvd.RVD.filterIntervals(RVD.scala:587)
	at is.hail.expr.ir.TableFilterIntervals.execute(TableIR.scala:2898)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:57)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:852)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:57)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.is$hail$expr$ir$LowerOrInterpretNonCompilable$$rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:16)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:16)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:14)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:381)
	at is.hail.backend.spark.SparkBackend$$anonfun$8$$anonfun$apply$4.apply(SparkBackend.scala:417)
	at is.hail.backend.spark.SparkBackend$$anonfun$8$$anonfun$apply$4.apply(SparkBackend.scala:414)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend$$anonfun$8.apply(SparkBackend.scala:414)
	at is.hail.backend.spark.SparkBackend$$anonfun$8.apply(SparkBackend.scala:413)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:413)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.78-b17627756568
Error summary: AssertionError: assertion failed

We need an audit to see how this propagated to an assertion error…

The core problem is that ‘19’ isn’t a valid contig in GRCh38; you need ‘chr19’.

I think if you use hl.parse_locus instead of hl.Locus.parse, you’ll get a nicer user error with the same semantics.
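
For reference, the corrected filter might look something like this (a sketch assuming the mt from the code above; I haven’t run it against this dataset):

# GRCh38 contig names carry the 'chr' prefix, so parse 'chr19:11116926'.
# hl.parse_locus should raise a clear user error on an invalid contig.
ldlr_11116926 = mt.filter_rows(
    mt.locus == hl.parse_locus('chr19:11116926', reference_genome='GRCh38')
)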