Hi there! I’m attempting to use Hail on DNAnexus to work with UK Biobank WES data. Like other users, I’ve been having issues with speed and, more generally, with getting basic commands to execute properly.
In particular, I read your suggestion to write imported VCFs to a matrix table and then read them back in to improve performance. I followed the DNAnexus tutorial for this, first filtering down to a single locus so that I’d be working with the smallest possible dataset to start. However, the write fails with an assertion error (code and full traceback below). I haven’t been able to use a plain mt.write() without the DNAX steps either: I get a similar error, or the program simply hangs. I’m not sure what a good solution would be, and I’d welcome any advice on working with the UKB pVCFs in Hail! (I also tried to work with the BGEN files instead, but received the “Hail only supports zlib compression” error that nbaya also encountered.)
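For reference, my understanding of the suggested write-then-read pattern is roughly the following (the paths here are just illustrative; my actual code is further down):

import hail as hl
# pay the VCF-parsing cost once by materializing to Hail's native format...
mt = hl.import_vcf("file:///path/to/chunk.vcf.gz", force_bgz=True, reference_genome="GRCh38")
mt.write("dnax://<db-id>/chunk.mt")
# ...then read the native MatrixTable back for fast downstream queries
mt = hl.read_matrix_table("dnax://<db-id>/chunk.mt")

Please correct me if I’ve misunderstood the intent of that suggestion.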
While at the moment I primarily want to filter the WES data to identify individuals with variants at certain loci (a task I know I could likely do elsewhere), my team hopes to move more of our modeling work into Hail, and ideally we’d love to run our full data-processing pipeline in Hail as well, for simplicity. Thanks so much for any advice you can provide!
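In case it helps to see the end goal, here is a rough sketch of the kind of query we’re hoping to run eventually (the interval below is illustrative, not our real target):

# sketch of the end goal: find samples carrying a non-ref call in a region of interest
region = hl.parse_locus_interval('chr19:11100000-11120000', reference_genome='GRCh38')
mt_region = hl.filter_intervals(mt, [region])
# keep samples with at least one non-reference genotype in the region
carriers = mt_region.filter_cols(hl.agg.any(mt_region.GT.is_non_ref()))
carrier_ids = carriers.s.collect()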
Code used to import the pVCF, filter it to one locus, and write the result to a MatrixTable:
from pyspark.sql import SparkSession
import hail as hl
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)
# Define variables used in import
chrom19_17_file = "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c19_b17_v1.vcf.gz"
mt = hl.import_vcf(chrom19_17_file,
                   force_bgz=True,
                   reference_genome="GRCh38",
                   array_elements_required=False)
# filter to a single locus
ldlr_11116926 = mt.filter_rows(mt.locus == hl.Locus.parse('19:11116926', reference_genome='GRCh38'))
# set up the database and write, following the DNAnexus instructions
db_name = "ldlr_db"
mt_name = "chr19_11116926.mt"
stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()
import dxpy
db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
url = f"dnax://{db_uri}/{mt_name}"
ldlr_11116926.write(url)
Write error (occurs on the last line of code above):
2022-09-21 14:24:02 Hail: INFO: Coerced sorted dataset
2022-09-21 14:47:33 Hail: INFO: Coerced sorted dataset
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
<ipython-input-20-b995cd4ec429> in <module>
10 # it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
11 # See https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.write for additional documentation.
---> 12 ldlr_11116926.write(url) # Note: output should describe size of MT (i.e. number of rows, columns, partitions)
<decorator-gen-1275> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
575 def wrapper(__original_func, *args, **kwargs):
576 args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577 return __original_func(*args_, **kwargs_)
578
579 return wrapper
/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
2542
2543 writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type, _checkpoint_file)
-> 2544 Env.backend().execute(ir.MatrixWrite(self._mir, writer))
2545
2546 class _Show:
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
108 raise HailUserError(message_and_trace) from None
109
--> 110 raise e
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
84 # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
85 try:
---> 86 result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
87 (result, timings) = (result_tuple._1(), result_tuple._2())
88 value = ir.typ._from_encoding(result)
/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
29 raise FatalError('%s\n\nJava stack trace:\n%s\n'
30 'Hail version: %s\n'
---> 31 'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
32 except pyspark.sql.utils.CapturedException as e:
33 raise FatalError('%s\n\nJava stack trace:\n%s\n'
FatalError: AssertionError: assertion failed
Java stack trace:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at is.hail.variant.ReferenceGenome$.compare(ReferenceGenome.scala:687)
at is.hail.variant.ReferenceGenome$.compare(ReferenceGenome.scala:692)
at is.hail.variant.ReferenceGenome$$anon$1.compare(ReferenceGenome.scala:112)
at is.hail.variant.ReferenceGenome$$anon$1.compare(ReferenceGenome.scala:111)
at is.hail.annotations.ExtendedOrdering$$anon$3.compareNonnull(ExtendedOrdering.scala:11)
at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
at is.hail.rvd.PartitionBoundOrdering$$anon$2.compareNonnull(PartitionBoundOrdering.scala:57)
at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
at is.hail.rvd.PartitionBoundOrdering$$anon$2$$anon$1.compareIntervalEndpoints(PartitionBoundOrdering.scala:136)
at is.hail.annotations.IntervalEndpointOrdering.compareNonnull(ExtendedOrdering.scala:431)
at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
at is.hail.utils.Interval.isBelow(Interval.scala:109)
at is.hail.rvd.RVDPartitioner$$anonfun$4.apply(RVDPartitioner.scala:57)
at is.hail.rvd.RVDPartitioner$$anonfun$4.apply(RVDPartitioner.scala:57)
at is.hail.utils.richUtils.RichIndexedSeq$.is$hail$utils$richUtils$RichIndexedSeq$$runSearch$extension(RichIndexedSeq.scala:147)
at is.hail.utils.richUtils.RichIndexedSeq$.containsOrdered$extension0(RichIndexedSeq.scala:88)
at is.hail.utils.richUtils.RichIndexedSeq$.containsOrdered$extension2(RichIndexedSeq.scala:94)
at is.hail.rvd.RVDPartitioner.overlaps(RVDPartitioner.scala:283)
at is.hail.rvd.RVD$$anonfun$3.apply$mcZI$sp(RVD.scala:627)
at is.hail.rvd.RVD$$anonfun$3.apply(RVD.scala:627)
at is.hail.rvd.RVD$$anonfun$3.apply(RVD.scala:627)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
at is.hail.rvd.RVD.filterToIntervals(RVD.scala:628)
at is.hail.rvd.RVD.filterIntervals(RVD.scala:587)
at is.hail.expr.ir.TableFilterIntervals.execute(TableIR.scala:2898)
at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:57)
at is.hail.expr.ir.Interpret$.run(Interpret.scala:852)
at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:57)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.is$hail$expr$ir$LowerOrInterpretNonCompilable$$rewrite$1(LowerOrInterpretNonCompilable.scala:67)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:16)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:16)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:16)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:14)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:14)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:15)
at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:13)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:381)
at is.hail.backend.spark.SparkBackend$$anonfun$8$$anonfun$apply$4.apply(SparkBackend.scala:417)
at is.hail.backend.spark.SparkBackend$$anonfun$8$$anonfun$apply$4.apply(SparkBackend.scala:414)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend$$anonfun$8.apply(SparkBackend.scala:414)
at is.hail.backend.spark.SparkBackend$$anonfun$8.apply(SparkBackend.scala:413)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:413)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.78-b17627756568
Error summary: AssertionError: assertion failed
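One follow-up question, in case it’s relevant: would hl.filter_intervals be preferable to the filter_rows call above for this kind of single-locus filter? I haven’t gotten far enough to test it, but I had something like this in mind:

# untested alternative to the filter_rows call above
interval = hl.parse_locus_interval('chr19:11116926-11116927', reference_genome='GRCh38')
ldlr_11116926 = hl.filter_intervals(mt, [interval])

Happy to provide full logs or try any other debugging steps. Thanks again!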