UK Biobank Research Analysis Platform (RAP) MatrixTable Write Issues

Editor’s Note:

The Hail team does not recommend the solution posted here; please read the entire thread for details and possible alternatives.

~ @danking


This may be an issue with the UKB RAP, but I cannot tell. In the simplest case, I am just trying to read and write a MatrixTable as follows:

import pyspark
import dxpy
import subprocess
import hail as hl

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

hl.init(sc=sc, default_reference='GRCh38')

vcf_file = ['file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format/ukb23156_c19_b46_v1.vcf.gz',]

region = [
    hl.parse_locus_interval(
        "[chr19:45668221-chr19:45683722]"
    )
]

mts = hl.import_gvcfs(
    vcf_file,
    partitions=region,
    reference_genome="GRCh38",
    array_elements_required=False,
)

mt = mts[0]

print(mt.count())

mt.write("file:/opt/notebooks/GIPR.mt")

subprocess.run(["dx", "upload", "/opt/notebooks/GIPR.mt", "-r", "--path", "/"], check=True, shell=False)

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-10-60-47-217.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /opt/notebooks/gogoGPCR/notebooks/hail-20210920-1148-0.2.61-3c86d3ba497a.log
(828, 200643)
2021-09-20 11:56:29 Hail: INFO: wrote matrix table with 828 rows and 200643 columns in 1 partition to file:/opt/notebooks/GIPR.mt
    Total size: 452.53 MiB
    * Rows/entries: 451.40 MiB
    * Columns: 1.13 MiB
    * Globals: 11.00 B
    * Smallest partition: 828 rows (451.40 MiB)
    * Largest partition:  828 rows (451.40 MiB)
CompletedProcess(args=['dx', 'upload', '/opt/notebooks/GIPR.mt', '-r', '--path', '/'], returncode=0)

I have attached the log for writing and reading the MatrixTable from the same environment. Trying to read the same MatrixTable in a different environment gives:

dxpy.download_folder("project-xxx", "/opt/notebooks/GIPR.mt", "/GIPR.mt")
mt = hl.read_matrix_table("file:/opt/notebooks/GIPR.mt")
mt.show()
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

/opt/conda/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

/opt/conda/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in __repr__(self)
   2541 
   2542         def __repr__(self):
-> 2543             return self.__str__()
   2544 
   2545         def _repr_html_(self):

/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in __str__(self)
   2535 
   2536         def __str__(self):
-> 2537             s = self.table_show.__str__()
   2538             if self.displayed_n_cols != self.actual_n_cols:
   2539                 s += f"showing the first { self.displayed_n_cols } of { self.actual_n_cols } columns"

/opt/conda/lib/python3.6/site-packages/hail/table.py in __str__(self)
   1292 
   1293         def __str__(self):
-> 1294             return self._ascii_str()
   1295 
   1296         def __repr__(self):

/opt/conda/lib/python3.6/site-packages/hail/table.py in _ascii_str(self)
   1318                 return s
   1319 
-> 1320             rows, has_more, dtype = self.data()
   1321             fields = list(dtype)
   1322             trunc_fields = [trunc(f) for f in fields]

/opt/conda/lib/python3.6/site-packages/hail/table.py in data(self)
   1302                 row_dtype = t.row.dtype
   1303                 t = t.select(**{k: hl._showstr(v) for (k, v) in t.row.items()})
-> 1304                 rows, has_more = t._take_n(self.n)
   1305                 self._data = (rows, has_more, row_dtype)
   1306             return self._data

/opt/conda/lib/python3.6/site-packages/hail/table.py in _take_n(self, n)
   1449             has_more = False
   1450         else:
-> 1451             rows = self.take(n + 1)
   1452             has_more = len(rows) > n
   1453             rows = rows[:n]

<decorator-gen-1119> in take(self, n, _localize)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/table.py in take(self, n, _localize)
   2119         """
   2120 
-> 2121         return self.head(n).collect(_localize)
   2122 
   2123     @typecheck_method(n=int)

<decorator-gen-1113> in collect(self, _localize)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/table.py in collect(self, _localize)
   1918         e = construct_expr(rows_ir, hl.tarray(t.row.dtype))
   1919         if _localize:
-> 1920             return Env.backend().execute(e._ir)
   1921         else:
   1922             return e

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     96                 raise HailUserError(message_and_trace) from None
     97 
---> 98             raise e

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     72         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     73         try:
---> 74             result = json.loads(self._jhc.backend().executeJSON(jir))
     75             value = ir.typ._from_json(result['value'])
     76             timings = result['timings']

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     30                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     31                                  'Hail version: %s\n'
---> 32                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     33         except pyspark.sql.utils.CapturedException as e:
     34             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-60-2-89.eu-west-2.compute.internal, executor 0): java.io.FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
	at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:83)
	at is.hail.io.fs.FS$class.open(FS.scala:139)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.io.fs.FS$class.open(FS.scala:148)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.HailContext$$anon$1.compute(HailContext.scala:276)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2001)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1984)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1983)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1983)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1033)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1033)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1033)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2223)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2172)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2161)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at is.hail.sparkextras.ContextRDD.runJob(ContextRDD.scala:351)
	at is.hail.rvd.RVD$$anonfun$13.apply(RVD.scala:526)
	at is.hail.rvd.RVD$$anonfun$13.apply(RVD.scala:526)
	at is.hail.utils.PartitionCounts$.incrementalPCSubsetOffset(PartitionCounts.scala:73)
	at is.hail.rvd.RVD.head(RVD.scala:525)
	at is.hail.expr.ir.TableSubset$class.execute(TableIR.scala:1326)
	at is.hail.expr.ir.TableHead.execute(TableIR.scala:1332)
	at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:1845)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:819)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java.io.FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
	at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:83)
	at is.hail.io.fs.FS$class.open(FS.scala:139)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.io.fs.FS$class.open(FS.scala:148)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.HailContext$$anon$1.compute(HailContext.scala:276)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)




Hail version: 0.2.61-3c86d3ba497a
Error summary: FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:

/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in _repr_html_(self)
   2544 
   2545         def _repr_html_(self):
-> 2546             s = self.table_show._repr_html_()
   2547             if self.displayed_n_cols != self.actual_n_cols:
   2548                 s += '<p style="background: #fdd; padding: 0.4em;">'

/opt/conda/lib/python3.6/site-packages/hail/table.py in _repr_html_(self)
   1307 
   1308         def _repr_html_(self):
-> 1309             return self._html_str()
   1310 
   1311         def _ascii_str(self):

/opt/conda/lib/python3.6/site-packages/hail/table.py in _html_str(self)
   1397             types = self.types
   1398 
-> 1399             rows, has_more, dtype = self.data()
   1400             fields = list(dtype)
   1401 

/opt/conda/lib/python3.6/site-packages/hail/table.py in data(self)
   1302                 row_dtype = t.row.dtype
   1303                 t = t.select(**{k: hl._showstr(v) for (k, v) in t.row.items()})
-> 1304                 rows, has_more = t._take_n(self.n)
   1305                 self._data = (rows, has_more, row_dtype)
   1306             return self._data

/opt/conda/lib/python3.6/site-packages/hail/table.py in _take_n(self, n)
   1449             has_more = False
   1450         else:
-> 1451             rows = self.take(n + 1)
   1452             has_more = len(rows) > n
   1453             rows = rows[:n]

<decorator-gen-1119> in take(self, n, _localize)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/table.py in take(self, n, _localize)
   2119         """
   2120 
-> 2121         return self.head(n).collect(_localize)
   2122 
   2123     @typecheck_method(n=int)

<decorator-gen-1113> in collect(self, _localize)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/table.py in collect(self, _localize)
   1918         e = construct_expr(rows_ir, hl.tarray(t.row.dtype))
   1919         if _localize:
-> 1920             return Env.backend().execute(e._ir)
   1921         else:
   1922             return e

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     96                 raise HailUserError(message_and_trace) from None
     97 
---> 98             raise e

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     72         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     73         try:
---> 74             result = json.loads(self._jhc.backend().executeJSON(jir))
     75             value = ir.typ._from_json(result['value'])
     76             timings = result['timings']

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     30                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     31                                  'Hail version: %s\n'
---> 32                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     33         except pyspark.sql.utils.CapturedException as e:
     34             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist

SHORTENED DUE TO CHARACTER LIMIT. LOG ATTACHED. 


Hail version: 0.2.61-3c86d3ba497a
Error summary: FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist

I cannot for the life of me figure out what is going on. It seems the rows/rows/parts/ folder is somehow empty? I hope this makes sense, but let me know if it does not.

Also, info on the Jupyter environment is here.

read_MatrixTable_same_environment.log (192.6 KB)
write_MatrixTable.log (178.0 KB)
read_matrixtable.log (197.2 KB)

What is the runtime? I’m assuming this is running on a cluster?

I think the core issue here is that the file system you’re writing to / reading from is a local file system, not network-visible. The issue is similar to the one described here:

The solution is that you should write to a file system that’s network visible – a cloud object store like Google Storage / S3 / Azure Blob Storage, or a network file system like Lustre/HDFS.
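
As a minimal sketch of the difference for the MatrixTable above, assuming the cluster’s HDFS is available (the hdfs:/// path is illustrative, not a RAP-specific recommendation):

# Write to a network-visible filesystem (here HDFS) rather than the driver's
# local disk, so every Spark worker can read the partitions back.
mt.write("hdfs:///GIPR.mt", overwrite=True)

# Reading back on the same cluster works because the path is visible to all
# executors, not just the driver node.
mt = hl.read_matrix_table("hdfs:///GIPR.mt")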

I want to echo Tim here that you really don’t want to do this. Your suggestion involves three transfers of data which will be prohibitively slow for all but the smallest datasets:

  1. Write the data to Hadoop at data/1kg.mt.
  2. Copy the data from Hadoop to your local file system (almost certainly of extremely limited size) at /opt/notebooks/1kg.mt.
  3. Copy the data from your local file system to whatever this dx thing is.

I don’t know much about the DNANexus platform, but if you hope to do any serious work on large datasets, you need to figure out how to use an object store like Google Cloud Storage, Amazon S3, or Azure Blob Storage from within DNANexus.


The root problem is that the dxpy.download_folder command does not work the way you expect. I strongly suspect that it downloads to the Hadoop file system at /opt/notebooks/GIPR.mt. Downloading a huge file to a local file system just doesn’t make any sense.

Can you try the following and report back?

dxpy.download_folder("project-xxx", "/opt/notebooks/GIPR.mt", "/GIPR.mt")
mt = hl.read_matrix_table("/opt/notebooks/GIPR.mt")
mt.show()

If that fails, can you please execute the following commands in your Jupyter notebook?

hadoop fs -ls /
hadoop fs -ls /opt/
hadoop fs -ls /opt/notebooks/
echo ===
ls /
ls /opt/
ls /opt/notebooks/

Hi Dan

Thanks for taking the time! I hope to do some serious work on the upcoming UKB 300k exomes release, but their Material Transfer Agreement prohibits moving the data off of their own platform. It seems I am stuck with DNAnexus.

dxpy.download_folder("project-G493Kx8J860q68yGFpxkz18y", "/opt/notebooks/GIPR.mt", "/GIPR.mt")
mt = hl.read_matrix_table("/opt/notebooks/GIPR2.mt")
mt.show()

---------------------------------------------------------------------------
DXFileError                               Traceback (most recent call last)
<ipython-input-3-01bd0e165d00> in <module>
----> 1 dxpy.download_folder("project-G493Kx8J860q68yGFpxkz18y", "/opt/notebooks/GIPR.mt", "/GIPR.mt")
      2 mt = hl.read_matrix_table("/opt/notebooks/GIPR2.mt")
      3 mt.show()

/opt/conda/lib/python3.6/site-packages/dxpy/bindings/dxfile_functions.py in download_folder(project, destdir, folder, overwrite, chunksize, show_progress, **kwargs)
    720         if os.path.exists(local_filename) and not overwrite:
    721             raise DXFileError(
--> 722                 "Destination file '{}' already exists but no overwrite option is provided".format(local_filename)
    723             )
    724         logger.debug("Downloading '%s/%s' remote file to '%s' location",

DXFileError: Destination file '/opt/notebooks/GIPR.mt/_SUCCESS' already exists but no overwrite option is provided
dxpy.download_folder("project-G493Kx8J860q68yGFpxkz18y", "/opt/notebooks/GIPR.mt", "/GIPR.mt", overwrite = True)
mt = hl.read_matrix_table("/opt/notebooks/GIPR.mt")
mt.show()

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-5-79d48a9c6779> in <module>
      1 dxpy.download_folder("project-G493Kx8J860q68yGFpxkz18y", "/opt/notebooks/GIPR.mt", "/GIPR.mt", overwrite = True)
----> 2 mt = hl.read_matrix_table("/opt/notebooks/GIPR.mt")
      3 mt.show()

<decorator-gen-2099> in read_matrix_table(path, _intervals, _filter_intervals, _drop_cols, _drop_rows)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/methods/impex.py in read_matrix_table(path, _intervals, _filter_intervals, _drop_cols, _drop_rows)
   1996     :class:`.MatrixTable`
   1997     """
-> 1998     for rg_config in Env.backend().load_references_from_dataset(path):
   1999         hl.ReferenceGenome._from_config(rg_config)
   2000 

/opt/conda/lib/python3.6/site-packages/hail/backend/spark_backend.py in load_references_from_dataset(self, path)
    320 
    321     def load_references_from_dataset(self, path):
--> 322         return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
    323 
    324     def from_fasta_file(self, name, fasta_file, index_file, x_contigs, y_contigs, mt_contigs, par):

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     30                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     31                                  'Hail version: %s\n'
---> 32                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     33         except pyspark.sql.utils.CapturedException as e:
     34             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: HailException: MatrixTable and Table files are directories; path '/opt/notebooks/GIPR.mt' is not a directory

Java stack trace:
is.hail.utils.HailException: MatrixTable and Table files are directories; path '/opt/notebooks/GIPR.mt' is not a directory
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:11)
	at is.hail.utils.package$.fatal(package.scala:77)
	at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:32)
	at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:66)
	at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:596)
	at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.61-3c86d3ba497a
Error summary: HailException: MatrixTable and Table files are directories; path '/opt/notebooks/GIPR.mt' is not a directory

Will post the other outputs in a separate reply.

%%bash
hadoop fs -ls /
hadoop fs -ls /opt/
hadoop fs -ls /opt/notebooks/
echo ===
ls /
ls /opt/
ls /opt/notebooks/
Found 1 items
drwxr-xr-x   - root supergroup          0 2021-09-20 14:56 /eventlogs
===
bin
boot
cluster
dev
dxdata-0.30.2-py2.py3-none-any.whl
etc
get-docker.sh
home
install_r_kernel.R
install_r_packages.R
install_sparklyr.R
lib
lib64
media
mnt
opt
proc
root
run
sbin
srv
sys
tmp
usr
var
build_hail.sh
conda
jupyterlab_dx_extension
notebooks
start_jupyterlab.sh
GIPR.mt
Untitled.ipynb
hail-20210920-1445-0.2.61-3c86d3ba497a.log
hail-20210920-1456-0.2.61-3c86d3ba497a.log

ls: `/opt/': No such file or directory
ls: `/opt/notebooks/': No such file or directory

I hope it is not too much trouble. What goes on behind the scenes of Hail is a bit beyond me.

Also, documentation, in case it helps.

Huh, and what is the output of:

hadoop fs -ls /opt/notebooks/GIPR.mt/

?

The second error message indicates that GIPR.mt is, somehow, not a folder.

It does not exist after dxpy.download_folder. I think at this point, the issue might better be directed at UKB support, no?

Agreed. I’m not sure what’s going on. It does seem that the dx downloads are sending the files into Hadoop (which is good!). In the future, I’d plan to use the Hadoop URLs (no file:) instead of the file URLs.

Good luck!

EDIT: Upon further investigation ourselves, I might be wrong about downloading to the Hadoop filesystem.

I have at least ruled it out as a Hail issue. I figure more people are going to run into this when they have to move onto the Research Analysis Platform at the end of September.


Alright, here is the right way to go about it as per DNAnexus support:

import pyspark
import dxpy
import hail as hl

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

hl.init(sc=sc)

spark.sql("CREATE DATABASE test LOCATION  'dnax://'")

id = dxpy.find_one_data_object(name="test")["id"]
hail_matrixtable.write("dnax://" + id + "/matrix.mt")

mt = hl.read_matrix_table("dnax://database-XXX/matrix.mt")

Thank you again for the help.

Hey @jsmadsen,

I am glad you have found a solution! I’m somewhat surprised this works!

Just to be absolutely clear, this line:

mt = hl.read_matrix_table("dnax://database-XXX/matrix.mt")

successfully reads a matrix table? And you can successfully execute the following command?

mt.show()

Are you able to read VCF files stored in dx using the dnax protocol as well?

Hey Dan

It does indeed work. I have attached write_mt.log (279.9 KB) and read_mt.log (217.0 KB) (new environment), in case those are interesting.

It does not seem that I can use dnax for reading Bulk files, but I may just be doing it wrong. At least, they are not included in the dispensed dataset.

vcf_file = ['dnax:///Bulk/Exome sequences/Population level exome OQFE variants, pVCF format/ukb23156_c19_b46_v1.vcf.gz',]

region = [
    hl.parse_locus_interval(
        "[chr19:45668221-chr19:45683722]"
    )
]

mts = hl.import_gvcfs(
    vcf_file,
    partitions=region,
    reference_genome="GRCh38",
    array_elements_required=False,
)

mt = mts[0]
mt.show()
FatalError: RuntimeException: Unexpected null value for databaseId

Java stack trace:
java.lang.RuntimeException: Unexpected null value for databaseId
	at com.dnanexus.hadoop.fs.DNAxFileSystem.checkDatabaseId(DNAxFileSystem.java:3061)
	at com.dnanexus.hadoop.fs.DNAxFileSystem.preparePathForApiRoute(DNAxFileSystem.java:423)
	at com.dnanexus.hadoop.fs.DNAxFileSystem.preparePathForApiRoute(DNAxFileSystem.java:413)
	at com.dnanexus.hadoop.fs.DNAxFileSystem.preparePathForApiRoute(DNAxFileSystem.java:405)
	at com.dnanexus.hadoop.fs.DNAxFileSystem.getFileStatus(DNAxFileSystem.java:2292)
	at com.dnanexus.hadoop.fs.DNAxFileSystem.open(DNAxFileSystem.java:536)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
	at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:83)
	at is.hail.io.fs.FS$class.open(FS.scala:139)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.io.fs.FS$class.open(FS.scala:148)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.io.fs.FS$class.readLines(FS.scala:208)
	at is.hail.io.fs.HadoopFS.readLines(HadoopFS.scala:70)
	at is.hail.io.vcf.LoadVCF$.getHeaderLines(LoadVCF.scala:1278)
	at is.hail.io.vcf.VCFsReader.<init>(LoadVCF.scala:1829)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyImportVCFs$1$$anonfun$apply$11.apply(SparkBackend.scala:502)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyImportVCFs$1$$anonfun$apply$11.apply(SparkBackend.scala:501)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyImportVCFs$1.apply(SparkBackend.scala:501)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyImportVCFs$1.apply(SparkBackend.scala:500)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyImportVCFs(SparkBackend.scala:500)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.61-3c86d3ba497a
Error summary: RuntimeException: Unexpected null value for databaseId

I am also stuck on how to read the VCF files from the Bulk folder in DNAnexus from a cluster environment (as a bucket). Please follow up on this problem. Thanks!

I don’t think the instructions that DNAnexus gave will be viable. The mounted folder is not visible to the workers. I think they are actually loading through the master node, which is very slow.


Hi @henryhmo,

I don’t have access to the DNAnexus environment myself, so I can’t confirm or deny whether this works. I agree: unless every Spark worker node has the VCFs network-mounted or stored in /mnt/project/..., this code will not work.

Hail cannot load data through the master node. If this code works for you, then it is indeed using the workers.
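
As a rough, hedged way to check this from the notebook, here is a sketch that runs a tiny Spark job (using the sc SparkContext created earlier) and asks each executor whether it can see the mounted Bulk path; the path below is the pVCF used earlier in this thread:

import os

path = "/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format/ukb23156_c19_b46_v1.vcf.gz"

# Each task reports whether the mount is visible from the executor it runs on.
visible = sc.parallelize(range(8), 8).map(lambda _: os.path.exists(path)).collect()
print(visible)  # all True: workers can see the mount; any False: they cannot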


My colleague says that the /mnt/project/… path works, but it is excruciatingly slow. I guess the problem is the way they gzip the pVCF files (instead of bgzip). Btw, how can we bgzip a csv/tsv file outside of Hail? Thanks!


Has your colleague found any faster solutions for loading the genotype data? Currently it appears we cannot use the BGEN files with Hail because of the compression type (discussion here), and for pVCFs the compression format is also an issue (it only works with force=True, which is slow and which the docs say is highly discouraged). The only other format is PLINK, which I’ll try.

Your plight is sympathetic – here’s a PR for zstd support:

Should be in a new release in a day or two if all goes well.


You’re looking for the tool called “block gzip”, i.e. bgzip.
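
For example, a minimal sketch from the notebook, assuming the bgzip binary (shipped with htslib/samtools) is installed and data.tsv is a hypothetical local file:

import os
import subprocess

# bgzip compresses in place, producing data.tsv.gz in block-gzip format.
subprocess.run(["bgzip", "data.tsv"], check=True)

# Renaming to .bgz lets Hail's importers recognise the file as block-gzipped
# without needing force_bgz=True.
os.rename("data.tsv.gz", "data.tsv.bgz")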

Have you already tried force_bgz=True? If the files are block-gzip compressed but named .gz, force_bgz=True will treat them as block-gzipped.

I would be shocked if someone generated non-block-compressed pVCFs. They would be unusable for a lot of important use cases!
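
As a hedged sketch of what that looks like for one of the Bulk pVCFs mentioned earlier in this thread (hl.import_vcf is used here for simplicity; force_bgz is also accepted by import_gvcfs, and whether the file: path is visible to the workers is the separate issue discussed above):

# Treat the .gz pVCF as block-gzipped so it can be read in parallel.
mt = hl.import_vcf(
    "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format/ukb23156_c19_b46_v1.vcf.gz",
    reference_genome="GRCh38",
    force_bgz=True,
    array_elements_required=False,
)
mt.show()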


Just ran it again and you’re right
