I ran into a problem trying to import a large VCF into Hail 0.2 devel on a Dataproc cluster. I was using cloudtools to manage the cluster and started it this way:
cluster start cw-hail -p 6 --version devel --spark 2.2.0
My script starts like this:
import hail as hl
hl.init()
ds = hl.import_vcf("gs://my_big_vcf.vcf.gz", force_bgz=True)
print("Loaded " + ds.count() + " variants")
This failed with the following output:
Submitting to cluster 'cw-hail'...
gcloud command:
gcloud dataproc jobs submit pyspark scripts/hail_import.py \
--cluster=cw-hail \
--files= \
--properties=
Job [fb8c5330ef2945f69fa76c58635f583a] submitted.
Waiting for job output...
Running on Apache Spark version 2.2.1
SparkUI available at http://10.128.0.17:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-39701f053c8a
NOTE: This is a beta version. Interfaces may change
during the beta period. We recommend pulling
the latest changes weekly.
[Stage 1:====================================================>(5282 + 1) / 5283]
2018-08-28 20:39:38 Hail: INFO: Coerced sorted dataset
Traceback (most recent call last):
  File "/tmp/fb8c5330ef2945f69fa76c58635f583a/hail_import.py", line 8, in <module>
    n_variants, n_samples = ds.count()
  File "/home/hail/hail.zip/hail/matrixtable.py", line 2101, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/hail/hail.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1535480872324_0648/blockmgr-1193ec43-c523-423b-bd88-e40ee0102cca/24.
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 50 in stage 2.0 failed 20 times, most recent failure: Lost task 50.19 in stage 2.0 (TID 5457, cw-hail-sw-zpnt.c.broad-dsde-methods.internal, executor 2): java.io.IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1535480872324_0648/blockmgr-1193ec43-c523-423b-bd88-e40ee0102cca/24.
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:148)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:592)
at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
at scala.Option.map(Option.scala:146)
at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:156)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:222)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1089)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1083)
at is.hail.rvd.RVD$class.count(RVD.scala:361)
at is.hail.rvd.OrderedRVD.count(OrderedRVD.scala:31)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply$mcJ$sp(Interpret.scala:585)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply(Interpret.scala:585)
at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply(Interpret.scala:585)
at scala.Option.getOrElse(Option.scala:121)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:585)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:49)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:24)
at is.hail.variant.MatrixTable.countRows(MatrixTable.scala:650)
at is.hail.variant.MatrixTable.count(MatrixTable.scala:648)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

java.io.IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1535480872324_0648/blockmgr-1193ec43-c523-423b-bd88-e40ee0102cca/24.
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:148)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:592)
at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
at scala.Option.map(Option.scala:146)
at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:156)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:222)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Hail version: devel-39701f053c8a
Error summary: IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1535480872324_0648/blockmgr-1193ec43-c523-423b-bd88-e40ee0102cca/24.
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [fb8c5330ef2945f69fa76c58635f583a] entered state [ERROR] while waiting for [DONE].
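For what it's worth, the fuller pattern I'm working toward is import-then-write to Hail's native format, so the VCF only gets parsed once and everything downstream reads the much faster MatrixTable. A rough sketch of that, assuming the same cluster setup (the gs://my-bucket/... output path is a placeholder):

import hail as hl

hl.init()

# Parse the VCF once; force_bgz tells Hail the .gz file is actually
# block-gzipped, so it can be split and read in parallel.
ds = hl.import_vcf("gs://my_big_vcf.vcf.gz", force_bgz=True)

# Persist to Hail native format so later queries skip the VCF parse.
ds.write("gs://my-bucket/my_big_vcf.mt", overwrite=True)

mt = hl.read_matrix_table("gs://my-bucket/my_big_vcf.mt")
n_variants, n_samples = mt.count()
print("Loaded {} variants and {} samples".format(n_variants, n_samples))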
We started running into similar errors on Dataproc while working on the GATK Spark tools last month, around the time Google released Dataproc 1.3: the issue showed up intermittently (perhaps depending on job size) on clusters still running Dataproc image 1.2 after 1.3 was released. It went away when we upgraded to image 1.3 (which runs Spark 2.3.0).
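In case it's useful for anyone reproducing this, you can check which Spark version a cluster is actually running from inside a Hail session; a quick sketch (relevant because the Hail jar needs to be built against a matching Spark major.minor version):

import hail as hl

hl.init()

# Hail exposes the underlying SparkContext; its version string must
# match the Spark version the Hail jar was compiled against.
print(hl.spark_context().version)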
Has anyone else hit these types of errors and found a workaround? Any plans to provide compiled Hail jars for Spark 2.3.0 and update cloudtools to use Dataproc image 1.3?
Thanks,
Chris