Dataproc error: java.io.IOException: Failed to create local dir


#1

I ran into a problem trying to import a large VCF into Hail 0.2 devel on a Dataproc cluster. I was using cloudtools to manage the cluster and started it this way:

cluster start cw-hail -p 6 --version devel --spark 2.2.0

My script starts like this:

import hail as hl

hl.init()

ds = hl.import_vcf("gs://my_big_vcf.vcf.gz", force_bgz=True)
print("Loaded " +  ds.count() + " variants")

This failed with the following output:

Submitting to cluster 'cw-hail'...
gcloud command:
gcloud dataproc jobs submit pyspark scripts/hail_import.py \
    --cluster=cw-hail \
    --files= \
    --properties=
Job [fb8c5330ef2945f69fa76c58635f583a] submitted.
Waiting for job output...
Running on Apache Spark version 2.2.1
SparkUI available at http://10.128.0.17:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-39701f053c8a
NOTE: This is a beta version. Interfaces may change
  during the beta period. We recommend pulling
  the latest changes weekly.
[Stage 1:====================================================>(5282 + 1) / 5283]2018-08-28 20:39:38 Hail: INFO: Coerced sorted dataset
Traceback (most recent call last):
  File "/tmp/fb8c5330ef2945f69fa76c58635f583a/hail_import.py", line 8, in <module>
    print("Loaded " +  ds.count() + " variants")
  File "/home/hail/hail.zip/hail/matrixtable.py", line 2101, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/hail/hail.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1535480872324_0648/blockmgr-1193ec43-c523-423b-bd88-e40ee0102cca/24.

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 50 in stage 2.0 failed 20 times, most recent failure: Lost task 50.19 in stage 2.0 (TID 5457, cw-hail-sw-zpnt.c.broad-dsde-methods.internal, executor 2): java.io.IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1535480872324_0648/blockmgr-1193ec43-c523-423b-bd88-e40ee0102cca/24.
	at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
	at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:148)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:592)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:156)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:222)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1089)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.fold(RDD.scala:1083)
	at is.hail.rvd.RVD$class.count(RVD.scala:361)
	at is.hail.rvd.OrderedRVD.count(OrderedRVD.scala:31)
	at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply$mcJ$sp(Interpret.scala:585)
	at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply(Interpret.scala:585)
	at is.hail.expr.ir.Interpret$$anonfun$apply$1.apply(Interpret.scala:585)
	at scala.Option.getOrElse(Option.scala:121)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:585)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:49)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:24)
	at is.hail.variant.MatrixTable.countRows(MatrixTable.scala:650)
	at is.hail.variant.MatrixTable.count(MatrixTable.scala:648)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)java.io.IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1535480872324_0648/blockmgr-1193ec43-c523-423b-bd88-e40ee0102cca/24.
	at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
	at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:148)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:592)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:156)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:222)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Hail version: devel-39701f053c8a
Error summary: IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1535480872324_0648/blockmgr-1193ec43-c523-423b-bd88-e40ee0102cca/24.
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [fb8c5330ef2945f69fa76c58635f583a] entered state [ERROR] while waiting for [DONE].

We started running into similar errors on Dataproc while working on GATK Spark tools last month when Google released Dataproc v. 1.3 – this issue manifested occasionally (perhaps depending on job size) on clusters running Dataproc image 1.2 after Dataproc 1.3 was released. The issue went away when when we upgraded to image 1.3 (which runs Spark 2.3.0).

Has anyone else hit these types of errors and found a workaround? Any plans to provide compiled Hail jars for Spark 2.3.0 and update cloudtools to use Dataproc image 1.3?

Thanks,

Chris


#2

TJ hit this a few weeks ago on 0.1 (he’s one of the few who hasn’t switched :wink:)

I have no idea what’s causing this, and it might be worth reporting to Google.

We do intend to switch out default deployment to Spark 2.3.0 now that there’s a Dataproc image for it. No concrete timeline on that, though.


#3

I’d dismissed it as an isolated instance before that might be tied to something weird our tools were doing, but since it cropped up here as well it probably is a good idea to report it to google. I opened an issue here:

https://issuetracker.google.com/issues/113360059

In the meantime, then, I guess the only way to run a job like this on dataproc is to recompile Hail with Spark 2.3.0?

Chris


#4

If Spark 2.3.0 seems to fix things, then that does seem like a good solution, yeah.

Here’s the dependency versions for that:

You’ll also need to compile on a Linux VM so the native libraries are built properly.